## 1. Introduction

The atmospheric movements can be described by non-linear differential equations that unfortunately have no analytical solution. The numerical methods to solve the atmospheric non-linear differential equations have been developed in different stages. During the 50s, Charney, Fjørtoft and von Neumann (1950) made a 24-hour forecast of 500 hPa geopotential height using a bidimensional quasi-geostrophic equation. After that, in 1956, Philips showed the close relation between cyclone dynamics and the global circulation using a 2-layer model.

At the beginning of the 70s, the global circulation models emerged (Lynch, 2006). These models are based on a set of non-linear differential equations, which are used to approximate the global atmospheric flow, called primitive equations. During this stage the full primitive equations were implemented without any quasi-geostrophic approximation (Williamson, 2007).

During the 80s, the regional and mesoscale numerical models appeared (Athens &Warner, 1978; Mesinger et al., 1988). The evolution of the models is a direct consequence of the increase of computer resources, and the improvement in observational networks and assimilation methods. This evolution has extended the knowledge on the dynamics and atmospheric microphysical processes.

The last period of the numerical weather prediction was initiated in the 90s. The atmosphere-ocean and atmosphere-ocean-soil coupled models, and the spatio-temporal high resolution models allowed the development of analysis and diagnostic techniques for the weather forecasting (Mechoso & Arakawa, 2003).

Until then, the numerical prediction models’ philosophy was based on the deterministic atmospheric behavior. That means, given an atmospheric initial state its evolution can be numerically predicted to give a unique final state. Consequently the efforts of the scientific community were focused on producing the most accurate prediction (Tracton & Kalnay, 1993).

Nevertheless, the formulation of models requires approximations due to unknown variables or known process that cannot be explicitly resolved using the spatio-temporal resolutions a model works with. These processes must be parameterized and this fact generates errors associated with the parameterization used in the model. Although the model could perfectly simulate all the atmospheric processes, it would be impossible to determine a realistic initial state description for all resolutions and in all places using the available observational data (Daley, 1991). Lorenz (1963) showed that small variations on the model initial conditions do not produce a single final solution, but a set of different possible solutions. That is why the predictability of the future atmospheric states is limited in time: the initial condition errors are amplified as the forecast period grows (Lorenz, 1963, 1969).

The traditional deterministic approach gave way to a new paradigm, with richer information than a single solution of the future atmospheric state. The new paradigm includes quantitative information about the uncertainty of the predictive process. The atmospheric non-linear behavior, consequently chaotic, must be treated now in a probabilistic way (Lorenz, 1963).

The improvement of numerical models will permit a better characterization of the atmospheric processes but the models will always have some limitations related to the scales of the simulated processes and the approximations made to solve numerically the equations. Another limitation of the numerical forecasting methods is the lack of observational data with high enough resolution to properly describe the initial state.

Nowadays the observational methods, the assimilation strategies and the own characteristics of the numerical models have inherent limitations that generate uncertainty in the estimation of the possible future atmospheric states. The uncertainty is amplified when the forecast period grows and when the resolution increases. Thus, the probabilistic approach seems an ideal strategy to characterize forecast uncertainty.

The atmospheric state cannot be exactly known. The analysis data always contain an error that only can be estimated. The inaccurate determination of the real atmospheric state drives to the existence of a great number of initial conditions compatible with it. A single model only provides a single solution of the future atmospheric state. The generation of multiple forecasts starting from slightly different but equally-probable initial conditions can characterize the uncertainty of the prediction (Leith, 1974).

The generation of equally probable forecasts starting from multiple realistic initial conditions introduces the probabilistic forecasting concept. A practical approximation to probabilistic forecasting based on meteorological models is the so called ensemble forecasting. The Ensemble Prediction Systems (EPS) are used operationally in several weather and climate prediction centres worldwide. The European Centre for Medium-Range Weather Forecasts (ECMWF; Molteni et al., 1996) or the Meteorological Service of Canada (Pellerin et al., 2003), among others, produce routinely ensemble predictions. These predictions have been demonstrated to be extremely useful on decision making process.

The EPS is a tool for estimating the time evolution of the probability density function viewed as an ensemble of individual selected atmospheric states. Each of these different states is physically plausible. The spread of the states is representative of the prediction error (Toth & Kalnay, 1997).

Several techniques for constructing the ensemble have been developed and applied. One of the first methods proposed for generating an ensemble of initial states is the random Monte Carlo statistical methodology. It was proposed by Leith (1974), Hollingsworth (1980) and Mullen and Baumhefner (1989), among others.

Perturbative methods that depend on the atmospheric flow are also used. These strategies are based on the generation of perturbations in the subspaces where the initial condition errors grow faster. The breeding vectors (Toth & Kalnay, 1993, 1997) or the singular vectors (Buizza & Palmer, 1995; Buizza, 1997; Hamill et al., 2000) are remarkable examples.

There are other perturbative methods that consider the model sub-grid scale errors by means of varying model physical parameterizations (Stensrud et al., 1998; Houtekamer & Mitchell, 1998; Andersson et al., 1998) or using stochastic physics (Buizza et al., 1999).

The combination of multiple model integrations initialized by multiple initial conditions determined by different analysis cycles is another strategy to generate ensembles. Using different assimilation techniques allows characterizing the uncertainties associated to the initial condition and the uncertainty associated to each model (Hou et al., 2001; Palmer et al., 2004). Finally, taking different global models as different initial conditions has been found to provide better performance than any single model system (Kalnay & Ham, 1989; Wobus & Kalnay, 1995; Krishnamurti et al., 1999; Evans et al., 2000).

The technique based on the use of multiple limited area models (LAM) and multiple initial conditions coming from several global models combined with advanced statistical post-processing techniques (Gneiting & Raftery, 2005a) has been tested in the National Centres for Environmental Prediction (NCEP; Hamill & Colucci, 1997, 1998; Stensrud et al., 1999; Du and Tracton, 2001, Wandishin et al., 2001) during the Storm and Mesoscale Ensemble Experiment (SAMEX; Hou et al., 2001). Such probabilistic predictions have also been generated over the Pacific Northwest coast (Grimit & Mass, 2002) and over the Northeast coast (Jones et al., 2007) of the United States.

The combination of multiple models and multiple analyses is part of the operational suite of NCEP (Du & Tracton, 2001) and the basic idea of the short-range EPS of Washington University (Grimit & Mass, 2002) and the Agencia Estatal de Meteorología (AEMET; García-Moya et al., 2011).

## 2. Atmosphere as a chaotic system

### 2.1. Lorenz and non-linearity

Two basic properties can dynamically characterize a chaotic system: the sensitivity to initial conditions and the topologically mixing. Sensitivity to initial conditions implies that infinitesimal changes in the system initial trajectory can lead to big changes in its final trajectory. The Lyapunov exponent (Lyapunov, 1992) gives a measure to this sensitivity to initial conditions as it quantifies the rate of separation of infinitesimally close trajectories. Generally it cannot be calculated analytically and one must use numerical techniques. In Krishnamurthy (1993) it is described how to calculate the Lyapunov exponents of a simple system. The meaning of topological mixing is that the temporal evolution of meteorological quantities in any given region of its phase space will eventually overlap with those of any other given region. This second property is necessary to distinguish between simple unstable systems and chaotic systems.

The classical example provided by Lorenz (1963) is instructive. For this reason we use it in this section, to show briefly some concepts of Chaos Theory. It comes from a simplified model of fluid convection. It consists of a dynamical system with only three degrees of freedom, but it exhibits most of the properties of other more complex chaotic systems. It is forced and dissipative (in contrast to Hamiltonian systems which conserve total energy), non-linear (as its equations contain products of dependent variables) and autonomous (all the coefficients are time independent). The Lorenz (1963) equations are:

where, in this simplified model, *x(t)* is proportional to the intensity of convection, *y(t)* proportional to the maximum temperature difference between up and downward moving fluid portions and *z(t)* is proportional to the stratification change due to convection. All variables are dimensionless, including time. The solution {*x(t), y(t), z(t)*} is unique provided that the initial conditions {*x _{0}, y_{0}, z_{0}*} are given at time

*t*= 0. This means that the system is theoretically deterministic (given a perfect representation of the initial values or the dependent variables and a perfect integration of the non-linear system). The parameters {

*σ, r, b*} are constant within the time integration and different values provide different solutions thus creating a family of solutions of the dynamical system. Lorenz (1963) chose the values

*σ*= 10,

*r*= 28 and

*b*= 8/3 which led to a chaotic solution of the system that is sensitive to small changes in the initial conditions and topological mixing. The dimension of the phase space is equal to the number of dependent variables (three in this case) whereas the dimension of the subspace reached by a given solution can be smaller as is the case of the Lorenz system. This behaviour can be demonstrated from the divergence of the flow:

Which means that an original volume in the phase space *V* contracts in time to *Ve ^{-(σ+r+b)t}*. This behaviour is related to the existence of a bounded attracting set of zero volume with dimension smaller than the phase space. An attractor is a set towards which the dynamical system evolves over time. Geometrically, an attractor can be a point, a curve, a surface or even a complicated set with a fractal structure known as a strange attractor. A solution of the Lorenz equations has an initial transient portion and after that it may be settled on a strange attractor. Figure 1 shows exemplarily a numerical solution of the Lorenz system up to

*t*= 100 from with initial conditions equal to

*x*= 0,

_{0}*y*= 1 and

_{0}*z*= 0 using a backward Euler scheme for the time stepping with

_{0}*dt*= 0.01.

As is shown with more detail in next sections the difficulty of weather forecasting is due either to the sensitivity of the atmosphere evolution to small changes in the initial conditions related to the analysis error and to the sensitivity of the atmospheric differential equations to small differences in the numerical schemes used to find a numerical solution or model error. Figure 2 shows the evolution of the Lorenz system for two different but similar initial conditions. The solutions are very similar up to t = 25 approximately in this case and after that the differences become larger. After t = 30 the value of the variables x and y cannot be predicted although z remains more predictable. In general, the time range within which the system remains predictable, depends on the initial condition, and this characteristic is called the flow dependency of the predictability of the system.

The effect of model errors can be shown by changing slightly the constant parameters *σ, r, b* (Lorenz 1963). In a more complex model, this change would correspond, for example, to a change in the parameterization of the physical processes. Figure 3 shows the temporal evolution of the Lorenz system for two different sets of constant parameters. In this case, the predictability is loose after t = 20 for all the model variables.

In light of these results (Lorenz, 1963), the question about the predictability of the atmosphere was raised for the first time, which has involved the efforts of the meteorological community to quantify it over several decades until today.

### 2.2. The predictability problem

A rigorous analysis of the chaotic properties of such a complex system as the atmosphere can be only achieved in simplified contexts. Atmosphere dynamics has been stated as chaotic and it is well established that there is an effective time barrier beyond which the detailed prediction of the weather may remain impossible (Lorenz, 1969). Predictability, or the degree to which a correct forecast can be made, depends on the spatial and temporal scales (from few hours at the mesoscale to few weeks at the planetary scale) and also on the variable (for instance, surface wind and temperature, precipitation or cloudiness).

Atmospheric chaos, uncertainty, predictability and instability are related concepts. Due to the approximate simulation of atmospheric processes, small errors in the initial conditions and model errors are the two main sources of uncertainties that limit the skill of a single deterministic forecast (Lorenz, 1963). Uncertainty limits the predictability, especially under unstable atmospheric conditions. The atmospheric instabilities related to low predictability conditions are the baroclinic instability at synoptic scales (Buizza & Palmer, 1995) and inertial and potential instabilities (e.g. deep convection) on the mesoscale, among others (Hohenegger & Schär, 2007; Zhang, 2005; Roebber & Reuter, 2002; Emanuel, 1979). This inherent limitation in predictability has led to the concept and development of ensemble prediction systems, which provide probabilistic forecasts to complement the traditional deterministic forecasts (Palmer et al., 1992).

## 3. Ensemble prediction systems

### 3.1. Uncertainty sources in numerical weather prediction

As indicated before, due to the chaotic nature of weather, there are several uncertainty or error sources in the Numerical Weather Prediction (NWP) framework that can grow and limit the predictability (Lorenz, 1963, 1969) of atmospheric flow. Forecast errors can arise due to inaccuracies in the initial condition atmospheric state estimates or due to imperfect data assimilation systems (Initial Conditions forecast error source), inadequacies of the NWP models themselves (Model Formulation forecast error source). Processes that take place at spatial scales that are shorter than the truncation scale of NWP models must be parameterized with sometimes inexact approximations thus giving us another source of forecast error (Parameterization forecast error source). One approach of NWP is to use Limited Area Models (LAMs) where the lateral conditions come from a global NWP models. This procedure is another source of forecast error (Lateral Boundary Conditions forecast error source). So far, the main error sources are: Initial Conditions (IC), Model Formulation, Parameterization and Lateral Boundary Conditions (LBC) error sources.

To the extent that these error sources project onto dynamical instabilities of the chaotic atmospheric system, such error will grow with time and evolve into spatial structures favoured by the atmospheric flow of the day. The inherent atmospheric predictability is thus state-dependent.

To estimate these uncertainties or errors, i.e. the predictability, many operational and scientific centres produce ensemble forecasts (e.g. NCEP, ECMWF, etc.). The idea of using ensemble forecasts has been know for many years (Leith, 1974). Since the early 1990s, many centres generate ensemble forecasts. The methodology that is behind is to run multiple (ensemble) forecast integrations from slightly perturbed IC (IC forecast error source), using multiple models or perturbing model formulation (Model Formulation forecast error source). Adding stochastic physics parameterizations (Ehrendorfer, 1997; Palmer, 2001) or using multiple boundary conditions (Lateral Boundary Conditions forecast error source) among others techniques is described below. The discrete distribution of ensemble forecasts can be interpreted as a forecast Probability Density Function (PDF). If an idealized forecast ensemble can be constructed that properly characterizes all sources of forecast errors, then the forecast PDFs would be reliable (see section 5) and skilful (sharper than the climatological PDF). No further information would be needed to make trustworthy forecast-error predictions, since a perfect PDF is a complete statement of the actual forecast uncertainty.

In practice, estimates of all the forecast-error sources mentioned above are inexact, leading to PDFs from real ensemble forecasts with substantial errors in both of the first two moments (mean and variance). These limitations are particularly pronounced for mesoscale prediction of near-surface weather variables, where large underdispersion results from insufficient ensemble size, inadequate parameterization of sub-grid scale processes, and incomplete or inaccurate knowledge of land surface boundary conditions (Eckel & Mass, 2005). Real ensemble forecast distributions, although generated using incomplete representations of weather forecast error sources, often represent a substantial portion of the true forecast uncertainty.

#### 3.1.1. Initial conditions forecast error source

It is clear that the atmospheric state at a given time is not perfectly known; not only the inherent observational errors alone guarantee this, but also the sparse network of observations worldwide that sample the atmosphere only at limited intervals with inexact results. In addition network density and design can yield errors in regional averages (PaiMazumder & Mölders, 2009). Another contribution to IC forecast error source is the Data Assimilation (DA) system used. Every DA system is affected by the characteristic errors of both the observations incorporated in the analysis and of the short-range model forecast, which is typically used as a background or *first guess* field to be adjusted by new observations. IC errors, however small, are exacerbated by the chaotic dynamics of the atmosphere and consequently grow non-linearly with time.

#### 3.1.2. Model formulation forecast error source

NWP model inadequacy is inevitable given to our inability to represent numerically the governing atmospheric physical laws in full. Contributions to this forecast-error source can be found in the model used which is, of course, a simplified scheme of what really happens in the atmosphere, dynamical formulation, different discretization methods, the numerical method employed to integrate the model and the different horizontal and vertical discretization resolutions used.

The model formulation forecast error in conjunction with another forecast error sources such as parameterizations has been recognized traditionally by operational forecasters in NWP centres. They usually select *a best model of the day* when producing their operational forecasts. This model selection tries to best handle the evolution of the atmosphere depending on the flow the general situation and the season of the year. The selection is driven by the subjective knowledge than some models behave better than others in some situations due to their formulation.

#### 3.1.3. Parameterization forecast error source

There are several parameterized processes in NWP models: those which are taking place at smaller spatial scales than the truncation scale of the NWP model and are not resolved explicitly by the model as convection. Another one is introduced in a simplified way due to computer time limitations like radiation, and finally processes which are not taking into account in the NWP model dynamic part as microphysics in clouds. All theses processes are called sub-grid processes. It is assumed that sub-grid processes are in equilibrium with grid resolved states and so they can be represented statistically from them. A parameterization is the statistical method used when representing the sub-grid processes. Parameterizations are always imperfect representation of atmospheric processes so they always include inherent errors (Tribbia & Baumhefner, 1988; Palmer, 1997). NWP parameterizations have a time and space scale dependency. At small scales, forecast verification is primarily concerned with the locations and amounts of precipitation and other sensible weather parameters, which are often directly affected by the assumptions used to develop the model parameterization schemes for convection and other processes. Moreover, especially for the higher model resolutions, the implicit equilibrium assumption of sub-grid processes with model state could break down being another source of parameterization uncertainty.

#### 3.1.4. Lateral boundary condition forecast error source

The LBC forecast error is only present in LAMs or regional models, which have as inputs lateral boundary values spatially and temporally interpolated from a coarser resolution grid-point or spectral model. So the coarser model errors are translated into LAMs as LBC error source. For instance, a possible configuration for a LAM EPS could include lateral boundary conditions from an ensemble of global forecasts.

### 3.2. Techniques used by global models

For many years operational forecasters, particularly medium range forecasters in meteorological services, have had access to some forecast products coming from global NWP centres other than their own. They routinely compare forecasts from different centres to assess the confidence in the forecasts of their own models, and to determine possible alternative forecasts. This set of available forecasts is often called the *Poor Man's* (Ebert, 2001) ensemble because its production is relatively cheap compared to the cost of developing and running a full EPS such as the ECMWF and NCEP ones. It is called *Ad hoc* ensembles by some other authors. Theses ensembles are cheap and easy to create, but they are not generated in a controlled and systematic approach. Not only are they not calibrated but also some ensemble members may be always quite more skilful than others. The hypothesis of equiprobability of the ensemble members is less guaranteed than others EPS which is a major drawback.

Hoffman and Kalnay (1983) proposed a *time-lagged* method or *Lagged Average Forecast* (LAF) method. The time-lagged method uses forecasts from lagged starting times as ensemble members. These members are easy to construct but they lack any scientific motivation. On the contrary, LAF perturbations are realistic short-term forecast errors. However, LAF ensemble forecasting has the disadvantage that most of the times earlier forecasts are considerably less skilful than later forecasts. This drawback can be partly resolved by either using different weights for different members of the ensemble or by scaling back the larger errors to a reasonable size. This procedure is the basis of the Scaled Lagged Average Forecast (SLAF) technique (Ebisuzaki & Kalnay, 1991).

The *multi-model SuperEnsemble* technique (Krishnamurti et al., 1999) is a powerful method to construct EPS. Several different models outputs are put together with appropriate weights to get a combined estimation of weather parameters. Weights are calculated by square minimization in a period that is called *training period*.

A better solution is to sample the different error sources that were indicated before. Depending on the sampling technique we obtain different methods: a Monte Carlo approach as proposed by Leith (1974), Hollingsworth (1980) and Mullen and Baumhefner (1989) among others. In general, the technique consists of sampling all sources of forecast error, by adding or perturbing any input variable (analysis, initial conditions, boundary conditions etc.) and whatsoever meteorological parameter that is not perfectly known. These perturbations can be generated in different ways. The main limitation of the Monte Carlo approach is the need to perform a high number of perturbations in order to have a proper description of the initial uncertainty, which is usually far from the available computational resources. This limitation leads to reduced sampling by just sampling the leading sources of forecast error due to the complexity and high dimensionality of the system. Reduced sampling identifies active components that will dominate forecast error growth.

IC forecast error source have a dominant effect. To sample it several techniques have been available. One of them is the initial perturbations method, which consists of adding small perturbations to the initial analysis, such as NCEP’s *breed mode* method (Toth & Kalnay, 1993, 1997; Tracton & Kalnay, 1993). The breed mode method is based on the idea that the analysis created by the data assimilation scheme used will accumulate growing errors. As it can be seen in Figure 4 breeding vectors give a sampling of the growing analysis error: a random perturbation is added to the analysis, evolved in time by integrating the forecast model, rescaled and reintroduced as a perturbation. After several cycles only the fastest growing errors remain.

An alternative to the breed mode is the ECMWF’s *singular vector* method (Palmer et al., 1992; Molteni et al 1996) which tries to identify the dynamically most unstable regions of the atmosphere by calculating where small initial uncertainties would affect a 48 hour forecast most rapidly. It needs an adjoint model. Singular vectors give a sampling of the perturbations that produce the fastest linear growth in the future. There are only a relative small number of directions in the phase-space of the atmospheric system along which the most important processes occur. Maximum growth is measured in terms of total energy. The adjoint of the tangent forward propagator with respect to the total norm is defined, and the singular vectors (the fastest growing perturbations) are computed by solving an eigenvalue problem. Singular vector method is schematically described in Figure 5.

In addition to the breeding and singular vector methods there are *Ensemble Transform Kalman Filter* technique (ETKF; Bishop et al., 2001; Wang & Bishop, 2003) and *Ensemble Data Assimilation* (EDA; Houtekamer, 1996; Buizza, 2008). ETKF is similar to the breeding method except that the rescaling factor is replaced by a transformation matrix. It produces an improved ensemble dispersion growth. It is used at the UK Meteorological Office. In EDA, an ensemble of assimilations is created from different analyses which have been generated by randomly perturbing the observations in a manner consistent with observation error statistics.

Model forecast error source, i.e. model formulation and parameterization error sources together, is another component to take into account. To represent model uncertainty several approaches have been used: the m*ulti-model* approach (e.g. DEMETER; ENSEMBLES; TIGGE; Krishnamurti et al, 1999), *multi-parameterizations* or *multi-physics* approach (Houtekamer, 1996), *stochastic parameterizations* (Buizza et al., 1999; Lin & Neelin, 2002), *multi-parameter* approach (Murphy et al., 2004), and *Stochastic-Kinetic Energy-Backscatter* approach (Shutts & Palmer, 2004; Shutts, 2005).

So, in order to sample model forecast error an ensemble forecasts are produced by using different numerical models (multi-model approach). The multi-model approach implies equiprobable members which is not always the case. An alternative method for sampling model forecast errors is using different physical packages (multi-physics approach). Another approach is the *stochastic parameterization* approach applied at ECMWF (Buizza et al., 1999). It is based upon applying stochastically perturbing the total parameterized tendencies with a multiplicative noise. The multi-parameter approach tries to take into account the significant uncertainty in some parameters in NWP models, for instance, by using different values in each ensemble member.

Finally, the *Stochastic-Kinetic Energy-Backscatter* approach addresses a missing physical process, the upscale energy cascade energy from the grid scale to synoptic scales lost due to the excessive dissipation energy in NWP models.

### 3.3. Techniques used by limited area models

Not only global models can be used in building EPS, but also Limited Area Models (LAM) can be used to create LAM EPS, normally used for the short range. Error sources in LAM EPS are the same as in global EPS, but LAM models require lateral boundary conditions that update the weather situation regularly throughout the integration. These lateral boundary conditions introduce a main source of uncertainty in LAM ensembles. Both LBCs and ICs give their contribution to the spread and skill of the system (Clark et al., 2009). All the techniques discussed so far can be applied to generate LAM EPS. A very popular generating technique is the downscaling of global ensemble forecasts. This technique consists of using the selected global ensemble members (chosen by clustering) as initial and boundary conditions for a limited area ensemble system. The difficulty is that the perturbations generated from the global EPS are usually effective only on the medium range and large scales. Therefore they are not likely optimal for short range ensemble forecasts. Another technique for sampling lateral boundaries forecast error source is *multi-boundary* technique. In the multi-boundary technique, several different global models supply the lateral boundary conditions needed by the LAM model. One example of the use of the multi-boundary technique is the AEMET Short Range Ensemble Prediction System (AEMET-SREPS; García-Moya et al., 2011). AEMET-SREPS uses the multi-boundary method in addition to the multi-model method. It is built by using a set of LAMs and a set of deterministic global models that supply the initial and boundary conditions. The system is focused on short-range forecast and has been developed to help in the forecast of extreme weather events (gales, heavy precipitation and snow storm) and provides forecasts with good reliability, resolution and discrimination consistently with the analysis in the large-scale flow.

## 4. Uncertainty representation and weather forecasting products

### 4.1. Uncertainty representation

In statistics, uncertainty is represented by means of the Probability Distribution Function (PDF). Let us consider a random variable *x* that we do not know, a priori, anything about its nature. The question is whether we can infer something about it. Let us take *n* different values of *x* that belong to the same population. When we construct the histogram of these values, we obtain an approximation of its PDF. As an example, we could think of *x* as the mean monthly temperature of April at a surface observation station. Then the population would be *the mean monthly temperatures of April at that station*. If we restrict us to only the period 1981-2010, then the *n*=30 values of *x* would form the sample space.

The PDF gives us information about the behaviour of the random variable *x*. For example, let us take the normal or Gaussian PDF of which the analytical formula is:

Here *σ* is the standard deviation and *μ* is the mean. Figure 6 shows this distribution graphically. Now we can infer something about the nature of the variable *x*. From Figure 6 we can say that there is a value *μ* around which all the random variables are distributed symmetrically. Likewise, *σ* is a measure of the standard deviation of *x* from its mean. We can think of *σ* as a mean error (or uncertainty) we would have if we approximated any possible value of *x* by *μ*. In resume, the PDF gives us a depiction of all the possible values of *x* and their associated probabilities of occurrence. This procedure results in an explicit and quantitative way of representing the uncertainty of a random variable.

In the case of a Numerical Weather Prediction (NWP) system there are about 10^{8} different random variables *x*, each one corresponding to each degree of freedom of the model. This fact makes it computationally unfeasible to integrate the Liouville equation (sometimes referred to as Fokker-Plank equation when random processes are included to account for, for example, model error) that describes the time evolution of a PDF. A practical way to resolve this problem is to use an EPS. An ensemble prediction tries to estimate the uncertainty of the forecast by discretizing the forecast PDF for each model parameter at each grid point in *N* values corresponding to the *N* ensemble members. As an example, Figure 7 presents the PDF of a 60 hours two metres (2m) temperature forecast of the AEMET-SREPS for the grid point closest to Sevilla, Spain. This ensemble has 25 members, but in this case, there was one that did not integrate properly, so *N*=24. It is easy to see that the more members the ensemble has, the higher is the resolution of the PDF.

In ensemble prediction, a simplified way of representing the uncertainty of the forecast is the *spread* (Toth & Kalnay, 1997); the standard deviation *σ* (4) of the PDF quantifies how much the ensemble members deviate, on average, from the mean, and it is often used as a measure of the spread (Wilks, 2006):

Where *f*_{i} is the forecasted variable by member *i*, and *em* is the ensemble mean, e.g., the mean of the *N* forecasts. This parameter can be calculated on each grid point. In the case of the 2m temperature forecast (Figure 7)
*em* and the spread are 32.3 and 1.5 ^{o}C, respectively. The latter can be interpreted as an estimate of the error of the deterministic forecast so that the higher the spread, the more uncertain is the forecast. Other measures of spread, more robust or resistant, can be alternatively used, e.g. the interquartile range (Wilks, 2006).

### 4.2. Raw products

Ensembles are composed of members that are deterministic predictions, and allow providing individual deterministic information (García-Moya et al, 2011). This information can either help the traditional staff to understand ensembles, and can provide support in the probabilistic interpretation. Far beyond this, ensembles provide their most useful information when we look at them as intrinsically probabilistic prediction systems. In this context, most of the ensemble outputs reflect this probabilistic nature. Probability is a rich and reasonable model to describe and understand many aspects of the physical world, but the interpretation of ensemble outputs must be learned and used carefully beyond the straightforward interpretation, because (especially for *deterministic forecasters*) some interpretations can be in conflict with *common sense*. Given a grid point with N forecast values x'_{i} (for an ensemble with N members), we call raw products when only these N values are used straightforward. Three basic examples of raw products exist.

#### 4.2.1. Stamps

The deterministic outputs for all the ensemble members can be plotted as usual meteorological charts (see for instance Figure 8 with MSLP and T850 for the ECMWF EPS). These usual *postage stamp charts* comprise the control forecast (if there is any, top left), the perturbed members (below) and the corresponding high resolution deterministic forecast (if exists, beside the control). The difficulty is that the forecaster would have to deal with an amount of information: 51 scenarios in addition to the high resolution deterministic forecast.

#### 4.2.2. Plumes

In a given location, we can provide N forecast values for that place (either by bi-linear interpolation or some other fine process which could account for height variability). Moreover, we can plot the evolution with forecast step for all the N members, i.e. we would plot N curves. Like on the stamp charts, the control forecast can be highlighted and the high-resolution deterministic forecast can also be plotted (e.g., Figure 9). The difficulty is similar to that one of the stamps namely the necessity to deal with such an amount of information. For a specific location and parameter, plumes can help the forecast guidance and, in fact, often provide information about the uncertainty and general trends.

#### 4.2.3. Spaghetti

As a third example, we show the spaghetti charts. For a given (often dynamical) field, it is rather impossible to overlay N member charts. But picking a selected isoline of interest, we can plot one line per member and, thus, the whole plot would contain N lines. Typically, the control is highlighted, and a higher resolution deterministic forecast can be included (e.g. Figure 10). As plumes, this kind of plot can help the forecaster to provide information about the uncertainty.

All of these raw outputs share the same shortcoming: the inherent difficulty in the forecast guidance for handling the huge amount of information they provide. This issue is addressed using derived probabilistic outputs that compact this information naturally.

### 4.3. Derived probabilistic products

Probabilistic outputs are derived computationally from the PDF representation, which is assumed to be provided by the ensemble members. These products reflect the probabilistic nature of the ensemble, visually and conceptually. They provide explicit, quantitative and detailed information about uncertainty, and this fact is a real breaking point with respect to deterministic model products. They address the issue of providing compact information in this natural way. Ensembles nowadays can provide ideal complementary information to a higher resolution deterministic model.

For a given grid point, there are *N* forecast values *x' _{i}* for an ensemble with

*N*members. Without further information about the skill of the members, we assume Laplace principle of equal probability, dealing in this case with a discrete PDF. An estimate of the forecast probability p of exceeding a threshold t is given by the well-know formula where the indicator

*I(x'*is usually defined as

_{i})*I(x'*=1 if

_{i})*x'*>

_{i}*t*,

*I(x'*=0 otherwise (Ferro 2007b):

_{i})The corresponding inverse is the computation of percentiles, i.e., for a given probability *p*, what is the actual forecast value *x’* for which *p* = P(*x’*). By adding further information, we can improve the PDF (e.g. by calibration) and the computation would be different. By taking this simple discrete model, we can compute the probabilities of exceeding thresholds, the percentiles for given probabilities, the summary statistics (e.g. mean and standard deviation), etc.

#### 4.3.1. Ensemble mean and spread charts

The ensemble mean (the arithmetic mean of all the ensemble members) is not always a feasible meteorological situation because it is obtained as a result of a statistical operation, not from a numerical model (Buizza and Palmer, 1997). So, it is strongly discouraging in forecast guidance to use the ensemble mean without special care, if at all (García-Moya et al, 2011). However, the ensemble mean is often plotted in charts together with the standard deviation (the latter as a measure of spread), to help with the understanding of the atmospheric flow (e.g. Figure 11). The standard deviation is sometimes normalized.

#### 4.3.2. Probability maps and percentile maps

Given a binary event (e.g. precipitation forecast > 5 mm/6h) we can plot the spatial distribution of the forecast probabilities that the EPS provides (see Figure 12). Similar plots can be made for the percentiles. These maps provide the forecasters with useful guidance by showing them where it is more probable for an event of interest to occur (e.g. representative precipitation that exceeds 5 mm/6h).

#### 4.3.3. EPS-grams

Box-plots (Wilks, 2006) and similar graphs give a quick, visual and simple representation of a distribution of numbers, a discrete PDF. Extending this idea by including the evolution with forecast time of the main weather parameters at a given location, we obtain plots that are *meteogram based* and often called *EPS-grams*. The building brick is the box-plot: it displays the median, minimum, maximum, percentiles 25 and 75 and sometimes also percentiles 10 and 90. Box-plots are displayed for a series of forecast steps. This procedure is typically applied for the more sensitive parameters to forecast e.g. cloud cover, precipitation, ten metres (10m) wind speed and 2m temperature (e.g. Figure 13). Special care must be taken with EPS-grams interpretation (Persson & Grazzini, 2005) by comparing the location point and the nearest grid-points: distance, land/sea contrast and height must be checked in order to properly use the information that EPS-grams provide. Anyway, the EPS-grams are the most popular and probably useful plots to forecast the weather in a location by taking into account the uncertainties.

#### 4.3.4. Clustering

A natural way of examining the number of atmospheric scenarios that EPS provide (e.g. stamps) is by similarity: one can gather scenarios in groups that could lead to similar weather conditions (Ferranti, 2010). This process can be done by *eye-ball* or can be carried out computationally by using a clustering algorithm (Ferranti, 2010) that fits the corresponding meteorological requirements. This procedure is often expensive and requires an extra task of defining the similarity criterion, but forecast guidance can improve substantially with the use of clusters. Examples of algorithms used are the Ward algorithm (Ferranti, 2010) and the tubing; both have been used operationally at AEMET (Figure 14). Here clusters have proved to be a very useful guidance in medium range forecasts by summarizing the more important and distinct scenarios (Ferranti, 2010).

#### 4.3.5. Extreme forecast index

Extreme events are not always severe, but severe events are often extremes. An index of *extreme forecast* can be computed using the model climatology as a reference, rather than the climatology of observations (Lalaurette, 2003). When observations are used, the forecast is not really prone to fall in the tail of the climatological distribution, and this fact is addressed by using the model climatology. The Extreme Forecast Index (EFI; Lalaurette, 2003) is a quantitative measure of how extreme is the EPS forecast when compared with the model climatology. The EFI can be plotted in a chart (Figure 15), and this chart is especially useful for weather parameters. Thus the EFI is used by forecasters as an early warning tool to highlight where severe events could happen.

### 4.4. Interpretation for weather forecasting

As an example for the application of probabilistic forecasting we present a real case of extreme winds that is fully described in Escribà et al. (2010). This section is not intended to be a detailed manual of the utilization of probabilistic products in operational forecasting. More extend and concise information can be found for example at www.ecmwf.int.

At 00 UTC on 24 January 2009 an explosive cyclogenesis in the Gulf of Vizcaya reached its maximum intensity with an observed surface pressures less than 970 hPa on its center. During the cyclone’s path through the south of France strong westerly and north-westerly winds occurred over the Iberian Peninsula (> 150 km/h). These winds led to eight casualties in Catalunya, the north-east region of Spain.

In Figure 16 are represented three probabilistic forecasting products, the ensemble mean, the ensemble spread and the probability of having wind speeds greater than 15 m/s (54 km/h). The weather parameter analyzed is 10m wind speed. These fields correspond to the H+60 prediction of the AEMET-SREPS initialized at 00 UTC on 22 January 2009. The ECMWF reanalysis is also shown as verification.

Even though wind speed values plotted in the maps are not extreme, they do not exceed 20 m/s or 72 km/h, this fact has to be taken carefully because we are representing mean values of wind at a forecast time, i.e. a mean over a time interval equal to the last forecast time step of the forecasting model (which in this case is around 5 minutes). As a first approximation we can estimate the wind gust (maximum wind) as twice the value of the mean wind (this factor can be roughly obtained comparing temporal series of mean wind and wind gusts from observation ground stations). In this case, such an approximation would give us extreme winds of about 150 km/h, similar to those observed.

The ensemble mean (Figure 16) can be thought as a skilful deterministic forecast that comes from the ensemble. When we compare it with the verification we can highlight there is a good agreement in the overall patterns. Looking in more detail we can select three zones where there is more discrepancy: a.) south of Catalunya (yellow ellipse), b.) Aragon and Valencia (white ellipse) and c.) south-east of France (green ellipse). It is especially interesting to analyze zone a.), where the casualties occurred. The question is whether the ensemble can estimate in some way the error in the prediction; the spread is expected to give information on this.

The ensemble spread gives us the areas of more uncertainty in the prediction and is a measure of it. For zones a.) and c.) it has values around 5 m/s, which are considerable. In the case of a.), the spread estimates properly the prediction error, giving valuable information. When making the forecast, we would say that it is possible to have wind speeds greater than 16 m/s. This is not the case in zone c.), where the spread is not enough to explain the discrepancy between the ensemble mean prediction and the verification. In this case, the probabilistic forecast is badly displaced to the east. In zone b.), a lower spread (around 3 m/s) has also the ability to describe the prediction error.

Finally, the probability forecast enforces the general forecast by determining the areas of maximum confidence of occurrence and quantifying this confidence in a number. In this sense, we can say that the probability of having mean wind speed greater than 15 m/s (54 km/h) or wind gusts of more than 100 km/h in zone a.) is between 30% and 70%, which is more than the term *possible*.

## 5. Ensemble forecast verification

NWP models must be compared with a good representation of the observed atmosphere. This process is often called *forecast verification*, and raises a number of concepts and issues. With verification, we assess the quality and value of forecasts (Murphy, 1993), by using metrics or measures often called scores. By providing detailed information about forecast performance, verification can help in model improvement (developers) and forecast guidance (forecasters). Comprehensive descriptions of standard verification methods can be found in Wilks, 2006 and in Jolliffe & Stephenson, 2003), whereas in Candille & Talagrand, 2005 and in Stensrud and Yussouf, 2007 there is a thorough study of probabilistic forecasts, including ensemble forecasts.

Different frameworks are available for verification. Observations (ob) and forecasts (fc) can be compared, either *whole set to whole set* or *fc to ob* by using their statistical summary properties (measures-oriented approach), as distributions (distributions-oriented), as features (features-oriented), etc. In any case, to compare observations and models is like comparing apples and oranges: they are often different kinds of atmospheric representations, and we need to transform one or both of them into *comparable* representations. This step involves non-trivial issues like interpolation, representativeness, correlation, noise, etc.

An ordinary example of the difficult issue in comparing apples and oranges is the performance assessment of quantitative precipitation forecasts (QPF) from a deterministic model. European meteorological offices provide to the ECMWF 24-hour accumulated precipitation reports from their high-density rain gauges networks. Forecast values are regularly spaced, while observations are not. One comparison method is to interpolate forecast values to observation points (Rodwell, 2010). Special care should be taken with the impact of spatial density of observations and the potential lack of statistical consistency due to spatial dependence between close ones. To address these issues, one can compute an observed quantitative precipitation estimate (QPE) using a simple up-scaling technique (Ghelli & Lalaurette, 2000; Cherubini et al., 2002) whereby stations are assigned to model grid-boxes and then *averaged* to produce one value to be assigned to the corresponding grid-point. Such a grid value can be for instance a weighted average, and can be compared to the model precipitation forecast, which is also relative to the grid-box areal average, as they both refer to the same spatial scales. Whether to choose interpolation or up-scaling methods it depends on the case (see Figure 17).

Moreover, in the comparison of the performance of QPF from two different forecasting models, further issues arise. The grid spacing of the two models might be different. If observations are gridded to the finer resolution, then the coarser model might be penalized. On the other hand, if observations are gridded to the coarser resolution, the comparison can be fair but the higher resolution model is not given a chance. How to compare the way in which both models represent structures at their own scale is a non-trivial issue. PaiMazumder & Mölders (2009) assessed the impact of network density and design on regional averages using real sites and model simulations over Russia. They find that generally, the real networks underestimate regional averages of sea level pressure, wind speed, and precipitation while overestimate 2m temperature, downward shortwave radiation and soil temperature.

### 5.1. A first requirement: Deterministic performance of ensemble members

The assessment of the deterministic quality of the ensemble members is a first requirement in the development of an EPS. When the quality of the ensemble members is similar, then any member can be weighted equally in the computation of a probabilistic forecasts, i.e. they are assumed to be equiprobable. Once provided this individual quality, then some other properties can be considered (see below). In addition, the ensemble mean should show a better deterministic performance than any individual member in terms of Root Mean Square Error (*RMSE*) (Leith, 1974; Murphy, 1988; Whitaker & Loughe, 1998; Ziehmann, 2000).

As a visual representation, either time-series or evolutions with forecast step for bias and *RMSE* are usually depicted for synoptic parameters (e.g. Z500) for each member and also for the ensemble mean. As an example, Figure 18 shows *BIAS* and *RMSE* evolution with forecast length for the different ensemble members and the ensemble mean.

### 5.2. Large scale flow: Statistical consistency with the observations/analysis

As a probabilistic forecast, an EPS must be statistically consistent with the observations in the large scale flow given the EPS domain is large enough. At this scale, the model analyses of upper-air dynamical fields (e.g. 500 hPa geopotential height, Z500) can be used for comparison, by providing a larger sample and covering the whole integration domain, and so by giving no priority to land areas where the density of observations is higher. Verification against SYNOP/TEMP observations is expected to give worse but qualitatively similar results. This statistical consistency can be assessed in several ways; two methods are shown here: the rank histogram and the spread-error diagram.

On each grid-point, either the analysis or each of the ensemble member values are assumed to be independent realizations of the same atmospheric process and hence equally probable. Here, the rank of the analysis is an integer number according to the position of the analysis value in the sorted list of forecast values. The rank histogram (Anderson, 1996; Hamill & Colucci, 1997, 1998; Hamill, 2000; Candille & Talagrand, 2005) can be used to check if the verifying observation is statistically indistinguishable from the set of forecast values (*reliable*). Such a system must show an approximately flat-shaped rank histogram (Figure 19 middle). Some outliers (“U” shape, Figure 19 bottom left) would indicate sub-dispersion that are typical in current EPS operational systems, while overdispersion would correspond to the opposite (inverted “U” shape, Figure 19 bottom right). The bias would produce rank histograms with positive (negative) or negative (positive) slope (bias) (see Figure 19 top left and top right, respectively).

Furthermore, the ensemble spread (often measured by the standard deviation with respect to the ensemble mean or the control forecast if possible) and the error of the ensemble (measured by the *RMSE* with respect to the analysis for either the control forecast or the ensemble mean) should show a linear relationship and a similar growth rate with respect to forecast step (Buizza & Palmer, 1997; Whitaker & Loughe, 1998). An EPS is expected to sample the uncertainties of NWP models (ensemble *spread*), as well as to give explicit and quantitative information about the predictability of the atmosphere (represented by the ensemble error). According to this, an ensemble can be statistically consistent (*calibrated*) or, on the other hand, can be underdispersive (quite common in operational ensembles) or overdispersive (e.g. Figure 20).

### 5.3. Weather parameters: Binary events

For the performance assessment of weather parameters (e.g. precipitation, 2m temperature, 10m wind), with larger variability in space and time, the use of observations is encouraged, as they are not as smooth as upper-air field analyses. In the distributions-oriented framework (for a detailed description see Joliffe & Stephenson, 2003; Wilks, 2006), the performance of an EPS can be done measuring its response to a set of binary events (occurrence / non-occurrence, e.g. to exceed a threshold). The EPS behavior in this context is described by different properties: reliability, resolution, sharpness and discrimination. Brier (Skill) Score together with *ROC* curves provide a framework to give measures to theses properties. Moreover, the benefits of using a forecasting system can be shown with the so-called relative value, a quantity that depends on the forecast user's cost/loss ratio. In this framework, the joint distribution of forecasts and observations gives complete support for the computation of scores.

#### 5.3.1. Formal framework

By considering a binary event *X* of the parameter x exceeding a threshold *t* ({*X: x*>*t*} e.g. rain over 5 mm), we compare a forecast probability *p* (number of member values exceeding *t*, *p* = {0/*N*, 1/*N*,... *N*/*N*}) and the corresponding a posteriori observation probability *p _{0}*, which is usually taken as

*p*= {0,1}, depending on whether the event took place or not. However, if observational uncertainty was considered, then

_{0}*p*could take any value in the interval [0,1]. In this probability space, a natural extension of the

_{0}*RMSE*is the Brier Score, defined as

*BS*= E [(

*p – p*)²], where E[] is the expectation value over all forecast-observation pairs.

_{0}*BS*is negatively oriented and

*BS=*0 if and only if

*p=p*for any pair, while

_{0}*BS*=1 indicates the worst possible forecast.

The *joint distribution* (Murphy, 1988) of forecasts and observations can be represented by two distributions that describe completely the system performance: *g*(*p*) and *p’*(*p*), where *g*(*p*) is the forecast probability distribution (relative frequency of forecasts with probability *p*) and *p’*(*p*) gives the conditional observation distribution (relative frequency of forecasts with probability *p* and for which the event did happen). The expectation values can be computed through a partition in probability space according to the possible forecast probability values, i.e., the number of members (Santos & Ghelli, 2011):

The base rate *p _{c}*=E[

*p’*(

*p*)]=E[E

_{p}[

*p*]] is the frequency of occurrence of the event. Using these two distributions, a decomposition of the

_{0}*BS*can be done (Jolliffe & Stephenson, 2003; Candille & Talagrand, 2005):

the components are meaningful: *Reliability* (*B _{rel}*) measures the straight correspondence between probabilistic forecasts

*p*and conditional observation frequencies

*p’*(

*p*), and can be improved by re-labeling of probability intervals (a process that should be called

*re-labeling*calibration to avoid confusion).

*Resolution*(

*B*) gives a measure of variability of conditional observations

_{res}*p’*(

*p*) around the base rate, and cannot be improved by re-labeling, thus it gives an upper bound for inherent skill. For a perfectly reliable system the reliability component vanishes (

*p=p’*(

*p*) for all cases), and the resolution is equal to the

*sharpness*, a measure of variability of the forecast probability distributions, or how often different forecast probabilities occur without taking into account the observations. The uncertainty component (

*B*) is the variance of the probabilistic observations

_{unc}*p*and corresponds to the value of the

_{0}*BS*using the sample climatology as forecast (i.e. issuing always a forecast probability

*p*=

*p*a system is perfectly reliable

_{c}*B*=0, and shows no resolution

_{rel}*B*=0); it depends only on the observations and is usually taken as a reference for the Brier Skill Score (

_{res}*BSS*), if special care is taken with the interpretation (Mason, 2004). The same decomposition can be applied to the

*BSS*(Candille & Talagrand, 2005):

To give a summary of performance measures comprising the response to several thresholds, the Ranked Probability Score (and the corresponding skill score) can be used, either the discrete or the continuous version (Hersbach, 2000).

A complementary measure of ensemble performance is the *discrimination* or ability of a system to distinguish the ocurrence or non-ocurrence of a binary event *X* given the observations according to Signal Detection Theory (Kharin & Zwiers, 2003). The discrimination is related to the hit rate (*H*) and the false alarm rate (*F*) for a given base rate *p _{c}* (Candille & Talagrand, 2008):

As a measure of discrimination, the area *A* under the Relative Operating Characteristics (*ROC*) curve (*H* versus *F*) is often used, with *A*=0.5 for the sample climatology (no skill) and *A*=1 for a perfect forecast (Santos & Ghelli, 2011). *ROC* Skill Area (*RSA*) can be used instead: if *A* is the area under the *ROC* curve, *RSA*=2*A*-1 gives values in the interval [-1,1], 1 for a perfect forecast, 0 for no skill and -1 for a potentially perfect forecast after calibration. Discrimination is related to resolution, but they do not measure exactly the same property and, especially if observational uncertainty is present, they can show different and indicative behaviour. While *BSS* is potentially insensitive to extreme events, *RSA* is not (Gutiérrez et al., 2004), whereas *RSA* can be insensitive to some kinds of forecast biases (Kharin & Zwiers, 2003).

Another interesting complementary property, beyond performance, is the Economic* Relative Value* (*RV;*
Richardson, 2000). By crossing a *contingency table* (forecast yes/no by occurrence yes/no of the event) with an *expenses matrix* (preventive action yes/no by occurrence yes/no, that includes the cost *C* of the action and the loss *L* in case of occurrence), it can be computed the relative economic reduction (*RV*) of using the forecast comparing with the sample climatology. *RV* depends on the base rate *p _{c}* and also on

*C*and

*L*, i.e. it depends also on the user.

#### 5.3.2. Visual presentation

The properties described above can be visualized in several ways. *Sharpness histogram*: *g(p)* distribution is put in a histogram along probability intervals. A predicting system with good sharpness would issue forecast probabilities close to 0 and 1. Sharpness is often plotted as an inset on the attributes diagram. *Attributes diagram*: *p’(p)* distribution is plotted on the Y axis against probability intervals *p* on the X axis. Several straight lines are plotted as reference: the diagonal (representing perfect reliability), the no-resolution line (corresponding to the sample climatology as forecast), and the no-skill line (forecasts with no skill w.r.t. the climatology, i.e. *B=B _{unc}*). Figure 21 (left) illustrates this. Some examples of forecasting systems are idealized in Figure 21 (right).

*BSS decomposition time series*: *BSS* (and its components *BSS _{rel}* and

*BSS*) time series are plotted in curves. As

_{res}*BSS*is positively oriented (the larger the BSS, the better is the performance) and its components are not, special care must be taken (see Figure 22) (Santos and Ghelli, 2011).

*ROC curves*: the hit rate (

*Y*axis) is plotted against the false alarm rate (

*X*axis) (see Figure 23 left). Here a deterministic forecast is compared to an EPS.

*RV envelopes*: the

*RV*can be plotted on the

*Y*axis, the cost-loss ratio C/L on the

*X*axis and provide one curve for a deterministic model. For an

*N*members ensemble,

*N*curves can be plotted (we can plot

*RV*for any probability interval in the partition described above), or eventually, the envelope that covers all the

*N*curves. The user can decide on the

*C/L*intervals for optimal use of the forecasting system. In this sense, an EPS can be also compared with a deterministic model (see Figure 23, right).

### 5.4. Verification issues and prospective

Computational resources, object-oriented programming languages and data-base improvements give a boost to forecast verification. In the last decades, the forecast verification community has started to address some important issues that have an impact in the interpretation of verification scores and introduction of new conceptions. Either EPS specific or not, some of these issues are of large interest and hence are introduced here.

#### 5.4.1. Pooling versus stratification

To compute statistically significant scores, samples must be large (many fc-ob pairs) and the corresponding significance tests should be applied (e.g. t-Student). On the other hand, mixing non-homogeneous sub-samples (e.g. different seasons) can lead to misleading performance information. In this context, the dimensionality problem (Murphy, 1988) can be a critical stumble in practice (Candille & Talagrand, 2008): computing completely consistent (from the strictly mathematical point of view) scores can turn out to be an infeasible task, and this issue often leads to a practical compromise: large samples are created without mixing different ones, according to possibilities. E.g.: the splitting of seasonal behavior that could be hidden in the overall average.

#### 5.4.2. Flow-dependent verification

Flow-dependent sample stratification can improve other traditional ways of stratifying (e.g. seasonal), and nowadays can be tackled with clustering techniques (Ferranti & Corti, 2010).

#### 5.4.3. Sampling uncertainty

Given the dimensionality problem and pooling-stratification compromise, the available fc-ob pairs are, in practice, relatively small samples from the all possible realizations of the model and observing systems. Thus, the scores computed are only sample measures of the population quantities, and there is a sampling uncertainty related to this process (PaiMazumder & Mölders, 2009). Any verification report should include this uncertainty, with error bars, confidence intervals, etc. (see e.g. Efron & Tibshirani, 1997; Bradley et al., 2008).

#### 5.4.4. Spatial scales of forecasts and observations

Spatial scales are a key point (see examples above). For example, the double penalty (Ebert & Gallus, 2009) is a well-know related issue. The relatively recent development of new methods that account for spatial patterns, e.g. CRA (Ebert & Gallus, 2009), MODE (Davis et al., 2009), SAL (Wernli et al., 2008) are still under research, but show promising results and can lead to a framework of diagnostic verification (closer to subjective verification in the sense that provides information that can better help model developers and weather forecasters). For a comprehensive overview, see (Gilleland et al., 2009).

#### 5.4.5. Extreme and severe weather

Extreme and severe weather are often introduced together, but they are not the same. Extreme events are rare events, with low base rates and belong to the tail of the corresponding climatological distributions. Severe events are those that have an impact (human and material) on society. Severe weather verification must include extra information from outside the meteorological context, whereas verification of extreme events is still in an early stage (Ferro, 2007a; Casati et al., 2008), and some alternative scores are under research, e.g. the Extreme Dependency Score (EDS; Stephenson et al., 2008; Ghelli & Primo, 2009).

#### 5.4.6. Observational uncertainty

In forecast verification it has been traditionally assumed that the observation error is negligible when compared with the forecast error. This assumption can be consistent for longer forecast ranges, when the forecast error is much larger than the observation uncertainty. Several studies have extended the verification problem to a more general framework, in which observations are described together with their uncertainty. They show sometimes a surprising result: traditional scores generally underestimate EPS performance (e.g., Saetra et al., 2004; Candille & Talagrand, 2008; Santos & Ghelli, 2011).

#### 5.4.7. Ensemble size

Differences in ensemble size can have an impact on performance assessment (e.g. compare a 16 members EPS with a 51 members EPS). The difference in size would, in principle, give better performance to the larger EPS a fact that should be at least taken into account. This issue is addressed by various authors (e.g., Buizza & Palmer, 1998; Ferro, 2007b; Ferro et al., 2008).

## 6. Statistical post-processing

EPS evidence systematic errors like do the deterministic NWP models. Calibration is the process of correction of the ensemble PDF to adjust it to the actual (and unknown) forecast uncertainty. The main point of calibration techniques is to use the information of the former prediction's skill to correct the current probabilistic forecast.

Different methodologies have been proposed recently to build calibrated probabilistic forecasts from ensembles, including Bayesian Model Averaging (Raftery et al., 2005), Logistic Regression (Hamill et al., 2008) and Extended Logistic Regression (Wilks, 2009), Non-homogeneous Gaussian Regression (Gneiting et al., 2005b) and Ensemble Dressing (Roulston & Smith, 2003; Wang & Bishop, 2005).

### 6.1. Bayesian Model Averaging (BMA)

Bayesian Model Averaging (BMA) is a statistical post-processing method that generates calibrated and sharp predictive PDFs from EPS (Raftery et al., 2005). The BMA predictive PDF of a weather variable is a weighted average of PDFs centred on the individual bias-corrected forecasts. The weights reproduce the predictive skill of the ensemble member over a training period.

If the forecast errors are approximately Gaussian distributed such as surface temperature or sea level pressure, BMA can be applied (e.g. Raftery et al., 2005; Wilson et al., 2007). For non-Gaussian error distributions using a mixture of skewed PDFs allows to extend the BMA methodology to this kind of weather parameters; A combination of point mass at zero and a power-transformed gamma distribution, for instance, can be applied to quantitative precipitation (Sloughter et al., 2007) and a mixture of gamma distributions with different shapes and scale parameters can be used to improve wind speed probabilistic forecasts (Sloughter et al., 2010).

The BMA predictive PDF is a summation of weighted PDFs of each individual ensemble member (Leamer, 1978; Kass & Raftery, 1995; Hoeting et al., 1999):

Where *f _{i}* is the ensemble member deterministic forecast,

*y*represents the forecasted variable, and

*are the characteristic parameters of the*

_{i}*i*th individual PDF from the

*i*th ensemble member. Each of these individual PDFs associated to each ensemble member is weighted based on the ensemble member’s relative performance during the training period. The weights

*w*are probabilities, i.e. non-negative and add up to 1. The BMA weights

_{i}*w*and the parameters

_{i}*are estimated by maximum likelihood (Wilks, 2006) using the training data. This estimate cannot be done analytically so an expectation maximization (EM) iterative algorithm is used (Dempster et al., 1977; McLachlan & Krishnan, 1997).*

_{I}Figure 24 illustrates the potentiality of BMA as a method for ensemble calibration. Continuous ranked probability score (CRPS) performance score (Hersbach, 2000) is represented for various ensemble predictions of surface temperature. CRPS index is understood to be for a probabilistic forecast the equivalent of the mean absolute error for a deterministic forecast. The different curves correspond to the raw ensemble and to the six ensembles calibrated using different statistical techniques. It is straightforward to see that the ensembles corrected with BMA perform better than any of the others.

### 6.2. Logistic Regression (LR) and Extended LR

The logistic regression and extended logistic regression techniques are described in detail in Wilks (2009). Logistic Regression (LR) approximates the Cumulative Distribution Function (CDF) of the predicted parameter *y* by the following equation (Wilks, 2009):

Where *q* is a selected prediction threshold and:

Being *{x _{1, …,} x_{n}}* the regression predictors and

*θ = {b*the unknowns to be estimated during the training process. Equation (11) has a characteristic

_{0}, b_{1, …,}b_{n}}*S*shape with values bounded on the

*0 < CDF(q) < 1*interval. The name logistic comes to the fact that the regression equation is linear on the logistic scale:

Typical predictors for LR when calibrating ensemble predictions are ensemble mean, ensemble spread or a function of them (Hamill et al., 2008). As thresholds *q*, using the representative climatological quantiles of the meteorological parameter *y* ensures a statistical uniformity in the process of regression.

According to Wilks (2009) *θ* unknowns are generally estimated using maximum likelihood (Wilks, 2006), but other estimation techniques could give better performance, for example the minimization of the continuous ranked probability score (Hersbach, 2000).

By construction Equation (13) is fitted separately for every threshold *q* and this fact involves several problems. We consider for example the parameter precipitation and two thresholds, *q _{1}* = 2mm and

*q*= 10mm. After the training we have two different regression equations for each threshold,

_{2}*f*and

_{1}(x)*f*, which in general are not parallel. The non-parallelism of the functions implies that for some values of the predictors

_{2}(x)*{x*these curves will cross leading to the unrealistic result of

_{1, …,}x_{n}}*CDF(q*. Another problem arises when we want to estimate the

_{1}) > CDF(q_{2})*CDF*of a threshold for which regressions have not been fitted. This process requires some kind of interpolation of

*CDFs*which is not statistically coherent. Finally, the more equations are to be fitted the more unknowns have to be estimated.

To overcome these problems, Wilks (2009) proposed a new approach to Equation (11) that consists of including a function *g(q)* in the exponent which increases with threshold *q*:

Thus, a unique regression estimation for any value of *q* is needed, which implies the parallelism of the functions *f(x)* for the different thresholds (the unknowns* {b _{0}, b_{1, …,} b_{n}}* are always the same). This approximation is known as Extended Logistic Regression (ELR).

It is important to point out as an advantage that LR (and ELR) has no statistical restriction to be used with non-Gaussian parameters such as precipitation or wind. At the same time this technique can be applied to ensembles whose members are non-distinguishable.

As an example of ELR, Schmeits and Kok (2010) calibrated ECMWF ensemble predictions of precipitation over an area that covers Netherlands (see the article for details). After studying the performances of different shapes for *g(q)* and the predictors, they select :

Being *x _{1}* the ensemble mean of the square root of the predicted precipitation. Then the equation to be regressed is:

Figure 25 represents a reliability diagram which compares the forecast probabilities of having precipitation lower or equal than 5 mm with the observed frequencies of this event. For a perfect reliable forecast all points would be in the diagonal, so in this case ELR calibration clearly improves the performance of the raw forecast.

### 6.3. Non-homogeneous gaussian regression

The non-homogeneous Gaussian Regression (NGR) technique was proposed by Gneiting et al. (2005b). In its general form, the predictive PDF estimated by NGR is assumed to be a perfect Gaussian with the mean value being a bias-corrected weighted average of the ensemble members forecasts and the variance a linear function of the ensemble variance. That is (Gneiting et al., 2005b):

Here *y* is the weather quantity to be predicted, *s* is the ensemble spread and *{x _{1}, …, x_{k}}* are the

*k*ensemble predictions of parameter

*y*.

*θ = (a, b*are the unknowns of this expression that have to be estimated by regression using the training data, which consist in a series of former forecast-observation pairs. The term non-homogeneous refers to the fact that the variances of the regression errors are not the same for all the values of

_{1}, …, b_{k}, c, d)*{x*(they depend on

_{1}, …, x_{k}}*s)*as it is assumed in linear regression.

This way of representing the predictive PDF allows a natural understanding of the regression coefficients. Coefficient *a* is a bias-correction of the ensemble weighted mean. The weights *{b _{1}, …, b_{k}}* can be negative but for an easier interpretation Gneiting et al. (2005b) recommended constraining them to be non-negative which is done during the training process. On one hand side, they represent the performance of the ensemble members over the training period, with respect to the other members. On the other hand, they also reflect the correlations between ensemble members. Gneiting et al. (2005b) showed how a five members ensemble with three members using the same global data as initial and boundary conditions (so highly correlated) is automatically reduced to a three members ensemble after NGR calibration leaving only non-zero weights for the members that use different data as initial conditions. Variance coefficients

*c*and

*d*are constrained to be non-negative and they are a measure of the spread-skill relationship. For large values of

*d*NGR variance is correlated to ensemble variance (s

*) so a significant correlation of the spread with the skill of the ensemble weighted mean is obtained. If spread and skill are independent of each other,*

^{2}*d*values will be negligible and it is

*c*what represents the variance of the NGR calibrated mean.

Compared to other techniques (e.g. BMA; Raftery, 2005), NGR has the advantage that can be applied to ensembles whose members are non-distinguishable, such as the ECMWF ensemble prediction system (Molteni et al., 1996). In this case, NGR is simplified by constraining the *b _{1}* = …

*= b*coefficients to be equal, which at the same time agrees with the assumption of equiprobability of members. Now the analytical PDF (17) is reduced to:

_{k}Where *x _{m}* is the ensemble mean.

Due to the Gaussian shape of the analytical expression (Eq. 17), this technique is expected to be especially useful for weather parameters that have Gaussian distributions such as temperature or pressure. Figure 26 represents a real experiment by Hagedorn et al. (2008) where GFS (Toth & Kalnay, 1997), ECMWF and a multi-model (combining GFS and ECMWF) ensemble predictions of surface temperature are calibrated using NGR all else being equal. Continuous Ranked Probability Skill Score (CRPSS) is used as a performance measure for probabilistic forecasts (Jolliffe & Stephenson, 2003). The higher its values (closer to 1), the better the probabilistic forecast will be. When CRPSS reaches 0 it means that the probabilistic forecast has the same skill than the climatology (Hagedorn et al., 2008; see for details). In this case, it is clear the benefit of calibrating the ensembles with NGR.

A classical and widely extended technique for estimating the *θ* unknowns is maximum likelihood (Wilks, 2006). Nevertheless Gneiting et al. (2005b) demonstrated, for NGR probabilistic forecasts of temperature and surface pressure, that estimating *θ* by minimization of the continuous ranked probability score (Hersbach, 2000) gives clearly a better calibration of the PDF.

The adequate length of the training period for an operational approach is not unique. Raftery et al. (2005) showed that by using the previous 25 days the prediction intervals are the narrowest maintaining the right coverage of the verification (see the article for details). However, Wilks and Hamill (2007) used 45 days. An experimental study of the optimal length for one or more specific locations is desirable.

### 6.4. Ensemble dressing

The dressing technique is a statistical post-processing technique based on combining each member of a dynamical ensemble with its own statistical error ensemble.

Roulston & Smith (2003) proposed the use of a simple resampling scheme called *best member* method: individual members of an ensemble are *dressed* with an error distribution derived from the error made by the *best* member of the ensemble. The best member is defined as the member that is closer to the verification and the uncertainty of it is the one that is added to the rest of members. Identification of the best member is performed by means of multivariate forecasts although only univariate forecasts are dressed. The number of forecast variables required is estimated by looking at the fraction of the false best members (FBM). These FBM are defined using a distance on the vector space of the verification. If the N ensemble members are described as d-dimensional vectors, *x _{i} (i=1,…,N)* and

*y*is the verification, the normalized distance is defined as (Roulston and Smith, 2003):

Where *d* is the number of forecast variables being considered and * _{k}* is the standard deviation of the

*k*th component of the forecast vector. The best member is the one which the minimum

Wang and Bishop (2005) showed by stochastic simulations that the best-member method can lead both to underdispersive or overdispersive ensembles. In addition to this, Wilks (2006) demonstrated that the dressed ensemble cannot be reliable. In order to alleviate these problems a new multivariate dressing method based on the second moment constraint is proposed. Ensemble bias is removed before building training statistics for the dressing kernel assuming that each ensemble is drawn for stochastic process. To dress the ensemble, statistical perturbations ε are added to each ensemble member. The covariance matrix *Q* is defined as (Wang & Bishop, 2005):

Where the columns of *E* contain the eigenvectors of *Q* and the diagonal matrix contains the corresponding eigenvalues. Positive eigenvalues indicate that the ensemble is underdispersive in the directions of the corresponding eigenvectors and thus dressing is necessary. The *Q* matrix can be expressed as a function of the ensemble member forecasts and the verification values. The new dressing perturbation generator is defined as (Wang & Bishop, 2005):

Where *, i = 1, 2,..., I,* are the eigenvectors corresponding to the positive eigenvalues. The coefficients *c _{i}* are univariate random variables generated from a normal distribution with mean equal to zero and variance equal to the

*i*th positive eigenvalue of

*Q*.

The comparison of the original best-member dressing method with the second moment constraint dressing method confirms that the spread of the best-member dressed ensemble is indeed underdispersive, or even becomes overdispersive, depending on factors such as the undressed ensemble size, how underdispersive the undressed ensemble is and the nature of the subspace from which the best member is identified. On the other hand, for underdispersive ensembles, the second moment constraint dressing kernel correction always returns about the right amount of dispersion.

Although underdispersion is a common characteristic of an EPS, some variables have an overdispersive behavior (Feddersen & Andersen, 2005). Fortín et al. (2006) proposed to dress and weight each member differently to improve the reliability of the forecast and to correct the variable under or overdispersed. This method is very similar to BMA (Raftery et al., 2005) and today has only been applied to one-dimensional variables.

## 7. A brief description of some current state-of-the-art ensemble prediction systems

The Australian Bureau of Meteorology (BoM), Brazilian Centro de Previsao do Tempo e Estudos Climatico (CPTEC), China Meteorological Administration (CMA), the European Centre for Medium-Range Weather Forecasting (ECMWF), Japan Meteorological Agency (JMA), Korea Meteorological Administration (KMA), Meteorological Service of Canada (MSC), Météo-France (MF), UK Met Office (UKMO) and US National Centres for Environmental Prediction (NCEP), among others, run ensemble prediction systems.

The ECMWF-EPS is a global ensemble that is optimized for the medium range. It uses the singular vector technique (Ensemble Data Assimilation is under research and is starting to be used operationally together with SV) for providing the set of initial perturbations, as well as stochastic parameterizations to account for model errors. The ECMWF-EPS comprises 51 members with 62 vertical levels and a spectral horizontal resolution of T639.

In the NCEP ensemble the initial perturbations are obtained by the Ensemble Transform with Rescaling (ETR) technique. It also uses stochastic perturbations to account for model errors. It runs 20 members with 28 levels and a spectral horizontal resolution of T126.

The MetOffice ensemble, called MOGREPS, works with 24 members. It uses the ETKF (Ensemble Transform Kalman Filter) technique with scaling of perturbations using radiosonde and ATOVS observations (Bowler et al., 2008). The horizontal resolution is 0.83 degrees in longitude and 0.56 in latitude, and the number of vertical levels is 70.

The Japan Meteorological Agency (JMA) ensemble has 51 members. It uses singular vectors for the calculation of the initial perturbations and stochastic representation of physical parameterizations for accounting model error. The number of vertical levels is 60 levels and the spectral horizontal resolution is T319.

A source of information for studying global ensembles is the TIGGE project, which is a key component of the THORPEX Interactive Grand Global Ensemble, a World Weather Research

Programme for improving the accuracy of high-impact weather forecasts. In the TIGGE project, the forecasts of 10 global ensembles are archived, permitting the comparison of methods and results.

Limited area ensembles are developed for higher resolutions (nowadays from 2 km up to 25 km) and shorter time ranges (from 18 to 72 hours) than those of global ensembles. When a model can explicitly resolve convection (due to its characteristics and high resolution configuration) it can represent more realistically typical precipitation patterns in the forecast field. However, convection forecasts (as well as other small scale processes) are very limited by their deterministic predictability (which is small due to its chaotic behaviour). Therefore, even in the short forecast range of only 24 hours, the prediction of details in the convection (such as location and timing of a thunderstorm) are usually very uncertain. Limited area ensembles can add information to deterministic high resolution forecasts and for this reason many operational weather centres are developing limited area ensembles.

Limited area ensembles running in operational centres are based on high resolution non-hydrostatic models, such as ALADIN (developed and maintained by a consortium of 16 National Meteorological Services, led by Météo-France), COSMO (developed by a consortium of seven NMS, led by Deutscher Wetterdienst), WRF (developed in the United States of America by NCAR, NOAA and others), HARMONIE (a model which shares code with ALADIN, developed by a consortium of 10 NMS) and the Unified Model (MetOffice). These ensembles can have an assimilation cycle which uses a wide class of meteorological observations, in some cases including radar data trying to represent as much as possible actual precipitation processes. They need lateral boundary conditions typically coming from global ensembles or coarser limited area ensembles.

Just to mention two examples of limited area ensembles, we resume below the characteristics of the COSMO-DE ensemble (Deutscher Wetterdienst) and the AEMET-SREPS. Other operational limited area ensembles are: the Norwegian targeted EPS LAMEPS (Frogner & Iversen, 2002), the Hungarian LAMEPS based on ALADIN (Hágel & Horányi, 2007), the multi-model GLAMEPS (ALADIN and HIRLAM consortium) and the limited area version of MOGREPS (MetOffice).

The COSMO-DE Ensemble is based on the convection resolving model COSMO-DE. It produces 2.8 km grid forecasts up to 18 hours, runs every 3 hours and assimilates estimated precipitation rates from RADAR.

The AEMET-SREPS uses the multi-model and multi-boundaries techniques for sampling initial, lateral conditions and model errors. It uses five global models for initial and lateral conditions and five limited area models running with every lateral and initial conditions (MM5, UM, HIRLAM, COSMO and HRM) thus producing 25 members. The horizontal resolution is 25 km and produces forecasts up to 72 hours twice every day. It covers a wide area (includes North Atlantic, Europe and North of Africa).

## 8. Conclusions and future directions

The most reliable and skilful theoretical forecasts from the current observed state of the atmosphere can be obtained through a Probability Distribution Function (PDF) which describes a comprehensive set of possible future states. The only practically feasible methodology to assess a forecasted PDF is using an Ensemble Prediction System (EPS), that is, a PDF sample of different but equally plausible Numerical Weather Prediction (NWP) forecasts (EPS members). Furthermore from the practical point of view EPS forecasts have been showed to be more reliable and skilful than a forecast from one single NWP model, even when the latter has a higher resolution.

This better performance is due to the fact that the set of non-linear equations which describe the future evolution of the atmosphere have a chaotic behaviour. This means that any uncertainty in the prediction process like two slightly different initial states could grow and lead to quite significantly distinct forecasted states. As a consequence, the predictability associated to any forecasted atmospheric state is always spatially and temporally limited but depending in each forecast on the uncertainty magnitude and the particular atmospheric situation.

The sources of errors and uncertainties which limit the predictability are mainly due to: a.) inaccuracies in the initial atmospheric state, estimated from available observations with their associated observational error and limited representativeness and imperfect assimilation systems, b.) inadequacies of the NWP models, related to dynamical NWP model formulation and physical parameterizations and c.) for Limited Area Model (LAM) EPSs, approximations and errors from Lateral Boundary Conditions. So any reliable and skilful EPS has to take into consideration all of these uncertainty sources by using different methodologies. Some of these methodologies are, for instance (and respectively): a.) singular vectors, bred vectors, Ensemble Transform Kalman Filter (ETKF) and Ensemble Data Assimilation (EDA), b.) multi-model, multi-physics or multi-parameterizations, multi-parameters and stochastic parameterizations, and c.) multi-boundaries.

In addition, as any forecasting system, EPS quality and value, that is the overall performance, has to be evaluated through objective verification, assessing the necessary and complementary set of properties (with the corresponding tools): consistency (rank histogram), reliability (spread-error and attributes diagrams), resolution (resolution component of Brier Score), discrimination (Relative Operating Characteristic curves), sharpness (Sharpness Histogram), skill (Brier Skill Score) and relative value (Relative Value Diagram). The main goal of verification, apart from assessing EPS forecast performance, is that EPS developers can be leaded to what and how to improve the EPS.

A number of EPS products, which take into account the forecast probabilities and the predictability concept, can serve the forecast guidance. The EPS products can be raw ones (e.g. stamps, plumes or spaghettis) or derived (e.g. ensemble mean and spread charts, probability and percentile maps, EPS-grams, clusters and extreme forecast indexes). Before the products production it would be advisable to calibrate the EPS in order to remove its systematic errors. Some statistical post-processing techniques for ensemble calibration are Bayesian Model Averaging, Logistic and Extended Logistic Regression, Non-homogeneous Gaussian Regression and Ensemble Dressing.

Regarding to future scenarios, the weather forecast process is expected to improve at all temporal scales, from the first hours to the climatic scales, and to be done at finer spatial scales, due to the fact of having more and better observations, better NWP models, increased supercomputer resources, etc. Even though it is expected a reduction of errors and uncertainties, they will be always present and limit the predictability. This means that the majority of ideas and methodologies explained in this chapter are going to be useful for the next generation of weather forecasting systems, although new ones are expected to be developed. Then some of the future directions of work are outlined.

From the point of view of the current experience in EPS development, the multi-model and multi-analysis (from independent Global NWP models) approach, has showed to have better performance than any theoretical methodology based on a single model. This fact means that there is not enough knowledge about the different model uncertainties and that there can even be other unknown sources of error. Anyway, the latter approach is expected to be used intensively in the next EPS generations and even to overcome the former approach.

On the other hand, the better the EPS performance seems to be, the more number of error sources are considered and even the more methodologies are used together. Thus, future EPS developments are going to follow this line, partly because it increases the spread counteracting the common EPS shortcoming, the underdispersion. Anyway particular attention has to be paid not to increase spuriously the EPS spread, that is, without increasing the skill.

As it has been mentioned before, other EPS developments will come from having better assimilation techniques, better generation of initial conditions and better methodologies to tackle model errors and uncertainties. This fact means that, in a foreseeable future, EPSs will become more complex.

Finally in the current and the next decade there will be an important increase in the horizontal and vertical resolutions of the EPSs linked to the NWP model developments for smaller grid spacing. Thus, spatial resolutions of the next Global and LAM EPSs generations are going to be, respectively, close to the non-hydrostatic scale (e.g. 8-16 km), and about the meso-gamma or convection-resolving scale (e.g. 1-4 km). One consequence of this will be that verification will have to evolve to an objected-oriented way (e.g. SAL or MODE techniques). Because of uncertainties grow faster as the resolved scales are smaller, due to a more intrinsic chaotic nature, another consequence will be that the only feasible methodology to forecast the weather at these scales will be ensemble forecasting.