Esmvaltool: Multimodel mean and median statistics

Created on 8 Dec 2017  ·  40Comments  ·  Source: ESMValGroup/ESMValTool

When calculating multi-model statistics, we have to consider the possibility that the models span different time ranges. Consider for example the following case:

MODEL1 2000 2010
MODEL2 1998 2009
MODEL3 2002 2012

I would suggest introducing two options:

  • overlap: the multimodel statistics is computed only in the overlapping time range (2002-2009)
  • full: the multimodel statistics is computed over the maximum time range (1998-2012), using the available model in each year (only MODEL2 in 1998-1999, MODEL1 and MODEL2 in 2000-2001, all models in 2002-2009, etc.)

The resulting statistics will need to be added to the models list as an additional model called "multi-model-mean" or "multi-model-median".

Comments and suggestions welcome!

Most helpful comment

I think the cmorization makes sure that all models have the same time units (@jvegasbsc is that correct?)

Yes, that's correct. We can have problems with calendars when working with daily or hourly frequencies, but I really don know how to treat them. For example, how can we calculate a multimodel mean in daily frequency for no_leap and standard calendar models? What can we do with 29th February? And with 360_day models it gets even worse

All 40 comments

sure thing, Mattia, I'll start working on Monday, but I am taking holidays Tue 12 Dec thru Fri 15 Dec, so will continue on the week starting Monday 18 Dec. Cheers, V

Suggestion:
There could be an additional output to the multimodel statistics that communicate the "sample size" , i.e. a timeseries of how many models contributed to a particular multimodel statistic. This would enable the diagnostic developers to set a threshold like: "only consider multimodel mean when there is more than one model".

hence the 'multi' bit in multimodel :dancer:

We would also need an exclude option, to make sure that, e.g., observations are not considered in the mean/median. Something like exclude = [ERA-Interim, NCEP].

would it make sense to always exclude observations, or is that too magic-like?

Regardless, perhaps it is safer to specify an include (e.g. include = project:CMIP5) to filter on only CMIP5 models...

Usually you would exclude the reference model(s), as specified in the variables dict.
So ideally you would write exclude = reference_model, but maybe an explicit declaration exclude = [model1, model2] is less confusing.

Regardless, perhaps it is safer to specify an include (e.g. include = project:CMIP5) to filter on only CMIP5 models...

This would also be confusing, especially if you mix models from different projects.

Hey guys, so basically here's the code for multimodel mean/median -- but I have two questions pour vous:

logger = logging.getLogger(__name__)

def get_overlap(cubes):
"""get discrete time overlaps"""

all_times = [cube.coord('time').points for cube in cubes]
bounds = [range(int(np.min(b)), int(np.max(b)) + 1) for b in all_times]
time_pts = reduce(np.intersect1d, (i for i in bounds))
t1 = time_pts[0]
t2 = time_pts[-1]
return t1, t2

def multi_model_mean(cubes, span, exclude):
"""Compute multi-model mean"""

cubes = [c for c in cubes]
if exclude != 'None':
    for cube0 in cubes:
        if 'model_id' in cube0.attributes.keys():
            if cube0.attributes['model_id'] in exclude:
                cubes.remove(cube0)

Q1: how to identify the model we want to exclude if there's no model_id in attributes?

means = []
medians = []
tx1, tx2 = get_overlap(cubes)
for cube in cubes:
    for coord in cube.coords():
        if coord.standard_name == 'time':
            if span == 'overlap':
                all_times = list(cube.coord('time').points)
                a1 = min(all_times, key=lambda x:abs(x - tx1))
                a2 = min(all_times, key=lambda x:abs(x - tx2))
                i1 = all_times.index(a1)
                i2 = all_times.index(a2)
                cmean = np.mean(cube.data[i1:i2,:,:])
                cmedian = np.median(cube.data[i1:i2,:,:])
                means.append(cmean)
                medians.append(cmedian)
            elif span == 'full':
                cmean = np.mean(cube.data[:,:,:])
                cmedian = np.median(cube.data[:,:,:])
                means.append(cmean)
                medians.append(cmedian)

logger.info("Global means: %s" % str(means))
logger.info("Global medians: %s" % str(medians))

Q2: what do you want to do with there? -- write a cube?

Cheers,
V

Q1: how to identify the model we want to exclude if there's no model_id in attributes?

Can't we use the same method used to identify the reference_model?

Q2: what do you want to do with there? -- write a cube?

The result shall be treated as a new model called "MultiModelMean" and/or "MultiModelMedian" and the models dictionary shall be updated accordingly.

Hi V.

Here's how I would do it:

Q1: how to identify the model we want to exclude if there's no model_id in attributes?

The argument exclude of the multi_model_mean/median function should be a dictionary containing cube attribute name/value pairs of that identify cubes to exclude, e.g. exclude={'model': ['NCEP', 'ERA-Interim']}. This leaves the job of choosing the criteria for selection to the caller of the function, in our case the namelist parser.

Q2: what do you want to do with there? -- write a cube?

I think it would be good if you could add a filename argument to the function call.
If the cubes would not need further processing, I would simply save it to filename.
If the cubes would need further processing, do the following: add the cube with the mean/median to the list of cubes you got as input and return that list. I changed the way cubes are saved a bit in the latest PR #188 (they are saved to the filename specified in the _filename attribute). Maybe you could add such an attribute to the mean/median cube(s)? There is no need to actually save the cubes, they will be passed on to the next preprocessor function by the preprocessor runner and the very last function is save_cubes, which saves all cubes in the list to the filename specified in their _filename attribute.

Right - cheers! But it's still not straightforward to me how to exclude the cube with the name from exclude: {model: ['BLAH', 'BLAHBLA']} if eg the cube has not 'BLAH' as attribute (cube.attributes has no key 'model_id') -- where is the bit that does the similar thing for reference_model as suggested by Mattia.

Also, the filename should be in the namelist or just as a function argument?

Cheers,
V

overlap:

def get_overlap(cubes):
"""get discrete time overlaps"""

all_times = [cube.coord('time').points for cube in cubes]
bounds = [range(int(np.min(b)), int(np.max(b)) + 1) for b in all_times]
time_pts = reduce(np.intersect1d, (i for i in bounds))
t1 = time_pts[0]
t2 = time_pts[-1]
return t1, t2

multi function:

def multi_model_mean(cubes, span, exclude):
"""Compute multi-model mean"""

cubes = [c for c in cubes]
proc_cubes = [c for c in cubes]

if exclude != 'None':
    for cube0 in proc_cubes:
        if 'model_id' in cube0.attributes.keys():
            if cube0.attributes['model_id'] in exclude:
                proc_cubes.remove(cube0)
if len(proc_cubes) > 1:

    means = []
    medians = []
    tx1, tx2 = get_overlap(cubes)

    for cube in proc_cubes:
        for coord in cube.coords():
            if coord.standard_name == 'time':
                if span == 'overlap':
                    all_times = list(cube.coord('time').points)
                    a1 = min(all_times, key=lambda x:abs(x - tx1))
                    a2 = min(all_times, key=lambda x:abs(x - tx2))
                    i1 = all_times.index(a1)
                    i2 = all_times.index(a2)
                    cmean = np.mean(cube.data[i1:i2,:,:])
                    cmedian = np.median(cube.data[i1:i2,:,:])
                    means.append(cmean)
                    medians.append(cmedian)
                elif span == 'full':
                    cmean = np.mean(cube.data)
                    cmedian = np.median(cube.data)
                    means.append(cmean)
                    medians.append(cmedian)

    logger.info("Global means: %s" % str(means))
    means_cube = iris.cube.Cube(means, long_name='means')
    cubes.append(means_cube)
    logger.info("Global medians: %s" % str(medians))
    medians_cube = iris.cube.Cube(medians, long_name='medians')
    cubes.append(medians_cube)

else:

    logger.info("Single model in list: will not compute statistics.")

cubes = tuple(cubes)

return cubes

Cheers,
V

Sweet! This looks awesome!
Cheers,
V

On Tue, Jan 16, 2018 at 4:06 PM, Bouwe Andela notifications@github.com
wrote:

Integrated this function in this branch: https://github.com/
ESMValGroup/ESMValTool/tree/REFACTORING_multi_model_mean


You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub
https://github.com/ESMValGroup/ESMValTool/issues/168#issuecomment-358012467,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AbpCo7qHg7ywU499PcWtDID3G89aDDslks5tLMkJgaJpZM4Q7Nrp
.

--
Dr. Valeriu Predoi
Computational Scientist for UKESM Core Team
Department of Meteorology, University of Reading
Earley Gate, Office 1U08
READING, RG6 6BB
United Kingdom
Mobile number: 07847416092

"If one day you be questioning your ability to come up with professional
results, think of this: Noah's ark was built by farmers whereas the Titanic
was crafted by skilled engineers"

Just tested it and it writes the nc file with medians and means from either
full or overlap spans, it excludes as well. Cheers, Bouwe!

There is a question though -- median values can be (quite frequently)
masked, Mattia, how do you want to handle these, just get all the numerical
values only and find the median? That would mean medians at different
positions in the data though

Cheers,
V

On Tue, Jan 16, 2018 at 4:21 PM, Valeriu Predoi valeriu.predoi@gmail.com
wrote:

Sweet! This looks awesome!
Cheers,
V

On Tue, Jan 16, 2018 at 4:06 PM, Bouwe Andela notifications@github.com
wrote:

Integrated this function in this branch: https://github.com/ESMValGroup
/ESMValTool/tree/REFACTORING_multi_model_mean


You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub
https://github.com/ESMValGroup/ESMValTool/issues/168#issuecomment-358012467,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AbpCo7qHg7ywU499PcWtDID3G89aDDslks5tLMkJgaJpZM4Q7Nrp
.

--
Dr. Valeriu Predoi
Computational Scientist for UKESM Core Team
Department of Meteorology, University of Reading
Earley Gate, Office 1U08
READING, RG6 6BB
United Kingdom
Mobile number: 07847416092

"If one day you be questioning your ability to come up with professional
results, think of this: Noah's ark was built by farmers whereas the Titanic
was crafted by skilled engineers"

--
Dr. Valeriu Predoi
Computational Scientist for UKESM Core Team
Department of Meteorology, University of Reading
Earley Gate, Office 1U08
READING, RG6 6BB
United Kingdom
Mobile number: 07847416092

"If one day you be questioning your ability to come up with professional
results, think of this: Noah's ark was built by farmers whereas the Titanic
was crafted by skilled engineers"

There is a question though -- median values can be (quite frequently) masked, Mattia, how do you want to handle these, just get all the numerical values only and find the median? That would mean medians at different positions in the data though

Yes, median (and mean) shall be calculated using valid data only (i.e., excluding missing/masked values).

The mask should be pretty much consistent across models after masking has been applied, right?

Hey guys, just pushed the fixes for multi_model_mean (_multimodel.py) on the branch created by Bouwe yesterday (REFACTORING_multi_model_mean) -- have a look and test away pls :)
Cheers,
V

Great! :clap:

I would suggest to submit a pull-request, so we can directly comment on the code.
Will try to test it by the end of the week, pretty busy right now :disappointed:

Sure thing, will do a PR in a jiffy (added a few more things including the
run once decorator and the NCL version check suggested by Bouwe so the
output can be piped). But before that I wanted to ask you one thing. As it
is now, this bit in namelist.py

    for key in 'reference_model', 'alternative_model':
        if key in variable:
            exclude_models.add(variable[key])
    exclude = {
        '_filename': [
            get_output_file(v, config_user['preproc_dir'])
            for v in variables if v['model'] in exclude_models
        ]
    }
    settings['multi_model_mean']['exclude'] = exclude

will always exclude reference models and alternative models if present in
namelist, is this okay or should we change it so sometimes we include those
models to multimodel statistics computations?

Cheers,
V

On Wed, Jan 17, 2018 at 2:18 PM, Mattia Righi notifications@github.com
wrote:

Great! 👏

I would just submit a pull-request, so we can directly comment on the code.
Will try to test it by the end of the week, pretty busy right now 😞


You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub
https://github.com/ESMValGroup/ESMValTool/issues/168#issuecomment-358317752,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AbpCo6WZcREycKm0tKofMKnOo8AARbp3ks5tLgE0gaJpZM4Q7Nrp
.

--
Dr. Valeriu Predoi
Computational Scientist for UKESM Core Team
Department of Meteorology, University of Reading
Earley Gate, Office 1U08
READING, RG6 6BB
United Kingdom
Mobile number: 07847416092

"If one day you be questioning your ability to come up with professional
results, think of this: Noah's ark was built by farmers whereas the Titanic
was crafted by skilled engineers"

will always exclude reference models and alternative models if present in namelist, is this okay or should we change it so sometimes we include those models to multimodel statistics computations?

That's indeed the most frequent case, since ref/alt model are usually the observations, so it could be a good idea to exclude them by default.

I'm just wondering whether it is better to make this more transparent to the user, i.e. to have it set explicitly as an option (e.g., exclude = [reference_model, alternative_model]).

Some additional aspects to be improved:

  • [x] the resulting variables means and medians are 2D fields, but they should contain the time coordinate as well, i.e. the mean/median shall be calculated individually for each time-step.

  • [x] Use _FillValue (1.e+20) instead of zeros for masked values (as for the models).

  • [x] we should have separate output files for the mean and the median, and we need to choose a proper file naming convention (including at least time period). @bouweandela are there specific requirements in the new launchers interface?

  • [x] the diag-scipts do not "see" the new model (i.e., multi-model mean) and as a consequence it is not processed together with the other ones. I guess this relates to #182 @bouweandela ?

  • [x] the preproc dictionary has now a new key multi_model_mean, although the preprocessor can calculate both mean and median. We could either have 2 entries multi_model_mean and multi_model_median (but this may grow too much if we implement more stats) or just rename it to multi_model_stats (but then we need to add another option to specify which stats is/are calculated: stats: [mean, median]). I would prefer the latter solution.

  • [x] are the reference_model and the alternative_model excluded by default or shall they be explicitly given in the exclude option?

I created a new branch REFACTORING_MMM_fix for these issues.

Cheers, Mattia! The way things are now in https://github.com/ESMValGroup/ESMValTool/blob/REFACTORING_MMM_fix/esmvaltool/preprocessor/_multimodel.py differ a bit from what case we look at -- either full or overlap time spans between models:

  • if overlap, the code looks for only the common time points between all models; this common span is computed with a one hour accuracy (do we need greater accuracy?);
  • if full, the code doesn't care the amount of time each model has data for;
  • once the boundaries of this common time stretch interval are found (or conversely, the full model time), each cube is sub-setted on this time (or not, if 'full');
  • then on, the iris.analysis.MEAN is computed for each model on this time slice (or the full model time);
  • the last step is to compute the mean and median on all the models, using the time-averaged slices, resulting mean/median cubes with data in only (lat, lon) arrays;
  • the number of time points that the time averages have been computed per model are stored for each model in the data_size array;

Is this correct, from the point of view of data processing needed?

[Edit: restored original comment, edited by mistake]

if 'overlap', the code looks for only the common time points between all models; this common span is computed with a one hour accuracy (do we need greater accuracy?);

It should be OK for monthly data. We can refined it later if needed.

then on, the iris.analysis.MEAN is computed for each model on this time slice (or the full model time);

This is the point: the model should not be averaged in time, each model should stay on its original time coordinate.

the last step is to compute the mean and median on all the models, using the time-averaged slices, resulting mean/median cubes with data in only (lat, lon) arrays;

Correct, but the time coordinate shall be included as well (also other coordinates, such lev, if available).

Spatial coordinates are straightforward, since models are expected to be on the same vertical/horizontal grid after regridding.
Time coordinate is also straightforward in the overlap case.
The full option might be more tricky, since for a given timestep not all models are necessarily available (see my very first post in this issue).

Cool - thanks, Mattia! So in 'overlap' mode is easy (and correctly done now, well, apart from median that should not be of time-averaged models); 'full' then will have the time axis of the least time-restricted model and means and medians that will have to be weighted by a set of weights built using the number of time points available for each spatial point for each model. The weights array is a 2D lat-lon array, like a mask, that will be written to the nc file. I'll crack on with this if that's OK

full then will have the time axis of the least time-restricted model and means and medians that will have to be weighted by a set of weights built using the number of time points available for each spatial point for each model.

Good point! I hadn't thought about weighting, but indeed it should be there.

The weights array is a 2D lat-lon array, like a mask, that will be written to the nc file.

Aren't they spatially constant?

Aren't they spatially constant?

yes, sorry, bad phrasing: 1D array: with the length of the longest available (full) time axis, and number elements of at least 2 or masked for those positions where data is available from only one model (the longest in time model, obviously)

Hey guys, the code https://github.com/ESMValGroup/ESMValTool/blob/REFACTORING_MMM_fix/esmvaltool/preprocessor/_multimodel.py now handles all aspects required by this issue, with a few more things to be taken care of:

  • it works for overlaps with time coords only with "days since 1950-1-1" units, but this should be easily fixed for more general cases;
  • the overlaps are constructed with 1h precision; this should be a variable depending on what sort of data we have;
  • in 'full' mode there are no weights applied to the statistic; I will fix that for a later release;
  • mean and median are still written to the same 'multi_model_mean.nc' cubelist, but that's easily deconvoluted once we agree on what the best functions options should be; in fact, now we can ask for any sort of multimodel statistic and easily implement its mathematics in _compute_stats()

Test away and shoot me observations/suggestions/pints of beer if it works :)
Cheers, V

it works for overlaps with time coords only with "days since 1950-1-1" units, but this should be easily fixed for more general cases;

Should be OK. I think the cmorization makes sure that all models have the same time units (@jvegasbsc is that correct?)

the overlaps are constructed with 1h precision; this should be a variable depending on what sort of data we have;

Most of the data are monthly mean. We would need to test it with daily/hourly data, but it is OK for now.

in 'full' mode there are no weights applied to the statistic; I will fix that for a later release;

Fine.

mean and median are still written to the same 'multi_model_mean.nc' cubelist, but that's easily deconvoluted once we agree on what the best functions options should be; in fact, now we can ask for any sort of multimodel statistic and easily implement its mathematics in _compute_stats()

This is an issue: we need to have them in separate files, to make them available to the diagnostics in the framework @bouweandela is developing. Each stats has to be treated as a separate model.

We shall define an output filename for the multimodel stats using similar keys as in the models dictionary of the yaml namelist.
I would suggests:

[project]_[model]_[type]_[field]_[var]_[start_year]-[end_year]

with [project] = BACKEND, [model] = MultiModelMean or MultiModelMedian and [type] =fulloroverlap`.

This way, we can update the model list in the temporary file and the diagnostics would get a list of models including the ones created by the multimodel.

I think the cmorization makes sure that all models have the same time units (@jvegasbsc is that correct?)

Yes, that's correct. We can have problems with calendars when working with daily or hourly frequencies, but I really don know how to treat them. For example, how can we calculate a multimodel mean in daily frequency for no_leap and standard calendar models? What can we do with 29th February? And with 360_day models it gets even worse

Hi Javier, all, (sorry, this thread is getting boringly long) -- I am seeing differences in time units in the final cubes after preprocessor (rght before multimodel stats is done), namely the calendars differ only e.g.:

CMIP5_GFDL-ESM2G_Amon_historical_r1i1p1_T3M_ta_2000-2002.nc: Unit('day since 1950-01-01 00:00:00.0000000', calendar='noleap')
CMIP5_MPI-ESM-LR_Amon_historical_r1i1p1_T3M_ta_2000-2002.nc: Unit('day since 1950-01-01 00:00:00.0000000', calendar='proleptic_gregorian')

Is this an issue I should be concerned about? Cheers!

It should not be an issue, different models use different calendars and the diagnostics can handle it.
There is also no way to set a standard calendar for all models, since it is not always possible to convert from one to another (especially if you use daily data).

cheers, noted! V

Hey guys, finally, after about 2 months of this issue's existence (well, a month was Xmas-ing and eating and drinking :P) I got to a level to which I am happy with the results: https://github.com/ESMValGroup/ESMValTool/blob/REFACTORING_MMM_fix/esmvaltool/preprocessor/_multimodel.py does what's written on the box: multimodel means and medians, while taking care of plev masks, and spits out numerical results for cases where we have two or more models with valid data (not missing) for any given plev (if more than two); it does this in either overlap or full mode; it also does it with the capability of extracting any set of models from the overall set. This latest implementation also averts memory issues from chucking in many large multidimensional arrays into a single bucket and computing statistics on it. I'd still recommend running in multithread just in case -- the memory intake is still large (about 500MB/model array with dims 35x15x300x500 and similar mask) but I dont think we can go any lower unless we start reducing the float sizes... Last thing to be done now is to separate the means and medians to be written in two separate nc files, but that should be easy peasy :)

Flash news :)
The new updated version of this module is here: https://github.com/ESMValGroup/ESMValTool/blob/REFACTORING_finalize_multimodel/esmvaltool/preprocessor/_multimodel.py

this addresses the following previously standing issues:

  • use of datetime objects and no more by-hand conversions between calendars (in line with https://github.com/ESMValGroup/ESMValTool/issues/268 );
  • improved cube slicing method (previously suggested by Bouwe) but it's a slight change from B's suggestion since that implied direct cube slicing and since that also involved appending to a list, that is greatly inefficient; the current implementation is fast and adds virtually no extra memory burden;
  • compatibility with Python 3
  • ease of adaptation for different data frequencies (as is now, it should theoretically work with daily data as well, but I havent tested it with thay sort of data)

Question for @mattiarighi -- do you want the ref models not to be excluded by default (currently they are)

Yes, I think the excluded models should be clearly visible to the user:

  multi_model_statistics:
    span: overlap
    statistics: [mean, median]
    exclude: [ERA-Interim, NCEP]

or maybe better

    exclude: [reference_model, alternative_model]

?

probably

exclude: [reference_model, alternative_model]

is more general so one doesn't have to keep track of all the specific model names and change them only in one place and carry on the reference to them?

Definitely, that was my idea too.

But then, for consistency, the same should also be applied to other preprocessor keys, like target_grid and levels (for the latter see also here).

this: https://github.com/ESMValGroup/ESMValTool/commit/dccb0e1a645056edd0fc58ae467a1094aad24e89

now it's all up to the human to decide what's to be excluded: example of exclude list:
https://github.com/ESMValGroup/ESMValTool/blob/REFACTORING_finalize_multimodel/esmvaltool/namelists/namelist_perfmetrics_CMIP5_short.yml

so you can say exclude: [GFDL-ESM2G, reference_model, alternative_model] or exclude: [GFDL-ESM2G, reference_model, NCEP] or any combination therein or nothing (when's nothing, nothing will be excluded)

:grinning: :grinning:

Was this page helpful?
0 / 5 - 0 ratings

Related issues

valeriupredoi picture valeriupredoi  ·  3Comments

valeriupredoi picture valeriupredoi  ·  4Comments

jhardenberg picture jhardenberg  ·  5Comments

BenMGeo picture BenMGeo  ·  5Comments

nielsdrost picture nielsdrost  ·  3Comments