ESMValTool: More flexibility in recipe datasets

Created on 19 Oct 2018 · 17 Comments · Source: ESMValGroup/ESMValTool

Hi,

I'm in the process of setting up a marine biogeochemistry recipe. One of the issues is that there is no standard set of BGC fields. For instance, not all models use the same nutrients (no3, po4, iron, silicate, etc.). Similarly, the models have different phytoplankton functional types, and so on.

Is there a way to set up a diagnostic variable where it runs only over the available files, and doesn't fail if ESMValTool can't find a combination of dataset and variable? I'd rather not have to set up individual lists of models for each variable.

Similarly, as mentioned in the issue https://github.com/ESMValGroup/ESMValCore/issues/76, it would be great to have a similar flag for times. I'd like to run over all available model years without having to set the exact time range for each model and each variable every time. (I'm debugging this recipe locally and I have not made a full copy of all the relevant CMIP5 model data on my local machine.)

Do these features exist already, or is this a feature request?

enhancement

All 17 comments

@ledm you can add the datasets that apply only to a particular variable as additional_datasets in the diagnostic, when specifying the variable parameters, e.g.:

  diagnostic2:
    description: Another Python tutorial diagnostic.
    variables:
      tas:
        preprocessor: preprocessor2
        field: T2Ms
        additional_datasets:
          - {dataset: bcc-csm1-1,  project: CMIP5,  mip: Amon,  exp: historical,  ensemble: r1i1p1,  start_year: 2000,  end_year: 2002}
        reference_dataset: MPI-ESM-LR
    scripts:
      script2:
        script: examples/diagnostic_object_oriented.py
        quickplot:
          plot_type: contourf

(taken from example recipe)

Thanks for the reply, @valeriupredoi, but that's not what I'm asking.

I want to set a list of global models once - in the dataset section of the yml. Then, I want ESMValTool to try and find them, but not throw a hissy fit (fatal error) if it can't find them. If ESMValTool can't find all the models, I still want it to run over the ones it did find.
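To make the request concrete, here is a minimal Python sketch of "run over what was found, warn about the rest". It is not ESMValTool code: `select_available`, `has_data`, and the `inventory` set are hypothetical names, and the inventory is made-up illustration data rather than real CMIP5 availability.

```python
import logging

logger = logging.getLogger(__name__)


def select_available(datasets, short_name, has_data):
    """Return the datasets for which has_data(dataset, short_name) is true.

    Missing dataset/variable combinations are logged as warnings
    instead of raising a fatal error.
    """
    found = []
    for dataset in datasets:
        if has_data(dataset, short_name):
            found.append(dataset)
        else:
            logger.warning(
                "No data for %s / %s, skipping", dataset["dataset"], short_name
            )
    return found


# Made-up inventory standing in for a real search of the data directories.
inventory = {
    ("HadGEM2-ES", "no3"),
    ("HadGEM2-ES", "po4"),
    ("MPI-ESM-LR", "no3"),
}
datasets = [
    {"dataset": "HadGEM2-ES"},
    {"dataset": "MPI-ESM-LR"},
    {"dataset": "NorESM1-ME"},
]


def has_data(dataset, short_name):
    """Hypothetical availability check against the made-up inventory."""
    return (dataset["dataset"], short_name) in inventory


for short_name in ("no3", "po4"):
    names = [d["dataset"] for d in select_available(datasets, short_name, has_data)]
    print(short_name, names)
```

The global list is written once, and each variable simply ends up with the subset that exists.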

Man, @ledm, you can't just throw a whole bunch of datasets into the datasets section of the recipe and hope that some will be found for some variables and the others will just be ignored. You have to list the ones you know are available for all variables there, and add the ones that are specific to certain variables under those variables. This is something we will not implement, because a lot of the recipes make use of specific datasets, and the diagnostics running on those datasets change significantly if one or two or ten datasets are missing (think of multi-model statistics or anything that uses reference models). What we could do is run a first pass through the datasets without executing any diagnostic and print out information like: you requested 20 datasets for 3 variables, you are missing such and such dataset for such and such variable. Then you can adjust your recipe. @bouweandela @mattiarighi what do you say?
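The proposed first pass could look roughly like the sketch below. It only reports and never runs a diagnostic. The function name `availability_report`, the `has_data` callback, and the inventory are assumptions made for illustration, not an existing ESMValTool interface.

```python
def availability_report(datasets, variables, has_data):
    """Check every dataset/variable pair and describe what is missing.

    Returns a dict mapping each variable to its missing dataset names,
    plus a human-readable report. Nothing is executed.
    """
    missing = {
        short_name: [name for name in datasets if not has_data(name, short_name)]
        for short_name in variables
    }
    lines = [
        f"You requested {len(datasets)} datasets for {len(variables)} variables."
    ]
    for short_name, names in missing.items():
        for name in names:
            lines.append(f"Missing dataset {name} for variable {short_name}.")
    return missing, "\n".join(lines)


# Made-up inventory; a real tool would look at the data directories instead.
inventory = {
    ("HadGEM2-ES", "no3"),
    ("HadGEM2-ES", "po4"),
    ("MPI-ESM-LR", "no3"),
}
missing, text = availability_report(
    ["HadGEM2-ES", "MPI-ESM-LR", "NorESM1-ME"],
    ["no3", "po4"],
    lambda name, short_name: (name, short_name) in inventory,
)
print(text)
```

The user stays in control here: the recipe is adjusted by hand after reading the report, so multi-model and reference-dataset diagnostics never silently lose an input.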

Okay, how about this scenario: each model uses a different set of ensemble members. If I want to produce and compare the ensemble means of several models, do we currently have to specify every ensemble member in advance?

I'm not arguing for this to be the default behaviour, but you must admit that sometimes the shotgun approach is the right way to go. (Not always though!) This approach would make data exploration much simpler for users.

ie, I'd like to be able to do something like this in my recipe:

  - {dataset: HadGEM2-ES, project: CMIP5, exp: historical, ensemble: "r?i?p?", start_year: "*", end_year: "*"}

The ESMValTool interface would become much more user-friendly if we could use wildcards (?, *).
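As an illustration of how such wildcards could be expanded, the sketch below uses Python's standard `fnmatch` module to match an ensemble pattern against a list of members. The helper `expand_ensembles` and the list of available members are hypothetical; this is not a claim about what ESMValTool supports.

```python
from fnmatch import fnmatchcase


def expand_ensembles(pattern, available):
    """Return the sorted ensemble members in `available` that match `pattern`."""
    return sorted(member for member in available if fnmatchcase(member, pattern))


# Hypothetical ensemble members found for one model.
available = ["r1i1p1", "r2i1p1", "r10i1p1", "r1i1p2"]

print(expand_ensembles("r?i?p?", available))
# ['r1i1p1', 'r1i1p2', 'r2i1p1']
print(expand_ensembles("r*i*p*", available))
```

Note that `?` matches exactly one character, so `r?i?p?` misses `r10i1p1`; `r*i*p*` would be needed to catch members with two-digit indices.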

Some time ago we discussed implementing the ability to exclude datasets from specific diagnostics. I cannot find that discussion or remember whether this was implemented.

The idea was to use exclude and only to switch specific datasets on and off in some diagnostics.
This feature was available in v1, but not supported in v2 since the same result could be obtained with the existing capabilities of the yml format.

I've had a similar problem of "I want to include as many models as possible that have this variable" and just had the quick and dirty approach of writing a "model finder" program utilizing some of the _data_finder and config_developer parts:
https://seafile.zfn.uni-bremen.de/d/be5e30153dfa4cf696a0/
(Beware shitty low level programming with hardcoding, I barely write anything in python)

Currently it runs on the DLR and the DKRZ and assumes either data structure, which you could easily adjust, as well as the hardcoded paths, by just reading the data in config_developer. Using the settings in the uploaded file the print output would be a list (shortened):

# Following models and timeranges found for search query:
# Project: CMIP5, MIP: Amon, field: T2Ms, Experiment: rcp45, ensemble: r1i1p1, realm: atmos, variable: tas
# Specified timerange: None - None, Inclusion ok: False
# Using server: DKRZ, output_format: v2
      - {dataset: ACCESS1-0, start_year: 2006, end_year: 2100}
(...)
      - {dataset: NorESM1-ME, start_year: 2006, end_year: 2102}

which I just copy pasted into the recipe. If we don't want to introduce wildcards in the recipe, I could see the implementation of something similar in the utils folder instead of V's proposition. This could be more flexible and be used outside the ESMValTool as well.
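A stripped-down version of the same idea, for readers who cannot access the linked script: the sketch below is not that script. It assumes CMIP5-style file names (`<variable>_<mip>_<model>_<experiment>_<ensemble>_<start>-<end>.nc`), takes a plain list of file names instead of searching a server, and does not check for gaps between files. The function name and the example file names are made up for illustration.

```python
import re
from collections import defaultdict

# CMIP5-style file name; only the year part of the time range is captured.
CMIP5_NAME = re.compile(
    r"(?P<short_name>[^_]+)_(?P<mip>[^_]+)_(?P<dataset>[^_]+)_"
    r"(?P<exp>[^_]+)_(?P<ensemble>[^_]+)_"
    r"(?P<start>\d{4})\d*-(?P<end>\d{4})\d*\.nc$"
)


def recipe_entries(filenames, short_name, mip, exp, ensemble):
    """Return recipe dataset lines for all models matching the query."""
    wanted = (short_name, mip, exp, ensemble)
    years = defaultdict(list)
    for name in filenames:
        match = CMIP5_NAME.match(name)
        if match is None:
            continue  # not a CMIP5-style file name
        facets = match.groupdict()
        found = (facets["short_name"], facets["mip"], facets["exp"], facets["ensemble"])
        if found != wanted:
            continue
        years[facets["dataset"]] += [int(facets["start"]), int(facets["end"])]
    return [
        f"- {{dataset: {dataset}, start_year: {min(span)}, end_year: {max(span)}}}"
        for dataset, span in sorted(years.items())
    ]


# Illustrative file names, not a listing from a real server.
filenames = [
    "tas_Amon_ACCESS1-0_rcp45_r1i1p1_200601-210012.nc",
    "tas_Amon_NorESM1-ME_rcp45_r1i1p1_200601-210012.nc",
    "tas_Amon_NorESM1-ME_rcp45_r1i1p1_210101-210212.nc",
    "pr_Amon_ACCESS1-0_rcp45_r1i1p1_200601-210012.nc",
]
for line in recipe_entries(filenames, "tas", "Amon", "rcp45", "r1i1p1"):
    print(line)
```

The printed lines can be pasted into the datasets section of a recipe, in the same way as the output shown above.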

"What we could do is run a first pass through the datasets without executing any diagnostic and print out information like: you requested 20 datasets for 3 variables, you are missing such and such dataset for such and such variable. Then you can adjust your recipe."

I think this is the way to go.
And the utility suggested by @bettina-gier could be the way to implement this.

If you want to check the system for available data, scanning the entire data tree for each query might be inefficient. There were some offline discussions in the past with @cehbrecht about using a search engine for that purpose (specifically Apache Solr). Once in place, this would let you ask the questions raised above from within Python and get the response immediately, just like the search on ESGF, but local.
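The efficiency argument can be illustrated without Solr. The sketch below is a plain in-memory inverted index, a much simpler stand-in for a real search engine: the records are indexed once, and every later query is a set intersection instead of a new scan of the data tree. The function names and the records are hypothetical.

```python
from collections import defaultdict


def build_index(records):
    """Map each (facet, value) pair to the ids of the records that have it."""
    index = defaultdict(set)
    for record_id, facets in enumerate(records):
        for key, value in facets.items():
            index[(key, value)].add(record_id)
    return index


def query(index, records, **facets):
    """Return the records that match all given facets."""
    ids = set(range(len(records)))
    for item in facets.items():
        ids &= index.get(item, set())
    return [records[i] for i in sorted(ids)]


# Made-up records; in practice these would come from one scan of the data tree.
records = [
    {"dataset": "ACCESS1-0", "short_name": "tas", "exp": "rcp45"},
    {"dataset": "ACCESS1-0", "short_name": "pr", "exp": "rcp45"},
    {"dataset": "NorESM1-ME", "short_name": "tas", "exp": "rcp45"},
    {"dataset": "NorESM1-ME", "short_name": "tas", "exp": "historical"},
]
index = build_index(records)

for record in query(index, records, short_name="tas", exp="rcp45"):
    print(record["dataset"])
```

The cost of the scan is paid once when the index is built; keeping the index up to date as data arrives is the part a dedicated search engine would handle.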

As far as I'm concerned, this is indeed fixed by #587. Can we close this issue?

The utility created by @bettina-gier looks quite practical. Would it be worth putting together a command line tool based on this inside ESMValTool? This would allow us to explicitly output the exact expressions needed in a recipe.

Just came across this thread when I was looking for similar functionality. I agree that it would be nice to have this ship as a utility with ESMValTool. Any updates or new insights on this?

While I have not done any more work on the primitive tool above, @debe-kevin has been developing a recipe completion tool that takes a template recipe as input. Once it's ready, there will be a pull request here! Soon™

I've written up a script that has worked in the past, but is not currently working. It shouldn't take too much effort to bring it up to speed.

However, I'm not 100% happy with the interface. It may need some work to get it to be truly flexible.

@ledm That sounds like a great start. Can you make a (draft) PR to add it to the utils directory? Then we can further tune it there.

Created a PR with my recipe-filler: #1707
