ESMValTool: How to deal with a diagnostic that takes several variables as input, where at least one of them, but not all, needs to be available?

Created on 2 Apr 2019 · 8 Comments · Source: ESMValGroup/ESMValTool

I have a diagnostic that looks something like this:

def my_diagnostic(cfg):
    dataset = read_dataset(cfg)
    selected_vars = ['varA', 'varB', 'varC', 'varD']
    # sum only those selected variables that this dataset actually provides
    sum_of_vars = np.sum(
        [dataset.variables[var] for var in selected_vars
         if var in dataset.variables], axis=0)
    # after this I also write to the metadata which vars were actually present on this dataset

How should I deal with this in the recipe? To me it looks as if one needs to specify, for each diagnostic, the variables that are needed as input. But what I need is conditional passing of variables: if the variable is present in the dataset, pass it, but do not raise an error if the variable is missing from this particular dataset. Is this functionality available? If yes, that would be great. If not, I will see whether I can rewrite my diagnostic so that this is not needed.

diagnostic help wanted

All 8 comments

If a variable is not needed, don't put it in the recipe: there is no need to spend CPU time and memory on any preprocessing that may be done on its data. If your diagnostic needs to check which variables are passed to it, use the sorting functionality that assembles a dictionary keyed on variables, e.g.:

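from esmvaltool.diag_scripts.shared import group_metadata, run_diagnostic


def main(cfg):
    # metadata dicts for all files passed to this diagnostic
    input_data = cfg['input_data'].values()
    # group by variable; within each group, sort by dataset name
    grouped_input_data = group_metadata(
        input_data, 'short_name', sort='dataset')
    # variables that were actually passed to the diagnostic
    all_vars = grouped_input_data.keys()


if __name__ == '__main__':
    with run_diagnostic() as config:
        main(config)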

(whoops, forgot my signature pint :beer: )

Thanks for your help and cheers ! :coffee:

> if a variable is not needed don't put it in the recipe

If a variable is present in the file, it is needed. But not every dataset has all variables present. My data consists of fractions of land cover. A few different land cover categories are defined, e.g. a category can be crops_and_grasses. Whereas model A has both cropFrac and grasFrac present, model B has only cropFrac present. My diagnostic will take the sum of cropFrac and grasFrac for model A, but the 'sum' of only cropFrac for model B (this is the list comprehension in my example). In the metadata I keep track of what exactly is summed for each model. More calculations are done based on these categories of land cover, which is why I prefer not to write different diagnostic scripts based on what is available for each model.

Would this be possible with the currently implemented functionalities?
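
The per-model summing described above could look roughly like the sketch below. It is only an illustration and makes several assumptions: the list of metadata dicts stands in for cfg['input_data'].values(), the 'value' key is a hypothetical placeholder for loading the actual file, and the helper names group_by_dataset and sum_available are made up for this example. It also assumes the recipe only passes variable/dataset combinations that actually exist.

```python
from collections import defaultdict

# Land cover categories and the variables that may contribute to each one.
# The category name comes from the thread; the mapping itself is illustrative.
CATEGORIES = {'crops_and_grasses': ['cropFrac', 'grasFrac']}


def group_by_dataset(input_data):
    """Group metadata dicts by their 'dataset' entry (hypothetical helper)."""
    grouped = defaultdict(list)
    for meta in input_data:
        grouped[meta['dataset']].append(meta)
    return dict(grouped)


def sum_available(input_data, categories=CATEGORIES):
    """Sum, per dataset and category, whichever variables are present."""
    results = {}
    for dataset, metas in group_by_dataset(input_data).items():
        present = {meta['short_name']: meta['value'] for meta in metas}
        for category, wanted in categories.items():
            used = [name for name in wanted if name in present]
            if not used:
                continue  # nothing to sum for this dataset
            results[(dataset, category)] = {
                'sum': sum(present[name] for name in used),
                'summed_variables': used,  # record this in the metadata
            }
    return results


# Stand-in for cfg['input_data'].values(); 'value' replaces loading a file.
input_data = [
    {'dataset': 'modelA', 'short_name': 'cropFrac', 'value': 20.0},
    {'dataset': 'modelA', 'short_name': 'grasFrac', 'value': 30.0},
    {'dataset': 'modelB', 'short_name': 'cropFrac', 'value': 25.0},
]
results = sum_available(input_data)
```

The diagnostic itself never needs to know in advance which variables a model provides; it only looks at what was passed in and records what it used.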

> if your diagnostic needs to check what variables are passed to it use the sorting functionality that assembles a dictionary keyed on variables

I think the problem is that it is not possible to specify a variable that is present in some but not all of the available datasets. I tried this, but ran into a RecipeError, which indeed says that the variable cannot be found for that dataset.

( This issue arose while working on: https://github.com/ESMValGroup/ESMValTool-private/issues/171 )

All variable/dataset combinations specified in the recipe have to exist. However, it is possible to specify datasets per variable by using the additional_datasets keyword at variable level, so there should be no problem in putting your use case in a recipe.

See e.g.:
recipe_cox18nature.yml
examples/recipe_variable_groups.yml

Thanks a lot for this example, Bouwe. By making use of labels, it is possible to significantly reduce the number of input lines needed for my diagnostics. There is one thing that I have not got working yet, which I will try to demonstrate with a code sample:

DATASETS_LACKING_VAR_A: &lacking_a
  - {dataset: CanESM2}
  - {dataset: IPSL-CM5A-LR}

DATASETS_LACKING_VAR_B: &lacking_b
  - {dataset: MIROC5}
  - {dataset: MPI-ESM-LR}
  - {dataset: NorESM1-M}

# I want both of these lists of datasets to be passed towards the variable C in the diagnostic. 

diagnostics:
  diag_variable_groups:
    description: Trying to reference multiple labels for var C
    variables:
      varA: 
        additional_datasets: *lacking_b
      varB: 
        additional_datasets: *lacking_a
      varC:
        additional_datasets: 
          *lacking_a
          *lacking_b

This results in the following error:

yaml.parser.ParserError: while parsing a block mapping
  in "/home/crezees/evtrundir/recipe_landsurface_albedolandcover.yml", line 62, column 9
did not find expected key
  in "/home/crezees/evtrundir/recipe_landsurface_albedolandcover.yml", line 68, column 11

I tried several variations (e.g. [ *lacking_a , *lacking_b ]) for parsing, but didn't succeed.

I think you need to put a <<: in front of each of your starred labels and
place each on a new and indented line. I might be just blabbering tho,
Bouwe knows yaml better than I do :beer:

Dr Valeriu Predoi.
Computational scientist
NCAS-CMS
University of Reading
Department of Meteorology
Reading RG6 6BB
United Kingdom

Thanks Valeriu! :tea:

I did not succeed yet. The problem seems to be that the << (merge key) can only be used for mappings, whereas our datasets are defined as a sequence. See e.g. here. @bouweandela any ideas?

It looks like the short syntax you want is not possible with YAML, see e.g. this feature request https://github.com/yaml/yaml/issues/35, so you may end up having to write a few more lines of YAML.
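
If writing the combined list out by hand becomes tedious, one possible workaround is to build the lists in Python, where concatenation is trivial, and paste the generated text into the recipe. The sketch below is only an illustration under that assumption: the helper to_yaml_lines is hypothetical and not part of ESMValTool, and it only handles the simple {dataset: name} entries used in the sample above.

```python
# Dataset lists from the recipe above, as plain Python data.
lacking_a = [{'dataset': 'CanESM2'}, {'dataset': 'IPSL-CM5A-LR'}]
lacking_b = [
    {'dataset': 'MIROC5'},
    {'dataset': 'MPI-ESM-LR'},
    {'dataset': 'NorESM1-M'},
]

# List concatenation is the operation that YAML aliases cannot express.
variables = {
    'varA': {'additional_datasets': lacking_b},
    'varB': {'additional_datasets': lacking_a},
    'varC': {'additional_datasets': lacking_a + lacking_b},
}


def to_yaml_lines(variables, indent=6):
    """Render the mapping as YAML text to paste under 'variables:'."""
    pad = ' ' * indent
    lines = []
    for name, settings in variables.items():
        lines.append(f'{pad}{name}:')
        lines.append(f'{pad}  additional_datasets:')
        for entry in settings['additional_datasets']:
            lines.append(f"{pad}    - {{dataset: {entry['dataset']}}}")
    return lines


print('\n'.join(to_yaml_lines(variables)))
```

The generated varC block lists all five datasets explicitly, which is the "few more lines of YAML" mentioned above, just without typing them by hand.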

Ok, thanks, good to know that it is not possible at the moment.
