ESMValTool: Monitoring on-going model runs

Created on 3 Sep 2018 · 9 Comments · Source: ESMValGroup/ESMValTool

Hi,

this may be beyond the current remit of ESMValTool, but @valeriupredoi and I just discussed a way to make ESMValTool more useful for monitoring on-going model runs.

ESMValTool seems to run over the entire dataset, in full, every time. As far as we are aware, there is no way to "update" ESMValTool results with new model data. In an operational monitoring context, this is incredibly wasteful and expensive, as only a few years of model output are added every day.

@valeriupredoi and I don't think that using ESMValTool to monitor on-going runs would take a significant change to the code base. ESMValTool would need to run a check at the start of the preprocessing stage to look for older versions of a preprocessor output. It would then need to figure out which data are missing, preprocess the new data, and append it to the old preprocessed data.
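To make the proposal concrete, here is a minimal sketch of the "find what is missing, preprocess it, append it" logic. It is not ESMValTool code: `plan_update`, `update_cache` and the year-keyed dictionaries are hypothetical stand-ins for whatever the preprocessor would actually use to describe its existing output.

```python
def plan_update(cached_years, available_years):
    """Return the years that still need preprocessing, in ascending order."""
    cached = set(cached_years)
    return sorted(y for y in set(available_years) if y not in cached)


def update_cache(cache, available, preprocess):
    """Preprocess only the years missing from `cache` and add them in place.

    `cache` maps year -> preprocessed result; `available` maps year -> raw data.
    Both mappings are hypothetical stand-ins for files on disk.
    """
    new_years = plan_update(cache.keys(), available.keys())
    for year in new_years:
        cache[year] = preprocess(available[year])
    return new_years


def annual_mean(months):
    """Toy preprocessor: average twelve monthly values."""
    return sum(months) / len(months)


# Example: years 2000-2002 are already cached, the model has now reached 2004.
cache = {2000: 1.0, 2001: 2.0, 2002: 3.0}
raw = {year: [year - 1999.0] * 12 for year in range(2000, 2005)}
processed = update_cache(cache, raw, annual_mean)
print(processed)  # only the two new years are preprocessed
```

Only the two new years go through `annual_mean`; the three cached years are left alone, which is the saving the proposal is after.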

Please let me know whether this is a feasible idea.

Speaking as a model developer, I think that making ESMValTool capable of monitoring on-going runs would make a significantly more convincing argument for deploying the toolkit. At the moment, people set up their own ad hoc tools to monitor their model development. They get used to their ad hoc tools and become very comfortable with that toolkit. Then, when the model run is finished, they do not want to switch to a new and more complex tool like ESMValTool; they try to keep using their own tools. If ESMValTool were available for all stages of model development, many more people would use it.

Lee

enhancement standards

All 9 comments

I also want ESMValTool to replace our monitoring tools, so you can count on my support for this.

We should implement it, if possible, together with the functionality to reuse previously preprocessed output, which existed in version 1.0 but is still not available in version 2 (very important for recipe development). The main question here is how to decide whether we should recompute, extend or reuse the preprocessing.

One note: I think this is something that we can only enforce in the preprocessor; implementing it at the diagnostic level would be too ambitious.

I agree about the diagnostic level. That should be left up to the diagnostic writers to implement, if they want reusability.

However, re-using preprocessed data will not make sense in all cases. For instance, if you're monitoring a job and want the preprocessor to produce the time average of, e.g., the final five years of the run, it would not make sense to load the previous result.

Also, I wasn't aware that this was a feature in v1 (I'm still new to ESMValTool). Can we re-use any of the older methods?

The simplest and probably the crudest way is to write a cron script that runs at fixed time intervals, edits the recipe with incremental time chunks as data becomes available, and runs esmvaltool with the new recipe each time. This must use local data storage and not the ESGF node; one also needs to turn on the preproc removal. It is quite crude in the sense that it repeats all the processing that was done one time step before, but at least it is very easy to implement and entails absolutely no change to the current backend.
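The recipe-editing step of such a cron job could look roughly like the sketch below. It assumes the recipe has already been loaded into a dictionary (for example with a YAML library) and that each entry under `datasets` carries `start_year` and `end_year` keys; `extend_recipe` and the dataset name `MY-MODEL` are hypothetical.

```python
import copy


def extend_recipe(recipe, last_available_year):
    """Return a copy of `recipe` whose datasets end at `last_available_year`.

    Datasets that already reach that year (or beyond) are left unchanged.
    """
    updated = copy.deepcopy(recipe)
    for dataset in updated.get("datasets", []):
        if dataset["end_year"] < last_available_year:
            dataset["end_year"] = last_available_year
    return updated


# Hypothetical recipe fragment, as it might look after loading the YAML file.
recipe = {
    "datasets": [
        {"dataset": "MY-MODEL", "start_year": 1850, "end_year": 1860},
    ],
}

# The model has produced output up to 1865 since the last cron run.
new_recipe = extend_recipe(recipe, 1865)
print(new_recipe["datasets"][0])
```

The cron job would then write `new_recipe` back to disk and run esmvaltool on it. As noted above, every run still repeats the preprocessing of the earlier years.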

Here at BSC we run the model in chunks of several months or years using a workflow manager. We have another job that runs the diagnostics at fixed intervals (e.g. every chunk or every 5 chunks). It will be easy to create a job that creates a recipe for the available data and runs ESMValTool.

Reliably caching preprocessor results is not trivial; see also the discussion here: https://github.com/ESMValGroup/ESMValTool/issues/348#issuecomment-389228165. I think one of the ways in which this could be done somewhat reliably is by carefully tracking provenance and only reusing a data file if it has the exact same provenance. That would not work for changing datasets though (your use case). The way it was done in version 1 was completely unreliable as far as I know. Here are some issues that need to be taken into consideration:
1) The source data may have changed
2) The recipe may have changed; because of multi-model functions and the 'reference_dataset' option, it requires numerous checks to see whether results need to be recomputed
3) The esmvaltool software may have changed
4) Libraries used by esmvaltool may have changed
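
As a rough illustration of "only reuse a file if it has the exact same provenance", the four issues above can be folded into a single key. This is a sketch, not how ESMValTool tracks provenance: `provenance_key` is a hypothetical helper, file size and modification time stand in for a real content checksum, and the version numbers are placeholder values.

```python
import hashlib
import json
import os
import tempfile


def provenance_key(source_files, recipe_text, software_versions):
    """Return a hex digest that changes when any input to preprocessing changes."""
    record = {
        # Issue 1: size and modification time stand in for a content checksum.
        "sources": sorted(
            (path, os.path.getsize(path), os.path.getmtime(path))
            for path in source_files
        ),
        # Issue 2: the full recipe text, so any edit invalidates the cache.
        "recipe": recipe_text,
        # Issues 3 and 4: versions of the tool and of its libraries.
        "versions": software_versions,
    }
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tas_2000.nc")
    with open(path, "wb") as handle:
        handle.write(b"placeholder")
    versions = {"esmvaltool": "2.0.0", "iris": "2.2.0"}  # placeholder values
    key_a = provenance_key([path], "recipe v1", versions)
    key_b = provenance_key([path], "recipe v1", versions)
    key_c = provenance_key([path], "recipe v2", versions)

print(key_a == key_b, key_a == key_c)
```

A cached file would be reused only when its stored key matches the freshly computed one. Hashing the whole recipe is deliberately conservative; it avoids the numerous per-option checks at the cost of recomputing more often. It also shows why this does not help for a growing dataset: every new source file changes the key.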

As I've said before, before considering implementing caching preprocessor data, I think the following must be the case:
1) esmvaltool is too slow
2) the majority of the runtime (i.e. 80 percent or so) is spent by the preprocessor for a representative number of recipes
3) It is not possible to make the preprocessor faster in some easier way, e.g. by implementing slow preprocessor functions more efficiently.

So far I haven't seen proof that any of these three conditions is true. Has anyone taken the time to investigate this? Has a representative number of recipes been ported to version 2?

@bjoernbroetz and @LisaBock have been working a lot on this topic with our EMAC model, but with v1.
Still, they might be able to provide some hints.

So, for v1 we created a lightweight wrapper around the tool to use it for monitoring running simulations:

https://github.com/ESMValGroup/ESMValTool-private/tree/CMIP6DICAD_quicklooks_new/util/qlwrapper

It performs a "poor man's" parallelization (splitting a big old namelist into smaller ones and submitting them). The diagnostics used produce netCDF files that they themselves reuse in the next cycle. So the mechanism here for "time series data" is not to use the old climo files (the system of "caching" in v1) to update the plot, but to let the diagnostic produce an nc-file containing only the data needed for the plot.

This "other way of caching" doesn't overcome any of the points @bouweandela mentioned above. But we have been pragmatic here...

After a discussion between @ledm, @bouweandela, @jvegasbsc, @zklaus and myself at the June 2019 workshop, we came up with the following points that might need to be addressed to put this into action:

  1. ESMValTool acting on native model output (i.e. netCDF, but not following the CF conventions etc.) for each model that the monitoring shall be applied to
  2. persistent data files/folder that keep the information from previous cycles
  3. automatic report generation for the output

Point 1 could be addressed by adding fixes to the cmor module inside of the tool. Using external tools for this task was dismissed.
Point 2 could be addressed by passing a persistent path via the recipe to special monitoring diagnostics that will append to netcdf-files in the persistent path and create plots based on those files.
Point 3 is discussed at https://github.com/ESMValGroup/ESMValCore/issues/134
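
The idea behind Point 2 can be sketched as follows. To keep the example dependency-free it appends to a CSV file rather than a netCDF file, so it is only a stand-in for the real thing; `append_cycle`, the file name and the values are all made up for illustration.

```python
import csv
import os
import tempfile


def append_cycle(store_path, new_rows):
    """Append (year, value) rows not yet in the store; return the full series."""
    series = {}
    if os.path.exists(store_path):
        with open(store_path, newline="") as handle:
            for year, value in csv.reader(handle):
                series[int(year)] = float(value)
    # Skip years that an earlier cycle already wrote.
    fresh = [(year, value) for year, value in new_rows if year not in series]
    with open(store_path, "a", newline="") as handle:
        csv.writer(handle).writerows(fresh)
    series.update(fresh)
    return sorted(series.items())


# The temporary directory plays the role of the persistent path from the recipe.
with tempfile.TemporaryDirectory() as persistent_dir:
    store = os.path.join(persistent_dir, "global_mean_tas.csv")
    append_cycle(store, [(2000, 14.1), (2001, 14.2)])
    # The second cycle overlaps the first; 2001 is not duplicated.
    series = append_cycle(store, [(2001, 14.2), (2002, 14.3)])

print(series)
```

A monitoring diagnostic would call something like this once per cycle and draw its plot from the returned series, so each cycle only has to process the new model output.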
