ESMValTool: preproc files highly demanding on disk space

Created on 15 May 2018 · 6 Comments · Source: ESMValGroup/ESMValTool

Thanks @mattiarighi for giving a great, detailed overview of ESMValTool v2.0.0 to the group at DLR earlier today. What you guys are developing looks great and we are looking forward to getting our hands on things.

One issue came up during the meeting about the disk space usage of the preproc (formerly: climo) files. As currently implemented, the backend creates one new file for each model entry, each variable, and every combination of preprocessing settings, and it does so again every time the namelist is run (due to the timestamp directory structure). This will easily create hundreds of GB worth of preproc files. For example, just 4-5 run-throughs with a small namelist like namelist_SeaIce will add up to half a TB of preproc files. Sure, the preproc files of older runs can be deleted manually, but an option to recycle files that have been created with the same settings would be much more desirable.

All 6 comments

  • Adding an option to config-user.yml to automatically delete all preprocessed files after a successful run should be fairly easy and will solve the disk space issue.
  • Adding reliable caching of preprocessor results, as suggested at the end of the text, is not at all trivial and will require more work. It only makes sense to implement such a feature if it turns out that ESMValTool is slower than required and preprocessing takes a considerable amount of time compared to the time spent in diagnostic scripts. At the moment we have very few diagnostics implemented, so it is impossible to judge whether such a feature would be useful. I recommend waiting until we have at least a representative number of diagnostic scripts ported to version 2 before considering working on caching preprocessor results.
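
To make the "recycle files created with the same settings" idea concrete, here is a minimal sketch of how a cache key could be derived. It is not ESMValTool code: the function names, the `cache_dir` layout, and the example dataset and settings are all hypothetical. It also leaves out the parts that make reliable caching hard, such as detecting changed input data or a changed tool version.

```python
import hashlib
import json
from pathlib import Path


def preproc_cache_key(dataset, variable, settings):
    """Return a stable hex digest for one dataset/variable/settings combination.

    Hypothetical helper: keys are sorted so that the same settings always
    hash to the same value, regardless of the order they were written in.
    """
    payload = json.dumps(
        {"dataset": dataset, "variable": variable, "settings": settings},
        sort_keys=True,
        default=str,  # fall back to str() for values JSON cannot encode
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_file(cache_dir, key):
    """Return the path a cached result would have (assumed layout: <key>.nc)."""
    return Path(cache_dir) / f"{key}.nc"


# Hypothetical example values, only to show the call pattern.
key = preproc_cache_key("MODEL-A", "sic", {"regrid": "1x1", "mask": "sea"})
path = cached_file("preproc_cache", key)
if path.exists():
    print("reuse", path)
else:
    print("recompute and store at", path)
```

With a key like this, a second run with identical settings would find the existing file instead of writing a new one into a fresh timestamped directory.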

Alright, fine by me. Option 1 sounds good for now.
After the workshop we will probably have enough diagnostics ported to think about caching the preprocessed files. So this could be on the list for beta then, right?

Added to the to-do list in the beta project. :+1:

As @bouweandela says, option 1 should be fairly straightforward. One comment: it should be activated only if save_intermediary_cubes is set to False and the logging level is set to info, since otherwise the user is evidently debugging and will want to keep the files around. Caching can be done as well; it is a bit more involved than option 1 in terms of implementation, but it could be very useful when running ESMValTool in a centralized mode (e.g. on a dedicated partition on a big cluster).
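
A rough sketch of that guard, assuming the user configuration has been loaded into a plain dict. Only `save_intermediary_cubes` is taken from this thread; the `remove_preproc_dir` and `log_level` keys and both function names are assumptions made for illustration, not the actual implementation.

```python
import shutil
from pathlib import Path


def should_remove_preproc(cfg):
    """Decide whether preproc files may be deleted after a run.

    Assumed keys: 'remove_preproc_dir' (hypothetical opt-in switch) and
    'log_level' (hypothetical name for the logging level setting).
    """
    wants_cleanup = bool(cfg.get("remove_preproc_dir", False))
    keeps_intermediates = bool(cfg.get("save_intermediary_cubes", False))
    debugging = str(cfg.get("log_level", "info")).lower() != "info"
    return wants_cleanup and not keeps_intermediates and not debugging


def cleanup_preproc(cfg, preproc_dir, run_succeeded):
    """Delete preproc_dir only after a successful run; return True if deleted."""
    if run_succeeded and should_remove_preproc(cfg):
        shutil.rmtree(Path(preproc_dir), ignore_errors=True)
        return True
    return False
```

Deleting only after a successful run means a failed run leaves its preproc files in place for inspection.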

PR #410 merged, branch deleted, issue closed, dogs walked
