Esmvaltool: Why do we want to CMORize all observational datasets?

Created on 4 Jun 2019 · 5 Comments · Source: ESMValGroup/ESMValTool

Maybe we should instead define interfaces to existing packages that take care of reading datasets into common Python data structures. For example, intake is particularly well suited to reading a very diverse set of data in different formats. Another interesting project, focused on satellite datasets, is Open Data Cube.
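To make the idea of a reader interface concrete, here is a minimal sketch using only the standard library. All names in it (`register_reader`, `load`, `EXAMPLE-OBS`) are hypothetical and not part of ESMValTool, intake, or Open Data Cube, and a plain `dict` stands in for whatever common data structure would actually be agreed on.

```python
from typing import Callable, Dict

# A "reader" turns a file path into a common in-memory structure.
# A plain dict is used here as a placeholder for that structure.
Reader = Callable[[str], dict]

# Hypothetical registry mapping dataset names to reader functions.
_READERS: Dict[str, Reader] = {}


def register_reader(dataset: str) -> Callable[[Reader], Reader]:
    """Decorator that registers a reader function for a named dataset."""
    def decorator(func: Reader) -> Reader:
        _READERS[dataset] = func
        return func
    return decorator


def load(dataset: str, path: str) -> dict:
    """Dispatch to the reader registered for the given dataset."""
    try:
        reader = _READERS[dataset]
    except KeyError:
        raise ValueError(f"no reader registered for {dataset!r}") from None
    return reader(path)


@register_reader("EXAMPLE-OBS")
def read_example_obs(path: str) -> dict:
    # A real reader would delegate to an existing package here;
    # this one returns fixed, made-up metadata for illustration.
    return {"source": path, "var_name": "tas", "units": "K"}
```

With this layout, supporting a new dataset means registering one more reader rather than producing a reformatted copy of the data, at the cost of running the reader on every load.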

observations question

bascrezee

All 5 comments

@hb326 thoughts? Comments?

zklaus on 4 Jun 2019

@mattiarighi just answered this question for me offline. His answer is:

because we want to have a pool of observational data

bascrezee on 4 Jun 2019

That does not really answer the question, because you can also have a pool of observational data without reformatting it.

I think the real answer is probably more of a perceived run-time advantage: reformatting takes some time, so if you have to do it every time you run a recipe, it could be slower.
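The run-time trade-off can be sketched as a cache: reformat on first use, then reuse the stored result on later runs. This is only an illustration under stated assumptions. `reformat` is a hypothetical stand-in for an expensive CMORization step, text files stand in for real data files, and the cache key (file name, size and modification time) is one possible choice, not how ESMValTool does it.

```python
import tempfile
from pathlib import Path


def reformat(raw: str) -> str:
    """Hypothetical stand-in for an expensive reformatting step."""
    return raw.upper()


def load_reformatted(raw_file: Path, cache_dir: Path) -> str:
    """Return reformatted content, reformatting only on a cache miss."""
    stat = raw_file.stat()
    # Key on name, size and modification time so that a changed
    # input file is reformatted again instead of served stale.
    key = f"{raw_file.name}-{stat.st_size}-{stat.st_mtime_ns}"
    cached = cache_dir / f"{key}.txt"
    if cached.exists():
        return cached.read_text()
    result = reformat(raw_file.read_text())
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached.write_text(result)
    return result


def demo():
    """Load the same file twice; the second call hits the cache."""
    with tempfile.TemporaryDirectory() as tmp:
        raw_file = Path(tmp) / "obs.txt"
        raw_file.write_text("tas 280.5")
        cache_dir = Path(tmp) / "cache"
        first = load_reformatted(raw_file, cache_dir)
        second = load_reformatted(raw_file, cache_dir)
        n_cached = len(list(cache_dir.iterdir()))
    return first, second, n_cached
```

A cache like this sits between the two positions in the thread: the raw data stays in its original format, but the reformatting cost is paid once per input file rather than once per recipe run.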

bouweandela on 6 Jun 2019

I'll take a step back and point to a few things:

Reformatting needs to follow a number of standards: most importantly the CF and CMOR conventions, but also ESMValTool-specific conventions that are not strictly enforced but make life easier (e.g. preferred time units, preferred metadata items, etc.). So it's much better if reformatting is done once and the reformatted data is shoved into a box from where it can be used right out of the box.
As @bouweandela points out, reformatting can be done on the fly, but it can be time- and CPU-consuming, depending on how much data needs to be converted.
One social aspect of this is the user's comfort in knowing that there is a database of nicely formatted data that they can just point the tool to, and everything runs smoothly (i.e. the risk of the tool failing is smaller because it needs to perform fewer actions). ESGF nodes give the user the same thing: a nice place where nice data lives (nice my arse, given how many problems ESGF data has, but that's a different kettle of fish altogether).
Lastly, the question of LARGE datasets comes to mind: the ones so large that they can't be stored in one place and need to be reformatted on the fly. Those should probably be handled on the fly, but apart from that I reckon that if we can store the data, we should run the reformatting as few times as possible.
:beer: