ESMValTool: Delay concatenation until after fixing metadata

Created on 12 Apr 2018 · 11 Comments · Source: ESMValGroup/ESMValTool

I am having trouble with a dataset in which a few files have incorrect metadata. With the current backend workflow, I have to fix this at the file level before the files can be concatenated into a single cube, which means writing all the data to disk.

My idea is to add an extra merge/concatenate step after fixing metadata to allow us to fix these errors in the cube and avoid the extra writing.

What do you think, @mattiarighi @valeriupredoi ?

All 11 comments

Sounds good!

Sounds like a good plan, man! Do we not already have the merge/concatenate call there for cases where more than one file is needed to cover the requested timespan, or does that happen before fixing metadata? In any case, I am wondering what would be a good approach to push these fixed files somewhere users can access them, so they don't have to fix them every time they run ESMValTool. On BADC, for instance, we could ask the data people: 'hey, these are better files, replace the bad ones you are currently storing'.

The concatenation was done at the loading step, meaning that if there was any incompatibility in the metadata we ended up with a bunch of cubes that would make the backend crash in an unexpected way.
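To illustrate the ordering problem, here is a minimal sketch that uses a toy stand-in rather than Iris itself. `ToyCube`, `concatenate` and `fix_metadata` are hypothetical names made up for this example, not ESMValTool or Iris functions, and the file with wrongly labelled units is an invented case. It assumes that cubes are only joined when their metadata agree.

```python
from dataclasses import dataclass


@dataclass
class ToyCube:
    """Hypothetical stand-in for a cube: metadata plus time-ordered values."""
    var_name: str
    units: str
    times: list
    data: list


def concatenate(cubes):
    """Join cubes along time, but only those whose metadata match exactly.

    Returns one cube per distinct (var_name, units) combination.
    """
    groups = {}
    for cube in sorted(cubes, key=lambda c: c.times[0]):
        key = (cube.var_name, cube.units)
        if key in groups:
            groups[key].times.extend(cube.times)
            groups[key].data.extend(cube.data)
        else:
            # Copy the lists so the input cubes are never modified.
            groups[key] = ToyCube(cube.var_name, cube.units,
                                  list(cube.times), list(cube.data))
    return list(groups.values())


def fix_metadata(cube):
    """Hypothetical fix: a file labelled 'degC' whose values are really in K."""
    if cube.units == "degC":
        return ToyCube(cube.var_name, "K", cube.times, cube.data)
    return cube


files = [
    ToyCube("tas", "K", [0, 1], [280.0, 281.0]),
    ToyCube("tas", "degC", [2, 3], [282.0, 283.0]),  # incorrect metadata
    ToyCube("tas", "K", [4, 5], [284.0, 285.0]),
]

# Concatenating at load time: the mismatch leaves more than one cube.
at_load = concatenate(files)

# Fixing metadata first, then concatenating: a single cube comes out.
deferred = concatenate([fix_metadata(cube) for cube in files])

print(len(at_load), len(deferred))
```

The fix here works on the in-memory objects, so nothing has to be written back to disk before the concatenation.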

probably legacy from my backend version from last year when I was young and naive :))

I would avoid storing and reusing fixed data: this increases the storage load for the user (which is already high from the output itself) and could be error-prone. Also, as far as I understand, the fixes do not add much CPU time, right?

Sure, yes, agreed! But what I was thinking of was pushing these fixed files to databases like BADC to replace the defective ones there.

The problem is that all these files are tracked and replicated on several ESGF nodes (not only BADC, but also DKRZ, IPSL, etc.). Pushing fixed files to the database would require generating a new tracking-ID and making sure they are synchronized across the nodes.

Also, I'm not sure this is allowed without contacting the respective model groups who generated the original data.

It's probably not worth the effort.

Ok, that sounds like a big headache, forget about it :)

The impact of metadata fixes should be negligible; data fixes can be worse, but I don't think we will ever get to the point where they are really worth the extra space.

Anyway, at the moment it is more important to enable reusing backend outputs from other runs than anything else.

I think to some extent this is already allowed (the log shows some info about how to rerun diags)?

Yes, it does :)
