ESMValTool: File version vulnerability

Created on 31 Jul 2018 · 19 Comments · Source: ESMValGroup/ESMValTool

On BADC (and I assume this could be the case at DKRZ as well): in _data_finder.py, the two file-finding functions get_input_filelist() and get_input_fx_filelist() identify the latest version of a file by reverse-sorting the list of dirs in part1/. This is correct 99% of the time, but see the case below:

(esmvaltool_v2_dev) [valeriu@jasmin-sci3 valeriu]$ ls -la /badc/cmip5/data/cmip5/output1/BCC/bcc-csm1-1/historical/fx/atmos/fx/r0i0p0/
total 480
drwxr-x--- 6 badc open 4096 Jun 16  2017 .
drwxr-x--- 3 badc open 4096 Jan 13  2012 ..
drwxr-x--- 5 badc open 4096 Feb 28  2012 .v1_01
-rw-r--r-- 1 badc open    0 Mar 15  2012 COPY_CURRENT_20150326.txt
drwxr-x--- 8 badc open 4096 Feb 28  2012 files
lrwxrwxrwx 1 badc open    2 Oct 11  2013 latest -> v1
drwxr-x--- 5 badc open 4096 Oct 11  2013 v1
drwxr-x--- 5 badc open 4096 Jan 13  2012 v20110101

Here latest points to a very cryptic v1, whereas the code picks up v20110101 as the latest, a dir that contains junk files.
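To make the failure concrete, here is a minimal sketch that reverse-sorts the entry names from the listing above. It only mimics the "highest name wins" heuristic described in this issue; it is not the actual code from _data_finder.py, and the variable names are made up for illustration.

```python
# Entry names copied from the BADC listing above.
listing = [
    ".v1_01",
    "COPY_CURRENT_20150326.txt",
    "files",
    "latest",
    "v1",
    "v20110101",
]

# Plain string comparison: "v20110101" sorts above "v1" because
# the character "2" compares greater than the end of "v1" after "v".
ordered = sorted(listing, reverse=True)
picked = ordered[0]

print(picked)  # v20110101
```

So a purely lexicographic choice lands on v20110101 even though latest points to v1.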

The fix for this would be to first try to find latest and, if it is not found, fall back to the usual trick? @mattiarighi what is the latest dir called on DKRZ, is it latestversion?
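The proposed order of preference could be sketched roughly as follows. The helper name `resolve_version_dir` is hypothetical, and the fallback filter on names starting with "v" is an assumption made for this example; the real patch in ESMValTool may look different.

```python
import os
import tempfile


def resolve_version_dir(base_dir):
    """Hypothetical helper: return the version directory to read from.

    Prefers the `latest` entry when it resolves to a directory; otherwise
    falls back to the highest-sorting `v*` directory name.
    """
    latest = os.path.join(base_dir, "latest")
    # os.path.isdir follows symlinks and is False for a dangling link.
    if os.path.isdir(latest):
        return os.path.realpath(latest)
    versions = sorted(
        (d for d in os.listdir(base_dir)
         if d.startswith("v") and os.path.isdir(os.path.join(base_dir, d))),
        reverse=True,
    )
    if not versions:
        return None
    return os.path.realpath(os.path.join(base_dir, versions[0]))


# Rebuild the relevant part of the listing above in a scratch directory.
with tempfile.TemporaryDirectory() as base:
    for name in ("v1", "v20110101"):
        os.mkdir(os.path.join(base, name))
    os.symlink("v1", os.path.join(base, "latest"))
    with_latest = os.path.basename(resolve_version_dir(base))

    # Without the symlink, the old reverse-sort behaviour applies.
    os.remove(os.path.join(base, "latest"))
    without_latest = os.path.basename(resolve_version_dir(base))

print(with_latest, without_latest)  # v1 v20110101
```

Resolving the symlink with os.path.realpath keeps the real version name (v1 here) visible in whatever gets logged or recorded downstream.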

All 19 comments

At DKRZ this is always vYYYYMMDD and the code should find the most recent date here.

For BADC, it should pick up latest, according to the config-developer.yml file for this drs.

cheers @mattiarighi - and nope, the code assembles a reverse-sorted list and goes through it looking for valid subdirs, stopping at the first existing subdir. According to this logic, it will always pick up the highest-sorting vYYYYMMDD version, which is the wrong choice if latest is symlinked to another directory (as in the case above)

On DKRZ this problem used to be present, too. But after some "restructuring" over there they only left the most recent version. There used to be a "heuristic" implemented by Simon in interface_scripts/projects.py that tried to deal with this problem.

I am sure at BADC it is the same, but hoping for the best is a bit of a trap, especially when it comes to CEDA/Jasmin/BADC :grinning: - the easiest way to correct this potential issue is to first check if there is a latest and pick up its contents. Shit hits the fan if latest is symlinked to something older, but I think BADC has a script that checks that this does not happen every time a release is added

There is another thing. Usually in the newer version directory you only have the updated datasets. The unchanged datasets (in terms of newer version) are only available from the older version directories. So you can have:

|_tas
|_pr
v20180101
|_tas
|_pr
|_sic

In which case you will miss the sic dataset if you only look at the highest version number, as it is done in get_input_filelist() if I see it right.
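One way to avoid losing sic would be to pick the newest version per variable rather than one version for everything. The sketch below works on a plain dict instead of the filesystem; the function name and the version name v20180301 for the newer directory are hypothetical, chosen only for illustration. It also assumes that version names sort chronologically as strings, which holds for vYYYYMMDD but not for names like v1.

```python
def newest_version_per_variable(tree):
    """Map each variable to the highest-sorting version that contains it.

    `tree` is a hypothetical stand-in for a directory scan:
    version name -> list of variable subdirectories.
    """
    chosen = {}
    # Ascending order, so later (newer) versions overwrite earlier ones.
    for version in sorted(tree):
        for variable in tree[version]:
            chosen[variable] = version
    return chosen


tree = {
    "v20180101": ["tas", "pr", "sic"],
    "v20180301": ["tas", "pr"],  # hypothetical newer version without sic
}

result = newest_version_per_variable(tree)
print(result)
# {'tas': 'v20180301', 'pr': 'v20180301', 'sic': 'v20180101'}
```

This does not cover the case where a data centre deliberately points latest at an older version, so it would have to be combined with a check for latest.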

crap, @bjoernbroetz this issue is getting hairy, just gonna go for a smoke and forget about it then :grinning: Seriously now, I didn't know that - but are not the cluster guys supposed to symlink everything once a new release (version) is added to the database? If this is the case, then we can potentially miss a whole bunch of (older but still good) data

I rather think the cluster guys leave this as an exercise for the user ...
Synda is doing no such symlinks, if I see it right. And they use synda to create their folder trees.

this fixes the problem for fx files for BADC, should we do the same for regular data files as well? https://github.com/ESMValGroup/ESMValTool/pull/488/commits/246e10423cccb3cbdba10b7cf2e0cca8997b3abf

Wait, the case raised above by @bjoernbroetz has been solved some time ago.

See #319

look at that, I even said OK to that PR :grin: Yes, this is what the code does now: it does not pick up latest with a preference (well, not for the data files) but the most recent vYYYYMMDD version. The problem is: what if the really latest version is called v1 or CAT_ON_THE_ROOF and is correctly symlinked from latest?

If correctly symlinked then fine, because for BADC it will pick up latest.
If not correctly symlinked, then it's a problem of the data manager.

After a short conversation with @mattiarighi we concluded that we need your patch 246e104 to prioritize the "latest" folder if present (for fx and regular data).

Just for the record:
The situation addressed here is that a bad version of the data entered the system and is now there with a higher version number. The data center cannot remove this data (policy), but the corrected version is not there yet. So they point their symlink "latest" to the version before. This might affect only a single variable.

Is this solved? @valeriupredoi @bjoernbroetz

I think it is solved for get_input_fx_filelist() and merged with PR #488
but as @valeriupredoi wrote earlier it still needs to be fixed in get_input_filelist(). I will do it fast now.

A small patch is now here: 9981bc8b75e8c8ca00f907f46343c481e9181d02

Can you open a PR? I will merge it.

cool guys @mattiarighi @bjoernbroetz :beer:
