Cylc-flow: Forwards compatibility for job shell library extraction

Created on 26 Jul 2019 · 14 Comments · Source: cylc/cylc-flow

Note the below is the original problem that I observed & reported. The underlying issue is broader, so first read https://github.com/cylc/cylc-flow/issues/3250#issuecomment-517249767. Thanks.


Suites run using 8.0a0 can appear in the Cylc Review 'cycles list' (/cycles/) page with None listed under the '# jobs' heading for succeeded &/or failed jobs, whereas for suites run under Cylc 7, these columns never contain None, always an integer & notably 0 to indicate no jobs.

Note this is not consistently the case, as sometimes affected columns also display integers (see screenshots below).

I'm not sure how much of a problem this is (yet, I'll investigate soon) & it could simply be an issue with display (i.e. Cylc Review), but it could indicate a deeper problem in job data or processing for the current Cylc 8.

Release version(s) and/or repository branch(es) affected?

Having inspected many suites for a sample of users across the Met Office, using our internal Cylc Review service, I have determined that this issue only arises for suites that were (last) run using our development Cylc 8 version, i.e. those with a timestamp followed by INFO - Cylc version: 8.0a0 near the start of log/suite/log. (I suspected as much after seeing this only for members of the MO team & nobody else.)

Note that I was using Cylc Review on 7.8.3, as there is no such service for 8.0a0, with it being removed & not re-added yet.

Steps to reproduce the bug
Run a suite using both Cylc 8.0a0 & some Cylc 7.8.x, in turn. In each case, after running, navigate to the 'cycles list' page for that suite & examine the '# jobs' column.

Expected behavior
An integer instead of None for the jobs columns & (following from this, but as a separate requirement) a consistent report format for suites run with different versions of Cylc.

Screenshots

For two separate users who ran suites under 8.0a0 (Matt & me):

[Screenshot: NONE_scr_01a]
[Screenshot: NONE_scr_02a]

Whereas for all suites run/running with Cylc 7, the jobs columns consistently show integers, e.g.:

[Screenshot: NONE_fine_scr]

bug

All 14 comments

Okay, I'm doing a little investigation into this now. In the majority of (but not all) cases of suites showing this Issue in CR, I am seeing the following error in the job.err of tasks in the cycle points where some or all job status counts come up as 'None', & not for tasks in unaffected cycle points:

<...>/cylc-run/<suite name>/log/job/1/<task name>/<cycle point>/job: line 32: <...>/cylc-run/<suite name>/.service/etc/job.sh: No such file or directory
bash: cylc__job__main: command not found

It may or may not be related.

This is the same error reported in #3133.

Seems to relate to the installation of the job.sh into the suite run directory, which was discussed here & added in the commit referenced. For suites with some 'None' job count reports, there seems to be no etc under their .service, so there was perhaps some installation issue.

@hjoliver, does the following seem like it could be correct to you (this comment should summarise this Issue, so that you won't need to read the above!)?

I've come back to this & I think I have worked out what is happening here, from running a sample of suites in various contexts & inspecting the .service & Cylc Review (I've amended the title of the Issue to reflect this new realisation):

  • For newly-created suites which are run first under Cylc 8, the job.sh shell function library gets installed to the <run-dir>/.service/etc directory on start-up, as set up in https://github.com/cylc/cylc-flow/pull/3162/commits/453d90986e22c836d00232a215eb34d941f72f48 (see this review discussion);
  • But for "old" suites that have previously run under Cylc 7, when running with Cylc 8 the job.sh does not get installed under the service directory at any point (see some further evidence below), so that all jobs for those suites run into the error of not locating it, hence the job.err message appearing for all jobs in those suites as described in https://github.com/cylc/cylc-flow/issues/3250#issuecomment-515762458, notably <...>/cylc-run/<suite name>/.service/etc/job.sh: No such file or directory;
  • This means that when running suites in 8 which ran previously under 7, the job.sh is not found & does not run, meaning messages about job finish status do not get sent from within that script as required;
  • That leads to the None report for job counts in Cylc Review, as originally written up for the Issue here, & possibly other upstream issues. (It should also explain much of the weirdness I have seen in task/job state behaviour of late in development, & not got round to looking into, as most suites I run are old ones.)

Therefore, if my logic above is correct (can someone else look to see if they see the same symptoms & agree on the cause?), overall, we have a forward compatibility bug relating to the install of the job.sh. Since it is important that suites can be seamlessly ported from running under 7 to 8, it seems we need to either:

  • fix the logic so that the job.sh library is installed for suites that ran under Cylc 7 (or earlier) first, not just for suites run initially under 8; or
  • add logic to check the location of the job.sh, & use the appropriate location.
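
To make the second option concrete, here is a minimal sketch of a "check the location & use the appropriate one" lookup. It is not Cylc code: find_job_sh, job_sh_candidates & the fallback directory are hypothetical names for illustration only, & the only path taken from this Issue is <run-dir>/.service/etc/job.sh.

```python
from pathlib import Path
from typing import Iterable, List, Optional


def job_sh_candidates(run_dir: Path, fallback_dir: Path) -> List[Path]:
    """List places to look for job.sh, in order of preference.

    The first entry is the per-suite location described in this Issue.
    The second is a hypothetical shared location, purely illustrative.
    """
    return [
        run_dir / ".service" / "etc" / "job.sh",
        fallback_dir / "job.sh",
    ]


def find_job_sh(candidates: Iterable[Path]) -> Optional[Path]:
    """Return the first candidate that is an existing file, else None."""
    for path in candidates:
        if path.is_file():
            return path
    return None
```

A caller would treat a None result as a hard error before job submission, rather than letting each job fail later with "No such file or directory".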

Some evidence if helpful
  • new-cylc8 is a suite I have created today & run on Cylc 8 only;
  • subsuite-app-domingo I created last year & therefore ran in Cylc 7 mostly, but just ran in Cylc 8 (with the required environments used in each case, of course).
$ ls new-cylc8/.service/etc       
job.sh
$ ls subsuite-app-domingo/.service/etc
ls: cannot access subsuite-app-domingo/.service/etc: No such file or directory
$ ls subsuite-app-domingo/.service/   
db  passphrase  source  ssl.cert  ssl.pem
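
The same check can be run across a whole cylc-run directory to see which suites are affected. This is only a sketch: suites_missing_job_sh is a hypothetical helper, & it assumes suites sit directly under the directory given (nested suite names are not handled).

```python
from pathlib import Path
from typing import List


def suites_missing_job_sh(cylc_run: Path) -> List[str]:
    """Return sorted names of suites with .service but no etc/job.sh.

    Assumes each suite is a direct child of cylc_run (an illustrative
    simplification, nested suite names are not searched).
    """
    missing = []
    for service_dir in cylc_run.glob("*/.service"):
        if not service_dir.is_dir():
            continue
        if not (service_dir / "etc" / "job.sh").is_file():
            missing.append(service_dir.parent.name)
    return sorted(missing)
```

For the two suites above, suites_missing_job_sh(Path.home() / "cylc-run") would be expected to list subsuite-app-domingo but not new-cylc8.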

@sadielbartholomew - yes you're right. In fact, simply registering a suite under cylc-7 is enough to cause the problem - unless you re-register the suite with cylc-8 - because job.sh is installed by cylc-8 at registration time.

(This also means accidental deletion of .service/etc/job.sh will prevent cylc-8 jobs from running.)

(Mind you, deletion of anything in the suite service directory is bound to be dangerous, so that's fine :grimacing:).

Workaround:

$ cylc extract-resources ~/cylc-run/SUITE/.service etc/job.sh  

(Cylc 7 didn't have this problem because we stored job.sh in the Python library at the then-known-but-now-obsolete CYLC_DIR location.)

I suppose the right thing to do is extract job.sh at suite start-up rather than registration time.

Any objections?
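
A rough sketch of what "extract at start-up" could look like, for discussion only. It is not the actual implementation: ensure_job_sh is a hypothetical name, & a plain file copy stands in for the real extraction from package resources. Skipping the copy when the file already exists is also an assumption; always refreshing it at start-up would be an equally valid choice.

```python
import shutil
from pathlib import Path


def ensure_job_sh(service_dir: Path, packaged_job_sh: Path) -> Path:
    """Install job.sh under <service_dir>/etc if it is not already there.

    packaged_job_sh is a stand-in for wherever the library ships from;
    the real code extracts it from package resources instead.
    """
    target = service_dir / "etc" / "job.sh"
    if not target.is_file():
        # Create etc/ on demand, so suites registered under Cylc 7
        # (which have no etc/ directory) are covered too.
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(packaged_job_sh, target)
    return target
```

Calling this on every start-up is safe to repeat, & it also recovers from an accidentally deleted .service/etc/job.sh.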

I suppose the right thing to do is extract job.sh at suite start-up rather than registration time.

Sounds like a good solution to me, so no objection here! Though for a second opinion, I will bump @matthewrmshin (in person later at work, but tagging to help).

Sounds OK. Should also add to list of items to sync to remote job hosts:

That shouldn't be necessary as I'm already extracting job.sh from package resources on remote job hosts, at remote init time (i.e. when the first job is submitted to the remote host).

https://github.com/cylc/cylc-flow/blob/master/cylc/flow/task_remote_cmd.py#L55

Great!

I suppose the right thing to do is extract job.sh at suite start-up rather than registration time.

Excellent, sounds like we're good to go with this solution. Were you planning to implement this @hjoliver? If so, we should get you assigned on the Issue, but if not, I am happy to have a go?

I'll do it (I just had a look at the code, and it's a quick job).
