Cylc-flow: host selection command failure not handled properly

Created on 2 Jan 2020  路  18Comments  路  Source: cylc/cylc-flow

Tested with 7.8.4 using the following suite:

[scheduling]
    [[dependencies]]
        graph = root
[runtime]
    [[bad-host]]
        [[[remote]]]
            host = bad
#            host = $(rose host-select bad)

With host = bad the suite log contains

2020-01-02T10:00:43Z ERROR - [remote-init cmd] cat '<tempfile.SpooledTemporaryFile instance at 0x7f92c4dd1d40>' | cylc remote-init --host=bad 29cfe816-38b2-4fb3-923e-9fe6f682afe2 '$HOME/cylc-run/test.submision-failure'
    [remote-init ret_code] 1
    [remote-init err]
    ssh: Could not resolve hostname bad: Name or service not known

    ERROR: command returns 255: ssh -oBatchMode=yes -oConnectTimeout=8 -oStrictHostKeyChecking=no bad env CYLC_VERSION=7.8.4 bash --login -c ''"'"'exec "$0" "$@"'"'"'' cylc remote-init 29cfe816-38b2-4fb3-923e-9fe6f682afe2 ''"'"'$HOME/cylc-run/test.submision-failure'"'"''
2020-01-02T10:00:43Z INFO - [bad-host.1] -submit-num=01, owner@host=bad
2020-01-02T10:00:43Z ERROR - [jobs-submit cmd] (init bad)
    [jobs-submit ret_code] 1
    [jobs-submit err] REMOTE INIT FAILED

and the job activity log contains

[jobs-submit cmd] (init bad)
[jobs-submit ret_code] 1
[jobs-submit err] REMOTE INIT FAILED

This looks reasonable although it would be nice if the job activity log contained the ssh output.

With host = $(rose host-select bad) the suite log contains

2020-01-02T11:07:04Z ERROR - [remote-host-select cmd] timeout 10 bash -c 'rose host-select bad'
    [remote-host-select ret_code] 1
    [remote-host-select err]
    [WARN] bad: (ssh failed)
    [FAIL] No hosts selected.
2020-01-02T11:07:04Z ERROR - rose host-select bad: host selection failed:
    COMMAND FAILED (1): rose host-select bad
    COMMAND STDERR: [WARN] bad: (ssh failed)
    COMMAND STDERR: [FAIL] No hosts selected.

2020-01-02T11:07:04Z ERROR - [Errno 2] No such file or directory: '/home/h02/frdm/cylc-run/test.submision-failure/log/job/1/bad-host/01/job-activity.log'
    Traceback (most recent call last):
      File "/net/home/h03/fcm/cylc-7.8.4/lib/cylc/task_events_mgr.py", line 84, in log_task_job_activity
        with open(job_activity_log, "ab") as handle:
    IOError: [Errno 2] No such file or directory: '/home/h02/frdm/cylc-run/test.submision-failure/log/job/1/bad-host/01/job-activity.log'
2020-01-02T11:07:04Z INFO - [jobs-submit cmd] (remote host select)
    [jobs-submit ret_code] 1
    [jobs-submit err]
    rose host-select bad: host selection failed:
    COMMAND FAILED (1): rose host-select bad
    COMMAND STDERR: [WARN] bad: (ssh failed)
    COMMAND STDERR: [FAIL] No hosts selected.

There is no job activity log file.
Expected behaviour is to have the job activity log file created (and no traceback in the suite log).

bug

Most helpful comment

context; it's obvious to you that $(rose host-select bad) will always fail, however, it's not possible for Cylc to know that. The result you get could (and is intended to be) different every time.

All 18 comments

I think "bug" is a little harsh, we just want to handle the error better :grin:

@dpmatthews does this also cover that issue we found where someone had used $(rose host-select $(rose host-select))?

Been having a look at this, and from what I can tell, the job activity log contents

[jobs-submit cmd] (init bad)
[jobs-submit ret_code] 1
[jobs-submit err] REMOTE INIT FAILED

come from log_task_job_activity() in this part of submit_task_jobs()

https://github.com/cylc/cylc-flow/blob/9a17d203803d3500a758d316cf10d0d1a6ea12d2/lib/cylc/task_job_mgr.py#L252-L263

The function log_task_job_activity():

https://github.com/cylc/cylc-flow/blob/9a17d203803d3500a758d316cf10d0d1a6ea12d2/lib/cylc/task_events_mgr.py#L74-L85

also writes to the suite log

https://github.com/cylc/cylc-flow/blob/9a17d203803d3500a758d316cf10d0d1a6ea12d2/lib/cylc/task_events_mgr.py#L92-L93

(this where the

2020-01-02T10:00:43Z ERROR - [jobs-submit cmd] (init bad)
    [jobs-submit ret_code] 1
    [jobs-submit err] REMOTE INIT FAILED

bit comes from I think)

However, I haven't been able to figure out which code is writing

2020-01-02T10:00:43Z ERROR - [remote-init cmd] cat '<tempfile.SpooledTemporaryFile instance at 0x7f92c4dd1d40>' | cylc remote-init --host=bad 29cfe816-38b2-4fb3-923e-9fe6f682afe2 '$HOME/cylc-run/test.submision-failure'
    [remote-init ret_code] 1
    [remote-init err]
    ssh: Could not resolve hostname bad: Name or service not known

to the suite log. Presumably some LOG.error() call that can possibly be replaced with an appropriate log_task_job_activity() call?

Hope that isn't all a load of nonsense!

TL;DR I have figured out which part of the code is logging the SSH error in the suite log, but I don't see a way to log it in the job activity log.


The part that is LOG.error()ing the SSH error:

https://github.com/cylc/cylc-flow/blob/cab3598a2dad80e31872938ae7c29d2669f384bd/lib/cylc/task_remote_mgr.py#L307-L327

(the 2 LOG.error()s near the bottom)

The value self.remote_init_map[(host, owner)] = REMOTE_INIT_FAILED in the last line is what is returned by the method TaskRemoteMgr.remote_init() in task_remote_mgr.py

This method is called by TaskJobManager.submit_task_jobs() in:

https://github.com/cylc/cylc-flow/blob/cab3598a2dad80e31872938ae7c29d2669f384bd/lib/cylc/task_job_mgr.py#L213-L214

leading to

https://github.com/cylc/cylc-flow/blob/cab3598a2dad80e31872938ae7c29d2669f384bd/lib/cylc/task_job_mgr.py#L252-L263

In order to print the SSH error in the job activity log, there would have to be some way to obtain the SubProcContext object (containing the command to run and its status) in TaskJobManager.submit_task_jobs() in task_job_mgr.py. I'm talking about this SubProcContext:

https://github.com/cylc/cylc-flow/blob/cab3598a2dad80e31872938ae7c29d2669f384bd/lib/cylc/task_remote_mgr.py#L225-L228

which is logged here

https://github.com/cylc/cylc-flow/blob/cab3598a2dad80e31872938ae7c29d2669f384bd/lib/cylc/task_remote_mgr.py#L326

and in string format is

[remote-init cmd] cat '<tempfile.SpooledTemporaryFile instance at 0x7fb6d38ed170>' | cylc remote-init --host=bad e2999112-98ed-4cee-b4c1-ffe534409a89 '$HOME/cylc-run/issue-3472'
[remote-init ret_code] 1
[remote-init err]
ssh: Could not resolve hostname bad: Name or service not known

ERROR: command returns 255: ssh -oBatchMode=yes -oConnectTimeout=10 bad env CYLC_VERSION=7.8.4-50-g9fc4d9-dirty bash --login -c ''"'"'exec "$0" "$@"'"'"'' cylc remote-init e2999112-98ed-4cee-b4c1-ffe534409a89 ''"'"'$HOME/cylc-run/issue-3472'"'"''

For host = $(rose host-select bad):

The bit responsible for putting the suite log error is (just above def _remote_init_callback() actually)
https://github.com/cylc/cylc-flow/blob/7b5389e1a3fe874b50fb8b18d5500e3a922ba09c/lib/cylc/task_remote_mgr.py#L293-L305

If the job activity log dir already exists, then the following gets written to it:

rose host-select bad: host selection failed:
COMMAND FAILED (1): rose host-select bad
COMMAND STDERR: [WARN] bad: (ssh failed)
COMMAND STDERR: [FAIL] No hosts selected.

log_task_job_activity() gets called here:
https://github.com/cylc/cylc-flow/blob/7b5389e1a3fe874b50fb8b18d5500e3a922ba09c/lib/cylc/task_job_mgr.py#L797-L804

which is called in
https://github.com/cylc/cylc-flow/blob/7b5389e1a3fe874b50fb8b18d5500e3a922ba09c/lib/cylc/task_job_mgr.py#L734-L737

https://github.com/cylc/cylc-flow/blob/7b5389e1a3fe874b50fb8b18d5500e3a922ba09c/lib/cylc/task_job_mgr.py#L754-L765

As for the traceback in the suite log, an explanation is here in def log_task_job_activity():
https://github.com/cylc/cylc-flow/blob/1eb3a62780cf6b04ea4ee610952e14bfb4ce9830/lib/cylc/task_events_mgr.py#L83-L93

The bad host error seems to occur before the job is actually submitted, so before creation of the job directory. I'm not sure how easy it would be to ensure the job dir is created in this case. Would it be acceptable to just do this change so as not to get traceback?

diff --git a/lib/cylc/task_events_mgr.py b/lib/cylc/task_events_mgr.py
index 2df9cdc6f..65d2a5693 100644
--- a/lib/cylc/task_events_mgr.py
+++ b/lib/cylc/task_events_mgr.py
@@ -87,12 +87,12 @@ def log_task_job_activity(ctx, suite, point, name, submit_num=None):
         # This happens when there is no job directory, e.g. if job host
         # selection command causes an submission failure, there will be no job
         # directory. In this case, just send the information to the suite log.
-        LOG.exception(exc)
-        LOG.info(ctx_str)
-    if ctx.cmd and ctx.ret_code:
-        LOG.error(ctx_str)
-    elif ctx.cmd:
-        LOG.debug(ctx_str)
+        LOG.info('%s\n(Job directory not yet created)', exc)
+    else:
+        if ctx.cmd and ctx.ret_code:
+            LOG.error(ctx_str)
+        elif ctx.cmd:
+            LOG.debug(ctx_str)

Is it ok to use LOG.error(message)聽in the suite log where host = badHost? Initially I thought that this would be bad since this will stop the suite (LOG.warning(message)聽would fix that), but actually I'm not sure:

  • If the host is聽a typo or plain wrong聽we probably want the suite to fail.
  • If the host is temporarily unavailable聽we don't.

Closed by #3569? Or do we need to increase the milestone for porting?

Waiting for platforms work to go in before we try to apply the fix to master.

Now that platforms has gone in,

[scheduling]
    [[dependencies]]
        graph = foo
[runtime]
    [[foo]]
        platform = bad

does not pass validation (PlatformLookupError: No matching platform "bad" found) so that problem is sorted. _Edit: ah wait, do I have to write a non-existent platform bad in global.cylc?_

I tried platform = $(rose host-select bad) but it seems to completely ignore it. I don't know how/if you should use rose host-select with platforms.

I tried platform = $(rose host-select bad) but it seems to completely ignore it. I don't know how/if you should use rose host-select with platforms.

This should warn only at validation, but fail job submit.

This should warn only at validation, but fail job submit.

It passes validation without warning and just runs locally without incident

On master?
Looking at this:

  • I can't see the warning which ought to occur - it looks like I never plumbed the warning in - #3850 should answer.
  • I have no idea how you've managed to avoid the job submission failure, unless it's because the suite you've posted tries to run naked dummy job foo on localhost, whilst it doesn't try and run bad-host at all!

Now that I fixed my example workflow

[scheduling]
    [[dependencies]]
        graph = foo
[runtime]
    [[foo]]
        # platform = bad
        platform = $(rose host-select bad)

In the suite log I get

ERROR - [remote-host-select cmd] bash -c 'rose host-select bad'
    [remote-host-select ret_code] 1
    [remote-host-select out]
    [remote-host-select err]
    [WARN] bad: (ssh failed)
    [FAIL] No hosts selected.
ERROR - [jobs-submit cmd] (remote host select)
    [jobs-submit ret_code] 1
    [jobs-submit err]
    ('rose host-select bad', None): host selection failed:
    COMMAND FAILED (1): rose host-select bad
    COMMAND STDERR: [WARN] bad: (ssh failed)
    COMMAND STDERR: [FAIL] No hosts selected.
INFO - [foo.1] status=queued: (internal)submission failed at 2020-10-02T14:33:26+01:00  for job(01) flow(E)
ERROR - [foo.1] -submission failed
WARNING - suite stalled

while in the job activity log I get

[jobs-submit cmd] (remote host select)
[jobs-submit ret_code] 1
[jobs-submit err]
('rose host-select bad', None): host selection failed:
COMMAND FAILED (1): rose host-select bad
COMMAND STDERR: [WARN] bad: (ssh failed)
COMMAND STDERR: [FAIL] No hosts selected.

And the job activity log is created even if I delete everything apart from flow.cylc and run it from scratch.

So, closing as Tim has solved the problem on master

Oh, but there is no validation warning for platform = $(rose host-select bad) by the way @wxtim

Oh, but there is no validation warning for platform = $(rose host-select bad) by the way @wxtim

Yes, apparently it's likely to be ultra common, and either unnecessary or superfluous.

context; it's obvious to you that $(rose host-select bad) will always fail, however, it's not possible for Cylc to know that. The result you get could (and is intended to be) different every time.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

sadielbartholomew picture sadielbartholomew  路  4Comments

oliver-sanders picture oliver-sanders  路  3Comments

oliver-sanders picture oliver-sanders  路  4Comments

dpmatthews picture dpmatthews  路  3Comments

hjoliver picture hjoliver  路  5Comments