Tested with 7.8.4 using the following suite:
[scheduling]
[[dependencies]]
graph = root
[runtime]
[[bad-host]]
[[[remote]]]
host = bad
# host = $(rose host-select bad)
With host = bad the suite log contains
2020-01-02T10:00:43Z ERROR - [remote-init cmd] cat '<tempfile.SpooledTemporaryFile instance at 0x7f92c4dd1d40>' | cylc remote-init --host=bad 29cfe816-38b2-4fb3-923e-9fe6f682afe2 '$HOME/cylc-run/test.submision-failure'
[remote-init ret_code] 1
[remote-init err]
ssh: Could not resolve hostname bad: Name or service not known
ERROR: command returns 255: ssh -oBatchMode=yes -oConnectTimeout=8 -oStrictHostKeyChecking=no bad env CYLC_VERSION=7.8.4 bash --login -c ''"'"'exec "$0" "$@"'"'"'' cylc remote-init 29cfe816-38b2-4fb3-923e-9fe6f682afe2 ''"'"'$HOME/cylc-run/test.submision-failure'"'"''
2020-01-02T10:00:43Z INFO - [bad-host.1] -submit-num=01, owner@host=bad
2020-01-02T10:00:43Z ERROR - [jobs-submit cmd] (init bad)
[jobs-submit ret_code] 1
[jobs-submit err] REMOTE INIT FAILED
and the job activity log contains
[jobs-submit cmd] (init bad)
[jobs-submit ret_code] 1
[jobs-submit err] REMOTE INIT FAILED
This looks reasonable although it would be nice if the job activity log contained the ssh output.
With host = $(rose host-select bad) the suite log contains
2020-01-02T11:07:04Z ERROR - [remote-host-select cmd] timeout 10 bash -c 'rose host-select bad'
[remote-host-select ret_code] 1
[remote-host-select err]
[WARN] bad: (ssh failed)
[FAIL] No hosts selected.
2020-01-02T11:07:04Z ERROR - rose host-select bad: host selection failed:
COMMAND FAILED (1): rose host-select bad
COMMAND STDERR: [WARN] bad: (ssh failed)
COMMAND STDERR: [FAIL] No hosts selected.
2020-01-02T11:07:04Z ERROR - [Errno 2] No such file or directory: '/home/h02/frdm/cylc-run/test.submision-failure/log/job/1/bad-host/01/job-activity.log'
Traceback (most recent call last):
File "/net/home/h03/fcm/cylc-7.8.4/lib/cylc/task_events_mgr.py", line 84, in log_task_job_activity
with open(job_activity_log, "ab") as handle:
IOError: [Errno 2] No such file or directory: '/home/h02/frdm/cylc-run/test.submision-failure/log/job/1/bad-host/01/job-activity.log'
2020-01-02T11:07:04Z INFO - [jobs-submit cmd] (remote host select)
[jobs-submit ret_code] 1
[jobs-submit err]
rose host-select bad: host selection failed:
COMMAND FAILED (1): rose host-select bad
COMMAND STDERR: [WARN] bad: (ssh failed)
COMMAND STDERR: [FAIL] No hosts selected.
There is no job activity log file.
Expected behaviour is to have the job activity log file created (and no traceback in the suite log).
I think "bug" is a little harsh, we just want to handle the error better :grin:
@dpmatthews does this also cover that issue we found where someone had used $(rose host-select $(rose host-select))?
Been having a look at this, and from what I can tell, the job activity log contents
[jobs-submit cmd] (init bad)
[jobs-submit ret_code] 1
[jobs-submit err] REMOTE INIT FAILED
come from log_task_job_activity() in this part of submit_task_jobs()
The function log_task_job_activity():
also writes to the suite log
(this where the
2020-01-02T10:00:43Z ERROR - [jobs-submit cmd] (init bad)
[jobs-submit ret_code] 1
[jobs-submit err] REMOTE INIT FAILED
bit comes from I think)
However, I haven't been able to figure out which code is writing
2020-01-02T10:00:43Z ERROR - [remote-init cmd] cat '<tempfile.SpooledTemporaryFile instance at 0x7f92c4dd1d40>' | cylc remote-init --host=bad 29cfe816-38b2-4fb3-923e-9fe6f682afe2 '$HOME/cylc-run/test.submision-failure'
[remote-init ret_code] 1
[remote-init err]
ssh: Could not resolve hostname bad: Name or service not known
to the suite log. Presumably some LOG.error() call that can possibly be replaced with an appropriate log_task_job_activity() call?
Hope that isn't all a load of nonsense!
TL;DR I have figured out which part of the code is logging the SSH error in the suite log, but I don't see a way to log it in the job activity log.
The part that is LOG.error()ing the SSH error:
(the 2 LOG.error()s near the bottom)
The value self.remote_init_map[(host, owner)] = REMOTE_INIT_FAILED in the last line is what is returned by the method TaskRemoteMgr.remote_init() in task_remote_mgr.py
This method is called by TaskJobManager.submit_task_jobs() in:
leading to
In order to print the SSH error in the job activity log, there would have to be some way to obtain the SubProcContext object (containing the command to run and its status) in TaskJobManager.submit_task_jobs() in task_job_mgr.py. I'm talking about this SubProcContext:
which is logged here
and in string format is
[remote-init cmd] cat '<tempfile.SpooledTemporaryFile instance at 0x7fb6d38ed170>' | cylc remote-init --host=bad e2999112-98ed-4cee-b4c1-ffe534409a89 '$HOME/cylc-run/issue-3472'
[remote-init ret_code] 1
[remote-init err]
ssh: Could not resolve hostname bad: Name or service not known
ERROR: command returns 255: ssh -oBatchMode=yes -oConnectTimeout=10 bad env CYLC_VERSION=7.8.4-50-g9fc4d9-dirty bash --login -c ''"'"'exec "$0" "$@"'"'"'' cylc remote-init e2999112-98ed-4cee-b4c1-ffe534409a89 ''"'"'$HOME/cylc-run/issue-3472'"'"''
For host = $(rose host-select bad):
The bit responsible for putting the suite log error is (just above def _remote_init_callback() actually)
https://github.com/cylc/cylc-flow/blob/7b5389e1a3fe874b50fb8b18d5500e3a922ba09c/lib/cylc/task_remote_mgr.py#L293-L305
If the job activity log dir already exists, then the following gets written to it:
rose host-select bad: host selection failed:
COMMAND FAILED (1): rose host-select bad
COMMAND STDERR: [WARN] bad: (ssh failed)
COMMAND STDERR: [FAIL] No hosts selected.
log_task_job_activity() gets called here:
https://github.com/cylc/cylc-flow/blob/7b5389e1a3fe874b50fb8b18d5500e3a922ba09c/lib/cylc/task_job_mgr.py#L797-L804
which is called in
https://github.com/cylc/cylc-flow/blob/7b5389e1a3fe874b50fb8b18d5500e3a922ba09c/lib/cylc/task_job_mgr.py#L734-L737
As for the traceback in the suite log, an explanation is here in def log_task_job_activity():
https://github.com/cylc/cylc-flow/blob/1eb3a62780cf6b04ea4ee610952e14bfb4ce9830/lib/cylc/task_events_mgr.py#L83-L93
The bad host error seems to occur before the job is actually submitted, so before creation of the job directory. I'm not sure how easy it would be to ensure the job dir is created in this case. Would it be acceptable to just do this change so as not to get traceback?
diff --git a/lib/cylc/task_events_mgr.py b/lib/cylc/task_events_mgr.py
index 2df9cdc6f..65d2a5693 100644
--- a/lib/cylc/task_events_mgr.py
+++ b/lib/cylc/task_events_mgr.py
@@ -87,12 +87,12 @@ def log_task_job_activity(ctx, suite, point, name, submit_num=None):
# This happens when there is no job directory, e.g. if job host
# selection command causes an submission failure, there will be no job
# directory. In this case, just send the information to the suite log.
- LOG.exception(exc)
- LOG.info(ctx_str)
- if ctx.cmd and ctx.ret_code:
- LOG.error(ctx_str)
- elif ctx.cmd:
- LOG.debug(ctx_str)
+ LOG.info('%s\n(Job directory not yet created)', exc)
+ else:
+ if ctx.cmd and ctx.ret_code:
+ LOG.error(ctx_str)
+ elif ctx.cmd:
+ LOG.debug(ctx_str)
Is it ok to use LOG.error(message)聽in the suite log where host = badHost? Initially I thought that this would be bad since this will stop the suite (LOG.warning(message)聽would fix that), but actually I'm not sure:
It seems the reason the job dir is not being created is in this method
https://github.com/cylc/cylc-flow/blob/179e8aa426d611e2b21bb282f01b711df23cde89/lib/cylc/task_job_mgr.py#L733-L738
which returns False
https://github.com/cylc/cylc-flow/blob/179e8aa426d611e2b21bb282f01b711df23cde89/lib/cylc/task_job_mgr.py#L751-L764
before it has a chance to get down to this line
https://github.com/cylc/cylc-flow/blob/179e8aa426d611e2b21bb282f01b711df23cde89/lib/cylc/task_job_mgr.py#L775-L776
The method _prep_submit_task_job_impl() is where the job dir gets created,
https://github.com/cylc/cylc-flow/blob/179e8aa426d611e2b21bb282f01b711df23cde89/lib/cylc/task_job_mgr.py#L842-L843
but it never runs
Closed by #3569? Or do we need to increase the milestone for porting?
Waiting for platforms work to go in before we try to apply the fix to master.
Now that platforms has gone in,
[scheduling]
[[dependencies]]
graph = foo
[runtime]
[[foo]]
platform = bad
does not pass validation (PlatformLookupError: No matching platform "bad" found) so that problem is sorted. _Edit: ah wait, do I have to write a non-existent platform bad in global.cylc?_
I tried platform = $(rose host-select bad) but it seems to completely ignore it. I don't know how/if you should use rose host-select with platforms.
I tried platform = $(rose host-select bad) but it seems to completely ignore it. I don't know how/if you should use rose host-select with platforms.
This should warn only at validation, but fail job submit.
This should warn only at validation, but fail job submit.
It passes validation without warning and just runs locally without incident
On master?
Looking at this:
foo on localhost, whilst it doesn't try and run bad-host at all!Now that I fixed my example workflow
[scheduling]
[[dependencies]]
graph = foo
[runtime]
[[foo]]
# platform = bad
platform = $(rose host-select bad)
In the suite log I get
ERROR - [remote-host-select cmd] bash -c 'rose host-select bad'
[remote-host-select ret_code] 1
[remote-host-select out]
[remote-host-select err]
[WARN] bad: (ssh failed)
[FAIL] No hosts selected.
ERROR - [jobs-submit cmd] (remote host select)
[jobs-submit ret_code] 1
[jobs-submit err]
('rose host-select bad', None): host selection failed:
COMMAND FAILED (1): rose host-select bad
COMMAND STDERR: [WARN] bad: (ssh failed)
COMMAND STDERR: [FAIL] No hosts selected.
INFO - [foo.1] status=queued: (internal)submission failed at 2020-10-02T14:33:26+01:00 for job(01) flow(E)
ERROR - [foo.1] -submission failed
WARNING - suite stalled
while in the job activity log I get
[jobs-submit cmd] (remote host select)
[jobs-submit ret_code] 1
[jobs-submit err]
('rose host-select bad', None): host selection failed:
COMMAND FAILED (1): rose host-select bad
COMMAND STDERR: [WARN] bad: (ssh failed)
COMMAND STDERR: [FAIL] No hosts selected.
And the job activity log is created even if I delete everything apart from flow.cylc and run it from scratch.
So, closing as Tim has solved the problem on master
Oh, but there is no validation warning for platform = $(rose host-select bad) by the way @wxtim
Oh, but there is no validation warning for platform = $(rose host-select bad) by the way @wxtim
Yes, apparently it's likely to be ultra common, and either unnecessary or superfluous.
context; it's obvious to you that $(rose host-select bad) will always fail, however, it's not possible for Cylc to know that. The result you get could (and is intended to be) different every time.
Most helpful comment
context; it's obvious to you that
$(rose host-select bad)will always fail, however, it's not possible for Cylc to know that. The result you get could (and is intended to be) different every time.