Cylc-flow: LOG.debug causes a failure of 16-timeout test in test-battery

Created on 17 Jan 2019 · 33Comments · Source: cylc/cylc-flow

I would like to add debugging information as to the command and arguments as well as the caller information to the subprocess popen wrapper.

If I add LOG.debug("test debug message") to the sprocess.py pcylc function the tests/job-submission/16-timeout.t test fails.

Without the debug log the egrep comparison in 16-timeout check correctly finds:-

ERROR - [jobs-submit cmd] cylc jobs-submit --debug -- /home/mryan/cylc-run/cylctb-20190117T010015Z/job-submission/16-timeout/log/job 1/foo/01
    [jobs-submit ret_code] -9
    [jobs-submit err] killed on timeout (PT10S)

When I add a debug statement the equivalent lines are:-

DEBUG - [jobs-submit cmd] cylc jobs-submit --debug -- /home/mryan/cylc-run/cylctb-20190117T011404Z/job-submission/16-timeout/log/job 1/foo/01
    [jobs-submit ret_code] 0
    [jobs-submit out] [TASK JOB SUMMARY]2019-01-17T12:14:08+11:00|1/foo/01|0|None

The return code is 0 and not -9

Test output with no debug statement in the wrapper look like this.

cylc test-battery -v tests/job-submission/16-timeout.t

ok 1 - 16-timeout-validate
ok 2 - 16-timeout-suite-run
ok 3 - log-cmp-ok
ok 4 - suite-state.log-contains-ok
ok
All tests successful.
Files=1, Tests=4, 21 wallclock secs ( 0.02 usr  0.00 sys +  2.05 cusr  0.77 csys =  2.84 CPU)
Result: PASS

Test output with one or more LOG.debug statements in the wrapper look like this.

cylc test-battery -v tests/job-submission/16-timeout.t

ok 1 - 16-timeout-validate
ok 2 - 16-timeout-suite-run
--- -   2019-01-17 15:41:51.354548153 +1100
+++ log 2019-01-17 15:41:51.063998311 +1100
@@ -1,3 +0,0 @@
-ERROR - [jobs-submit cmd] cylc jobs-submit --debug -- /home/mryan/cylc-run/cylctb-20190117T044115Z/job-submission/16-timeout/log/job 1/foo/01
-   [jobs-submit ret_code] -9
-   [jobs-submit err] killed on timeout (PT10S)
not ok 3 - log-cmp-ok
ok 4 - suite-state.log-contains-ok

    stdout and stderr stored in: /tmp/mryan/cylctb-20190117T044115Z/job-submission/16-timeout
Failed 1/4 subtests 

Test Summary Report
-------------------
tests/job-submission/16-timeout.t (Wstat: 0 Tests: 4 Failed: 1)
  Failed test:  3
Files=1, Tests=4, 35 wallclock secs ( 0.03 usr  0.00 sys +  1.77 cusr  0.69 csys =  2.49 CPU)
Result: FAIL

bug

Source

MartinRyan

All 33 comments

Funny. No solution/explanation, but can confirm it happens to me too (using your branch with bandit changes). Tried replacing the cmp_ok by two contains_ok and increasing the number of tests by one, but it would still miss the output about the job being killed.

Hacked a bit the test to output to a /tmp/loglog file, then saved the good and the bad (after including the simple LOG.debug('test debug message').

Only difference I noted was the two entries for test debug message (I believe the first one is when popen is used to execute the command to get Cylc's version). And also the missing output after the housekeeping watcher kills the job due to its timeout.

Can't explain why though. Maybe @matthewrmshin, @hjoliver or someone with more experience with the tests and with the way these commands are executed & killed. Sorry.

kinow on 17 Jan 2019

I've also reproduced this. Strange - ~~the test suite seems to function correctly~~, and it correctly identifies that foo.1 failed job submission, but the job submit command output only contains the new debug message and reports success status (0).

hjoliver on 17 Jan 2019

tests/job-submission/16-timeout.t really tests process pool kill-on-timeout, not "job submission timeout" as the name might suggest. It changes the at job submission template to sleep 30 (i.e. it doesn't actually submit a job) and sets (via global config) a low process pool timeout so the job submission process gets killed after only 10 seconds. Then cylc is supposed to log that it killed a managed subprocess (in this case a job submission process), and react accordingly (in this case, set the associated task to the submit-failed state).

hjoliver on 17 Jan 2019

With no LOG.debug() in the Popen wrapper, the test works as just described. Immediately after the "job submission", a bunch of main loop "Performing suite health check" lines appear in the log before the process is killed after 10 seconds.

With a LOG.debug() in the wrapper, the "job submission" process does not get killed - it runs for the full 30 seconds, which presumably explains the 0 success status. Moreover, there are no "Performing suite health check" lines logged by the main loop during the 30 seconds - it seems like the LOG call causes the main Cylc process to block and wait on the subprocess??!!

So that's interesting, and hopefully it provides a clue as to what's happening.

Bed time here now though. Maybe @matthewrmshin or @oliver-sanders can follow-up...

(BTW this same test is also failing on master in my environment, but for a less nefarious reason: the job submit command ERROR is logged twice for some reason).

hjoliver on 17 Jan 2019

I am trying to repeat this one on master at the moment.

matthewrmshin on 17 Jan 2019

I now suspect that we are still getting a blocking pipe from the logic of #2876.

matthewrmshin on 17 Jan 2019

fyi I did have to patch the 16-timeout test so that egrep worked correctly

MartinRyan on 18 Jan 2019

👍1

[UPDATE, (HJO) - __the theory espoused in this comment, and followed up in subsequent comments, is completely wrong!__]

Here's the reason why adding logging statements to @MartinRyan 's subprocess wrapper can screw up tests/job-submission/16-timeout.t (this is distinct from the blocking pipe problem, I think).

Jobs are submitted via the cylc jobs-submit command in a managed subprocesses (by which I mean, with our own subprocess machinery). That command in turn executes the actual jobs in managed subprocesses, using the same cylc library code.

We therefore have two different processes (the main process, and the cylc jobs-submit process) making the same log calls and writing asynchronously to the suite log , which can result in inter-mixed log lines (I've seen it). And this can mess with cmp_ok and contains_ok tests on the suite log.

hjoliver on 23 Jan 2019

👀1

Ooohh, that sounds more like what could be happening. While testing yesterday I did a os.getpid to check the process, and noticed different PID's... but after that I couldn't really understand why this could be a problem. I thought the logging module worked fine with threads and processes. Could it be because we have a custom file handler?

kinow on 23 Jan 2019

I'm not if there are any other log calls from code used in the scheduler and in cylc jobs-submit. If there are, this could be another reason for test flakiness.

Not sure what to do about it.

ensure there are NO log calls in that common code?
ensure that only the main process writes to the log?
(or other processes write to a different log?)

hjoliver on 23 Jan 2019

https://docs.python.org/2/library/multiprocessing.html#logging

the logging package does not use process shared locks so it is possible (depending on the handler type) for messages from different processes to get mixed up.

hjoliver on 23 Jan 2019

Sounds like fun! Though I suspect there should be something in logging, or another module that abstracts that away, and we don't have to implement any kind of special handling for multiprocessing in Cylc?

kinow on 23 Jan 2019

Yup, surprised it didn't happen before. multiprocess has been around in Cylc for a while... so maybe it happened now as we are adding more logging to the code?

kinow on 23 Jan 2019

We probably got away with it because our subprocess module is quite minimal (ish) and the commands we execute in subprocesses (main job submit, poll, and kill) do not use much library code beyond that.

If that's the case (not much code involved) the easiest solution might be just to have the CLI tools like cylc jobs-submit pass an argument to tell subprocess code not to write to the main log (and we could centralise that distinction in our logging module). ??

hjoliver on 23 Jan 2019

If I remember correctly, logging in cylc jobs-submit is now handled using the StreamHandler writing to STDERR, which is unbuffered. (We did not use logging before in these commands.) This means that when we poll the stderr pipe of the command's subprocess.Popen instance, we get a positive result, triggering the bug :bug:.

matthewrmshin on 23 Jan 2019

... triggering the bug.

You're mean the blocking bug here, right, not the newly-identified asynchronous logging bug?

hjoliver on 23 Jan 2019

The job submission command should not write to the main suite log. Has this branch changed that?

matthewrmshin on 23 Jan 2019

If I remember correctly, logging in cylc jobs-submit is now handled using the StreamHandler writing to STDERR, which is unbuffered...

Oh yeah, I forgot about that (I guess @matthewrmshin actually had this sussed all along @kinow!)

So perhaps @MartinRyan's branch change that?

hjoliver on 23 Jan 2019

Oh yeah, I forgot about that (I guess @matthewrmshin actually had this sussed all along @kinow!)

Oh, interesting!

Has this branch changed that?

I think so... from what I remember while debugging with the IDE, sprocess is used in this branch as the wrapper for the Popen. Adding the LOG.debug statement there, effectively meant logging for every command that needed to use Popen. So I think we would have to make sure to avoid that, or prevent using the cylc.LOG module with sprocess.py? Or even re-think sprocess?

kinow on 23 Jan 2019

Here's my proof of inter-mixed log lines:

2019-01-23T22:12:28+13:00 DEBUG - Performing suite health check
2019-01-23T22:12:29+13:00 ERROR - [jobs-submit cmd] cat /home/oliverh/cylc-run/knob/log/job/1/foo/01/job | cylc jobs-submit --debug --user=osboxes --remote-mode -- /home/osboxes/cylc-run/knob
/log/job 1/foo/01
        [jobs-submit ret_code] -9
        [jobs-submit err]
        2019-01-23T22:12:19+13:00 DEBUG - XXXXX:sleep 30

        killed on timeout (PT10S)
2019-01-23T22:12:29+13:00 ERROR - [jobs-submit cmd] cylc jobs-submit --debug --user=osboxes --remote-mode -- /home/osboxes/cylc-run/knob/log/job 1/foo/01
        [jobs-submit ret_code] 1
        [jobs-submit out] 2019-01-23T22:12:29+13:00|1/foo/01|1
2019-01-23T22:12:29+13:00 INFO - [foo.1] -(current:ready) submission failed at 2019-01-23T22:12:29+13:00

Notice DEBUG - XXXXX:sleep 30 appears in the middle of the command err log line.

That comes from LOG.debug("XXXXX:" + str(cmd)) just before Popen() in Martin's sprocess:pcylc() wrapper function.

hjoliver on 23 Jan 2019

(And sleep 30 is what 16-timeout.t tests replaces the at command with. )

hjoliver on 23 Jan 2019

@matthewrmshin - is it just that we deliberately write to stderr in cylc jobs-submit (and the library code that supports it) instead of loggin, NOT that we altered logging config to redirect all log calls to stderr, in that context? If so, Martin didn't change anything, he just added a logging call in that code. It would be nice to do the "redirect" thing, so that the library code can just call the logger without caring if it lives in the main process or a subprocess...

hjoliver on 23 Jan 2019

(Doesn't look like anything important changed on #2872

hjoliver on 23 Jan 2019

(Doesn't look like anything important changed on #2872

I think the wrapper around the Popen _contributed_ to this issue. If it replaces all occurrences of Popen in the code, it could be significant change I think.

I agree that it would be nice to do the redirect, or something else to prevent other devs (like myself) from accidentally using a logger where it's not supposed to be used.

kinow on 23 Jan 2019

I think the wrapper around the Popen contributed to this issue. ...

Yes, sorry, I meant "other than the wrapper", which (unlike the previously separate Popen calls) is shared between main and subprocesses.

hjoliver on 23 Jan 2019

I suppose the quick fix is just to add a "no_log" arg to Martin's wrapper, and use that in all calls from lib/cylc/batch_sys_manager.py - which is the library code that underlies the sub-process commands.

And comment in batch_sys_manager exactly why it is that we do not make log calls there.

Code shared by main and subprocess commands:

the new wrapper
lib/cylc/subprocpool.py (better check the few log calls in there are OK?)
any other? (probably not much, if any).

hjoliver on 23 Jan 2019

(Gotta go address CISE article feedback from @sadielbartholomew now...)

hjoliver on 23 Jan 2019

👀1

I still don't understand the issue here. From how I understand the (original) system, the only thing that writes to the suite log should be the suite server program. The sequence of event should be like this:

Suite server program writes to suite log.
Suite server program invokes `cylc jobs-submit``.
cylc jobs-submit being an external command that can run on localhost or remote host does not write to suite log. Instead it communicates with suite server program only with STDOUT and STDERR. STDOUT is chosen for the normal output for job submission statuses, etc. STDERR is chosen for any extra diagnostics message. Therefore, its normal logger is only attached to a StreamHandler writing to STDERR.
cylc jobs-submit writes stuff to STDOUT and STDERR.
Suite server program captures these output either while cylc jobs-submit is running or when it completes.
STDOUT and STDERR (including log messages) from cylc jobs-submit is written to the suite log by the suite server program when the command completes.
Note: the logic handling task submission failure may write something extra to the log. Is that what is also adding to the confusion?

matthewrmshin on 23 Jan 2019

I agree (now that you've reminded me!) that that's what's supposed to happen. Except I'm hazy on the Streamhandler bit. Is that a general "redirect" so that any call to log.debug("blah") in the other process will actually go to stderr, or not?

Note: the logic handling task submission failure may write something extra to the log. Is that what is also adding to the confusion?

I don't think my transcript above conforms to your narrative!

https://github.com/cylc/cylc/issues/2929#issuecomment-456737322

... it certainly looks like my log.debug("XXXXX" + cmd) is printing to the log in the middle of a main process log string.

hjoliver on 23 Jan 2019

Yes. Logic here:

https://github.com/cylc/cylc/blob/1c55f5b4b066775b6f261556605f867a78488607/lib/cylc/option_parsers.py#L288-L295

Logging for all CLI commands goes to STDERR by default. The suite server program overrides this here when starting a daemon:

https://github.com/cylc/cylc/blob/1c55f5b4b066775b6f261556605f867a78488607/lib/cylc/scheduler.py#L343-L351

OK, let's analyse your log.

The following was logged as the command completed (return code -9):

2019-01-23T22:12:29+13:00 ERROR - [jobs-submit cmd] cat /home/oliverh/cylc-run/knob/log/job/1/foo/01/job | cylc jobs-submit --debug --user=osboxes --remote-mode -- /home/osboxes/cylc-run/knob
/log/job 1/foo/01
        [jobs-submit ret_code] -9
        [jobs-submit err]
        2019-01-23T22:12:19+13:00 DEBUG - XXXXX:sleep 30

        killed on timeout (PT10S)

The following was logged by the logic that handled the task submission failure. Note the made-up return code 1 to represent the submission failure of the individual task. (The cylc jobs-submit command may have good/bad submits for many tasks.)

2019-01-23T22:12:29+13:00 ERROR - [jobs-submit cmd] cylc jobs-submit --debug --user=osboxes --remote-mode -- /home/osboxes/cylc-run/knob/log/job 1/foo/01
        [jobs-submit ret_code] 1
        [jobs-submit out] 2019-01-23T22:12:29+13:00|1/foo/01|1
2019-01-23T22:12:29+13:00 INFO - [foo.1] -(current:ready) submission failed at 2019-01-23T22:12:29+13:00

matthewrmshin on 23 Jan 2019

👍1

Arggh, that is pretty convincing - thanks for the excellent explanation.

I didn't find the log redirect logic earlier (before you mentioned StreamHandler, anyway) because option_parsers.py did not occur to me as a place to look!

In that case, I suppose we just have to make sure that any suite log comparison tests are careful to ignore any "extra" err output - which unfortunately can run over multiple lines as above.

hjoliver on 23 Jan 2019

Should we have multiple loggings instead of a single multi-line logging for external commands and its outputs, etc?

matthewrmshin on 23 Jan 2019

👍1

That may be a good idea, to avoid messing with the tests anyway. If we don't need to worry about intentionally multi-line messages...

hjoliver on 23 Jan 2019

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Traceback with cylc play and Cylc 7 workflows

oliver-sanders · 3Comments

cylc review: Server 404 error when trying Display Options for suites contain special characters

kinow · 3Comments

symlink dirs: move into [install] section

oliver-sanders · 3Comments

make main loop more asynchronous

oliver-sanders · 4Comments

Parallel cylc trigger edit problem

dpmatthews · 3Comments