Cylc cat-log seems to leave behind tail processes on suite servers.
The tail processes should be wrapped in a timeout command so that they cannot be left behind.
Huh, I thought we had this fixed. You should be able to run cylc cat-log on host A, which re-invokes itself on suite host B, which runs a tail process on job host C ... and if you Ctrl-C out of it in your terminal on host A, all processes will get killed automatically on hosts B and C as well.
Can you test this manually in your environment? If that works as expected, next step would be to figure out what causes the left-behind processes, perhaps from GUI-invoked "view job log" processes? Then yes, it might be that timeout-wrapping or similar is a good idea.
I can't explain all the instances we see of this, but I'm almost certain they all come from GUI-invoked "view job log" processes. I have found that if I run the GUI, view an output file and then kill the GUI process the cylc cat-log process continues running which is clearly not a good thing.
If we ran the tail command with a timeout command, I think this would solve (or at least work around) the problem, and it would be a sensible thing to do in any case. We could use a fixed timeout of, say, 1 hour initially and add an option to configure the timeout length later if there is demand.
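To illustrate the idea only (this is not Cylc code), here is a minimal Python sketch of putting an upper bound on the lifetime of a follower process. The function name run_with_timeout and the SLEEPER stand-in for "tail -F" are made up for the example; on the command line the coreutils timeout command plays the same role.

```python
import subprocess
import sys

# Hypothetical stand-in for a never-ending "tail -F": a child that sleeps 30 s.
SLEEPER = [sys.executable, "-c", "import time; time.sleep(30)"]


def run_with_timeout(cmd, timeout_s):
    """Run cmd and kill it if it is still running after timeout_s seconds.

    Returns True if the process had to be killed, False if it exited by itself.
    """
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=timeout_s)
        return False
    except subprocess.TimeoutExpired:
        proc.kill()  # make sure nothing is left behind
        proc.wait()  # reap the child so it does not linger as a zombie
        return True
```

Calling run_with_timeout(SLEEPER, 0.5) kills the sleeping child after half a second, which is the behaviour we would want for an orphaned tail.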
If we ran the tail command without a timeout command
With?
I have found that if I run the GUI, view an output file and then kill the GUI process the cylc cat-log process continues running which is clearly not a good thing.
OK, I haven't tested this myself yet, but that in itself is enough to warrant a timeout.
Tentatively put against 7.8.x (maybe 7.8.5) because this causes problems now and it might be an easy fix.
I have found that if I run the GUI, view an output file and then kill the GUI process the cylc cat-log process continues running which is clearly not a good thing.
OK, I haven't tested this myself yet, but that in itself is enough to warrant a timeout.
I just tested this with a local suite (running 7.8.x), and killing the GUI kills the cat-log process and its tail subprocess.
However, I also tested (local suite) that tail command template = "timeout 10 tail -n +1 -F %(filename)s" correctly runs the tail subprocess with timeout (if viewing in the GUI log viewer, it stops tail after the timeout) so we could make that the default tail command template.
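As a rough sketch of what such a template amounts to, assuming it is expanded with Python %-style named substitution (which the %(filename)s placeholder suggests, but which I have not checked against the Cylc source), the helper below turns a template plus a file name into an argument list. build_tail_command is a hypothetical name used only for this example.

```python
import shlex


def build_tail_command(template, filename):
    """Expand a tail command template into an argument list.

    Assumption: the template uses Python %-style named substitution,
    as suggested by the "%(filename)s" placeholder.
    """
    # Quote the file name so that spaces or shell metacharacters
    # survive the split into separate arguments.
    command = template % {"filename": shlex.quote(filename)}
    return shlex.split(command)
```

With the template above and a hypothetical path /tmp/job.err, this yields an argument list that starts with timeout and 10 and ends with the file name, so the timeout wrapper needs no code change, only a different template.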
@dpmatthews - can you give an actual recipe for reproducing the problem? And, does it work with a purely local suite, or do you need a remote job host, and/or have the GUI running on another host as well (i.e., not the suite host)?
I've had another go at reproducing this. I managed to get it to happen via the GUI by viewing a file and then repeatedly changing the file I was viewing. After lots of messing around I've found I can reproduce it quite simply without the GUI.
I just need to have run a trivial suite which runs a single local task (the suite doesn't need to be still running). I then use the cylc cat-log -m t command from a different host to view the job.err file (so that cat-log is run via ssh) and interrupt it (Ctrl-C) as it starts up. If I repeat this multiple times, I eventually get cylc cat-log processes left behind on the suite server, which have to be killed manually.
In my tests I can only get this to happen with an empty job.err file. With job.out (which isn't empty) I don't seem to be able to make it happen.
I can't make much sense of this and it's not a priority. I think I'll try changing our site config to include the timeout. We can think about whether changing the default makes sense later.
Interesting :thinking:
Have you tried your reproducible empty job.err case with the timeout command configured, just to make sure that works?
Yes - the spurious cat-log processes disappear after the timeout so it seems an effective workaround.
Closing as won't fix: we have a workaround (use timeout in the tail command template).
Unfortunately we had to remove the timeout fix. It causes cylc gui to freeze if you try to change which file you are viewing (i.e. change from job.err to job.out, or change the submit number).
Re-opening this as a reminder for the moment.
Adding the timeout command only caused issues with the cylc 7 GUI. Therefore I think we can close this issue if we change the default at Cylc 8 to be:
tail command template = "timeout 12h tail -n +1 -F %(filename)s"
(12h open to debate)
Meeting: close for now as a 7.8.x problem, as only seen in combination with the Cylc 7 GUI