Hi!
Testing #3053 with Docker, running 2 containers: one, `cylc`, where I have suites and run local background tasks, and another, `pbs`, where I run PBS tasks. Both use the same version, the latest from 7.8.x: 7.8.1-139-gbe1477.
The cylc kill documentation says:
To kill one or more tasks, "cylc kill REG TASKID ..."; to kill all active
tasks: "cylc kill REG".
But running `cylc kill ${SUITE_NAME}` actually returns an error:
```
testuser@cylc:/opt/cylc$ cylc run pbs1
._.
| | The Cylc Suite Engine [7.8.1-139-gbe1477]
._____._. ._| |_____. Copyright (C) 2008-2019 NIWA
| .___| | | | | .___| & British Crown (Met Office) & Contributors.
| !___| !_! | | !___. _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
!_____!___. |_!_____! This program comes with ABSOLUTELY NO WARRANTY;
.___! | see `cylc warranty`. It is free software, you
!_____! are welcome to redistribute it under certain
*** listening on https://cylc:43090/ ***
To view suite server program contact information:
$ cylc get-suite-contact pbs1
Other ways to see if the suite is still running:
$ cylc scan -n 'pbs1' cylc
$ cylc ping -v --host=cylc pbs1
$ ps -opid,args 2437 # on cylc
testuser@cylc:/opt/cylc$ cylc kill pbs1
Traceback (most recent call last):
  File "/opt/cylc/lib/cherrypy/_cprequest.py", line 679, in respond
    response.body = self.handler()
  File "/opt/cylc/lib/cherrypy/lib/encoding.py", line 230, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/opt/cylc/lib/cherrypy/lib/jsontools.py", line 63, in json_handler
    value = cherrypy.serving.request._json_inner_handler(*args, **kwargs)
  File "/opt/cylc/lib/cherrypy/_cpdispatch.py", line 66, in __call__
    raise sys.exc_info()[1]
HTTPError: (404, 'Missing parameters: items')
Bad return code: https://cylc:43090/kill_tasks?: 404 Client Error: Not Found for url: https://cylc:43090/kill_tasks
```
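The 404 appears to come from CherryPy rather than from Cylc's own logic. As far as the traceback suggests, when calling a handler fails, CherryPy's dispatcher compares the request parameters with the handler's signature and reports any required argument that was not supplied as `404 Missing parameters`. The sketch below is a rough Python 3 illustration of that kind of check, not CherryPy's actual code; `kill_tasks` and `poll_tasks` here are hypothetical stand-ins for handlers with and without a default for `items`.

```python
import inspect


def missing_params(handler, params):
    """Return required parameters of `handler` that are absent from `params`.

    A simplified illustration of the signature check a web framework can
    run when a handler call fails; not CherryPy's implementation.
    """
    sig = inspect.signature(handler)
    return [
        name
        for name, p in sig.parameters.items()
        if p.default is inspect.Parameter.empty
        and p.kind in (inspect.Parameter.POSITIONAL_OR_KEYWORD,
                       inspect.Parameter.KEYWORD_ONLY)
        and name not in params
    ]


def kill_tasks(items):
    """Hypothetical handler: `items` is required, so a bare request fails."""
    return items


def poll_tasks(items=None):
    """Hypothetical handler: `items` defaults to None, so a bare request works."""
    return items


if __name__ == "__main__":
    print(missing_params(kill_tasks, {}))  # ['items']
    print(missing_params(poll_tasks, {}))  # []
```

A request to `kill_tasks` with no `items` query parameter is the case that produces the error above.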
My suite definition:
```
[cylc]
    [[reference test]]
        required run mode = live
        live mode suite timeout = PT5M
[scheduling]
    [[dependencies]]
        graph = a:start => b
[runtime]
    [[a]]
        script = """
            for i in {1..60}
            do
                echo "Running..."
                sleep 2s
            done
        """
        [[[remote]]]
            host = pbs
            retrieve job logs = True
        [[[job]]]
            batch system = pbs
        [[[directives]]]
            -W umask=0077
    [[b]]
        script = cylc poll "$CYLC_SUITE_NAME" 'a'
# from Cylc 7.8.0 code, cylc/tests/cylc-poll/07-pbs
```
I can confirm it's running by looking at the logs in `/var/spool/torque/spool`, and also with `qstat` on the pbs node. Other cylc commands also work fine:
```
testuser@cylc:/opt/cylc$ cylc scan
pbs1 testuser@cylc:43041
testuser@cylc:/opt/cylc$ cylc dump pbs1
last_updated=1554771547.58
namespace definition order=['a', 'b', 'root']
newest cycle point string=1
newest runahead cycle point string=None
oldest cycle point string=1
reloading=False
run_mode=live
state totals={'running': 1, 'succeeded': 1}
states=['running', 'succeeded']
status_string=running to stop at 1
suite_urls={'a': '', 'suite': '', 'b': '', 'root': ''}
time zone info={'hours': 0, 'string_basic': 'Z', 'string_extended': 'Z', 'minutes': 0}
a, 1, running, spawned
b, 1, succeeded, spawned
```
And running `cylc kill pbs1 *` returns `EXIT_CODE=0`, but does not kill any tasks. I think it would be better to print something, such as the server response. Otherwise we only know that the command was successfully sent to the server, not whether anything was actually killed, which means we still need to look at logs, check queue status, etc.
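One plausible reason nothing was killed: an unquoted `*` is expanded by the shell into the entries of the current directory (here `/opt/cylc`, which contains e.g. `lib`) before `cylc kill` sees it, so the server most likely receives task IDs that match nothing. A minimal sketch of the kind of feedback suggested above, assuming a hypothetical server-side kill that returns the IDs it matched and a client that warns and exits non-zero when that list is empty; neither function is Cylc's real API.

```python
import fnmatch
import sys


def kill_matching(active_ids, patterns):
    """Hypothetical server side: return active task IDs matching any pattern."""
    return [
        task_id
        for task_id in active_ids
        if any(fnmatch.fnmatchcase(task_id, pat) for pat in patterns)
    ]


def report(killed, patterns):
    """Hypothetical client side: say what happened and return an exit code."""
    if not killed:
        sys.stderr.write(
            "WARNING: no active tasks matched: %s\n" % " ".join(patterns))
        return 1
    for task_id in killed:
        print("kill requested: %s" % task_id)
    return 0


if __name__ == "__main__":
    active = ["a.1"]
    # Example of what an unquoted * might become after shell expansion.
    expanded = ["bin", "lib"]
    print(report(kill_matching(active, expanded), expanded))
    print(report(kill_matching(active, ["a.*"]), ["a.*"]))
```

With this design the exit code reflects whether anything was actually targeted, not just whether the request reached the server.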
Running `cylc kill pbs1 a.*` works fine, and the PBS logs show that the job was terminated:
```
Running...
Running...
Running...
Running...
Running...
Running...
Running...
==> 16.pbs.ER <==
Terminated
2019-04-09T02:04:06+01:00 CRITICAL - failed/TERM
```
Not sure whether it's necessary for 7.8.x, so setting for later.
This error might result from the conversion, some time ago, to the newer, more powerful client command-line syntax, because the old usage `cylc kill SUITE <no-task-ids>` to kill all tasks was non-standard.
We may just need to document use of appropriate wildcards to kill all tasks at once...
The correct syntax is `cylc kill SUITE 'root.*'`, which applies the kill to all tasks in the root family for all cycle points. I guess `cylc kill SUITE` was deemed too dangerous at some point?
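To illustrate how `'root.*'` reaches every task: a Cylc 7 task ID has the form `NAME.POINT`, where the name part can be a task or family name and the point part can be a glob. The sketch below is a simplified model of that matching, assuming a hypothetical `FAMILIES` map from family name to member tasks (every task belongs to `root`); it is not Cylc's implementation. Quoting the argument on the command line also keeps the shell from expanding the `*`.

```python
import fnmatch

# Hypothetical family membership for the suite above: every task
# inherits from the implicit "root" family.
FAMILIES = {"root": {"a", "b"}}


def match_tasks(task_ids, item, families=FAMILIES):
    """Return the IDs in `task_ids` selected by an item of the form NAME.POINT.

    NAME may be a glob over task names or the name of a family; POINT is a
    glob over cycle points. Simplified model, not Cylc's implementation.
    """
    name_pat, _, point_pat = item.partition(".")
    selected = []
    for task_id in task_ids:
        name, _, point = task_id.partition(".")
        name_ok = (fnmatch.fnmatchcase(name, name_pat)
                   or name in families.get(name_pat, ()))
        if name_ok and fnmatch.fnmatchcase(point, point_pat):
            selected.append(task_id)
    return selected


if __name__ == "__main__":
    ids = ["a.1", "b.1", "a.2"]
    print(match_tasks(ids, "root.*"))  # ['a.1', 'b.1', 'a.2']
    print(match_tasks(ids, "a.*"))     # ['a.1', 'a.2']
```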
So I think we can close this issue with a PR to correct the command help, right?
I don't know well how it was or should be, but I'm happy to be a reviewer :+1:
Hang on a bit... I just tested this on master, and `cylc kill foo` did in fact successfully kill all tasks in suite `foo`.
(But I see your example is on 7.8.x)
Yup! I did not test on master. And I am sure I tested with PBS (not sure if that matters), but can quickly try again tomorrow morning with PBS and with background schedulers.
Reproduced on 7.8.x (seems odd that we've broken this behaviour on 7.8.x but not master...)
Added the doc fix to #3053 (for 7.8.x). Are we happy for this behaviour to change between 7.8.x and master? If so, there is nothing more to do.
The traceback above isn't good.
Also, judging from the code (and the behaviour on master), we did intend `cylc kill SUITE` to kill all tasks in SUITE, which sounds reasonable to me.
In which case, do you want to modify 7.8.x? It is a simple change to the network API (`cylc.network.httpserver`): the handler logic for kill should just match that of poll when a `None` value is given for `items`.
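A minimal sketch of the change described above, assuming (as the 404 suggests) that the 7.8.x kill handler declares `items` without a default: give `items` a default of `None`, wrap a single task ID in a list, and treat `None` as "all active tasks". The names `normalize_items` and `select_tasks` are hypothetical, and this is not the actual patch.

```python
import fnmatch


def normalize_items(items=None):
    """Mirror poll's handling: keep None, wrap a single task ID in a list."""
    if items is not None and not isinstance(items, list):
        items = [items]
    return items


def select_tasks(active_ids, items=None):
    """Return the task IDs a kill request should act on.

    items=None (no task IDs on the command line) selects every active task;
    otherwise each item is treated as a glob over NAME.POINT task IDs.
    """
    items = normalize_items(items)
    if items is None:
        return list(active_ids)
    return [
        task_id
        for task_id in active_ids
        if any(fnmatch.fnmatchcase(task_id, item) for item in items)
    ]


if __name__ == "__main__":
    active = ["a.1", "b.1"]
    print(select_tasks(active))         # ['a.1', 'b.1']
    print(select_tasks(active, "b.1"))  # ['b.1']
```

Keeping the `None` check in one place means kill and poll interpret "no task IDs" the same way.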
Yes, I'll do it.
(Reply sent by email to Matt Shin's comment of Wed, 10 Apr 2019, 08:59.)
Done.
Closed by #3100.
Added the doc fix to #3053 (for 7.8.x).
@matthewrmshin - just checking, presumably you backed out this commit?
Yes