Follow on from #3686
This should involve the creation of interfaces (preferably implemented as Python entry-points to permit in-house extension) for:
We will need to allow these to be configured separately e.g:
    [platforms]
        [[one]]
            hosts = a, b, c
            method = random
        [[two]]
            hosts = d, e, f
            method = cycle
    [platform-groups]
        [[three]]
            platforms = one, two
            method = random
        [[four]]
            platforms = two, one
            method = cycle
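As a rough sketch of how the `method` setting might map onto pluggable selectors, the snippet below uses a plain dictionary as a stand-in for the entry-point lookup. The names `SELECTION_METHODS`, `random_order`, `definition_order` and `order_candidates` are hypothetical, and reading `cycle` as "walk the list in the order it was defined" is an assumption, not something this issue specifies.

```python
import random

def random_order(items, rng=None):
    """Return every item once, in random order."""
    rng = rng or random.Random()
    pool = list(items)  # copy so the configured list is left untouched
    rng.shuffle(pool)
    return pool

def definition_order(items):
    """Return the items in the order they were configured (assumed 'cycle')."""
    return list(items)

# Hypothetical registry; real code might populate this from entry points.
SELECTION_METHODS = {
    "random": random_order,
    "cycle": definition_order,
}

def order_candidates(items, method):
    """Return the hosts (or platforms) in the order they should be tried."""
    try:
        selector = SELECTION_METHODS[method]
    except KeyError:
        raise ValueError(f"unknown selection method: {method}") from None
    return selector(items)
```

The same function can serve both levels of the config: call it once with a group's `platforms` and once with each platform's `hosts`.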
The following selection methods would be desirable but only the interface(s) need(s) to be implemented to close this issue:
In the event that a host is not available (e.g. for job submission) Cylc will need to pick an alternative host from the specified platform or platform group.
Here is a purely illustrative example to explain what is meant by this:
    import socket

    def host_from_platform_group(group):
        # walk the platforms in the group, then the hosts in each platform
        for platform in group:
            for host in platform['hosts']:
                yield host

    def submit_job(job):
        for host in host_from_platform_group(job['platform']):
            try:
                submit(job, host)
                break
            except (PlatformLookupError, HostSelectError, socket.gaierror, SomeTimeoutError):
                # this host is unusable, try the next one
                continue
This functionality will be required in a lot of different places (e.g. remote-init, job submission, job polling) so it would make sense to centralise it.
Issue #3762 will see the global config reloaded at a set interval for the lifetime of the scheduler. Any selection logic should be robust to this: the list of platforms in a group and the list of hosts in a platform are volatile and may change as sys admins move workload around a system.
@wxtim and I had a talk through the "intelligent fallback" part of this issue which has raised some questions...
The basic premise of this logic is that if a host goes down, rather than just failing, Cylc can re-try the operation on another host.
So how do we know if a host is down? There are two options:
1) Test the connection up-front.
* This is what the current rose host-select logic does (it executes a short bash script on the remote host and sends some numbers back).
* Testing connection to every host, every time we want to run a remote command is an overhead.
* Even if SSH has been configured to reuse connections via "ControlPersist" it is still an extra subprocess call.
* These connection test processes could overwhelm the process pool, especially if there are connection issues.
2) Perform the operation (e.g. job submission) and diagnose any failure.
* SSH exits 255 in the event of comms issues so could we forgo connection testing and just use this signal?
* Saves an unnecessary connection/process.
However, not all issues are comms based, for example, what if the platform is not accepting new jobs, say because the queue is full or closed? This seems like the sort of thing Cylc should be able to handle gracefully. In this case there is no point trying another host within the platform, however, it may be worth trying another platform within the group.
Should we:
3) Only handle SSH failure.
* Any other errors are too involved for Cylc to handle.
4) Provide an interface to the batch_sys_handlers allowing them to diagnose failures.
* E.g. by checking return codes or scraping command output.
* Note that some failures may imply a bad host whereas others may imply a bad platform.
Anything I've missed out @wxtim?
Hmm, complicated :grimacing:
I have too many questions about this - might be a good one to discuss at next meeting?
I think that is a fair summary of what we discussed. I have just spent some time looking at the code that was bothering me - it's still not completely clear how job submission might work, but I think ultimately it still comes down to a question of how we can tell a submit failure we want to retry with different platform settings from one we should allow to stand.
I vote for "Only handle SSH failure".
This already provides a massive improvement over Cylc 7.
If we want to try to go further than this then I think this should be a future enhancement (not required for Cylc 8).
Ok, I vote (2,3), don't perform a trial connection, detect 255 error codes.
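A minimal sketch of what "(2,3)" could look like in practice: no trial connection, just run the command and treat exit status 255 as the signal to move to the next host. The names `Outcome`, `classify` and `run_with_fallback` are hypothetical, and `run` stands in for whatever actually executes the remote command. Note that a remote command could itself exit 255, so this check cannot fully distinguish the two cases.

```python
from enum import Enum

SSH_ERROR = 255  # ssh exits 255 when ssh itself hit an error

class Outcome(Enum):
    SUCCESS = "success"
    TRY_NEXT_HOST = "try next host"
    FAIL = "fail"

def classify(returncode):
    """Decide what to do based only on the command's exit status."""
    if returncode == 0:
        return Outcome.SUCCESS
    if returncode == SSH_ERROR:
        return Outcome.TRY_NEXT_HOST
    # any other failure is left to stand (option 3)
    return Outcome.FAIL

def run_with_fallback(hosts, run):
    """Try each host in turn; `run(host)` must return an exit status."""
    for host in hosts:
        outcome = classify(run(host))
        if outcome is Outcome.TRY_NEXT_HOST:
            continue
        return host, outcome
    # ran out of hosts
    return None, Outcome.FAIL
```

For example, with `run = lambda host: {"a": 255, "b": 0}[host]`, the call `run_with_fallback(["a", "b"], run)` skips `a` and succeeds on `b`.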
I've got a proof of concept branch where I've hacked the remote_init and host_from_platform methods to allow us to keep trying hosts until we get one we like. I haven't implemented any of the other procedures requiring it though, so the example suite will stall after remote init.
I had a discussion with @oliver-sanders (feel free to edit this post to make it more accurately reflect our discussion) yesterday, where I talked about the fact that centralizing the logic looks a little tricky because you need to store state information, but also to housekeep it. This discussion generated multiple approaches:
We have to preserve the state of the selection process somewhere, either globally accessible (i.e. via a unique id) or locally scoped via some other mechanism.
The state is effectively composed of a list of platforms which have been tried and a list of hosts within each platform which have been tried. The easiest way to preserve the state is probably just to use a generator (as they hold their state within their scope until destroyed). Take for example this reference implementation:
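    def select_generator(platform_group):
        # work on copies so the configured lists are not modified
        platforms = list(platform_group['platforms'])
        while platforms:
            platform = select(platforms, platform_group['method'])
            hosts = list(platform['hosts'])
            while hosts:
                host = select(hosts, platform['method'])
                yield host
                # this host has been tried, do not offer it again
                hosts.remove(host)
            # every host in this platform has been tried
            platforms.remove(platform)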
The question is how to hook that up to the call/callback framework into which it must fit. Simplified version:
    def controller():
        proc_pool.call(
            call_remote_init,
            callback_remote_init
        )

    def call_remote_init(*args):
        pass

    def callback_remote_init(*args):
        pass
Using global state storage it would look something like:
    def call_remote_init(id, *args):
        # note the store is some session/globally scoped object
        if id:
            # retrieve the state from the store
            gen = store[id]
        else:
            id = uuid()  # selection id
            # put the state into the store
            store[id] = select_generator(platform_group)
            gen = store[id]
        try:
            host = next(gen)
        except StopIteration:
            # no available hosts - remote-init failure
            pass
        # ...

    def callback_remote_init(id, *args):
        if returned_a_255_error_code:
            call_remote_init(id, *args)
            return
        else:
            del store[id]
        # ...

This would do the job; however, the pattern would have to be reproduced for each call/callback pair (remote_init, job_submission) and kinda feels messy. The main drawback is that this state store must be housekept (the del store[id] bit), else we will have a memory leak.
If this code were all async we would not need the state store.
    async def remote_init(*args):
        for host in select_generator(platform_group):
            result = await proc_pool.call(...)
            if returned_a_255_error_code:
                continue
            else:
                break
        else:
            # no hosts - remote-init failure
            pass
Nice and clean, but re-writing the subprocesspool is a bit much right now. So how to bridge a call/callback pattern to an async pattern....
    async def remote_init(*args):
        for host in select_generator(platform_group):
            # pass a future object into the call/callback
            future = asyncio.Future()
            call_remote_init(future, *args)
            await future
            if returned_a_255_error_code:
                continue
            else:
                break
        else:
            # no hosts - remote-init failure
            pass

    def call_remote_init(future, *args):
        pass

    def callback_remote_init(future, *args):
        # in the callback mark the future as done, this returns control to remote_init
        future.set_result(None)
I'm not sure how to approach this, pros and cons. The async approach might be nice, however, the async code doesn't currently reach down very far from the Scheduler so would involve adding a lot of async statements along the call stack just to be able to call remote_init in that way.
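To make the future-based bridge concrete, here is a small self-contained sketch. `fake_proc_pool_call` is a hypothetical stand-in for the real call/callback process pool (it just schedules the callback on the event loop), and the host names are made up; only the `asyncio` calls are real.

```python
import asyncio

def fake_proc_pool_call(host, callback):
    """Hypothetical stand-in for the call/callback process pool."""
    # pretend that the host named "down" fails with an SSH error
    returncode = 255 if host == "down" else 0
    # invoke the callback later, as a real pool would
    asyncio.get_running_loop().call_soon(callback, returncode)

async def remote_init(hosts):
    """Try each host in turn, returning the first that does not exit 255."""
    loop = asyncio.get_running_loop()
    for host in hosts:
        # the future is the bridge: the callback resolves it,
        # which hands control back to this coroutine
        future = loop.create_future()
        fake_proc_pool_call(host, future.set_result)
        returncode = await future
        if returncode == 255:
            continue
        return host
    # no hosts - remote-init failure
    return None
```

Running `asyncio.run(remote_init(["down", "up"]))` skips the first host and returns `"up"`. The selection state lives in the `for` loop, so there is nothing to housekeep.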
On 22/04 @oliver-sanders said:
Intelligent Host Selection - Job Submission
The idea is that Cylc tries to submit the job to one host; if that fails with a 255 error it tries the next one, and if it runs out of hosts it puts the job into a submit-failed state.
This means the job submission system has to remember the hosts it has previously tried. At the moment on Tim's branch this state is centralised, which makes sense. This way Cylc doesn't try to use hosts which it knows are down. This is good for efficiency and also prevents flooding the system with SSH commands that are sitting around before they hit their timeout.
Problems:
We would have to wipe this global state at some point otherwise hosts which have come back will not be considered for any operation.
If there is a transient network problem all hosts could be added to the blacklist and the workflow would sit around doing nothing until the state is wiped.
Best solution I can come up with is:
Reset this state periodically (use cylc.flow.main_loop.periodical)
AND also reset the state whenever a submit-failed task is retriggered either manually or by a scheduled automatic retry.
Thoughts?
But after a discussion between me and @dpmatthews this morning the following is proposed:
If get_hosts_from_platforms fails because set(hosts) - set(badhosts) == {}, remove platform['hosts'] from badhosts before raising an error. As a result, future submissions using this platform should start with a blank slate. Can you see any issues @oliver-sanders?
Additional question: If get_host_from_platform fails at job-submit, then this is clearly a job-submit failure. What happens if it fails at remote init or file install? Do we return a full-bodied error? Or try to handle it?
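The proposal above could be sketched roughly as follows. The exception name `NoHostsError`, the function signature, and picking the first good host are all assumptions made for illustration; `bad_hosts` is taken to be a shared, mutable set.

```python
class NoHostsError(Exception):
    """Hypothetical error: every host in the platform is marked bad."""

def get_host_from_platform(platform, bad_hosts):
    """Return a host not in `bad_hosts`, or reset and raise if none remain."""
    hosts = platform["hosts"]
    good = [host for host in hosts if host not in bad_hosts]
    if not good:
        # wipe this platform's hosts from the bad list before failing,
        # so the next submission starts with a blank slate
        bad_hosts.difference_update(hosts)
        raise NoHostsError(f"no hosts available from: {hosts}")
    return good[0]
```

With this shape the current operation still fails, but the bad-host state cannot lock a platform out indefinitely after a transient network problem.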
Failed remote init or file install should log the full error, I should think, so the user can see what's gone wrong and either fix it or alert system admins. I don't think we could handle this kind of error automatically?