On Slack, someone asked about the easiest way to run flux jobs against a subinstance.
We already have 01-enclosing-instance which writes a local_uri and remote_uri to the parent's kvs namespace for the job. However, these values are not accessible from flux jobs, so instead they must be retrieved via flux job info.
Another idea would be to record these values as job annotations, now that we have this feature. That would make the URIs immediately accessible via flux jobs. Here's an example:
#!/bin/bash
# Inform the enclosing instance (if any) of the URIs for this instance
level=$(flux getattr instance-level)
if test "$level" -gt 0; then
    local_uri=${FLUX_URI}
    remote_uri="ssh://$(hostname)/$(echo "$local_uri" | sed 's,^.*://,,')"
    flux --parent job annotate "${FLUX_JOB_ID}" remote_uri "${remote_uri}"
    flux --parent job annotate "${FLUX_JOB_ID}" local_uri "${local_uri}"
fi
ƒ(s=4,d=0,builddir) grondo@asp:~$ flux mini batch -n2 --wrap 'flux mini submit sleep 120; flux mini submit sleep 120; flux mini submit sleep 100; flux queue drain'
50380905906176
ƒ(s=4,d=0,builddir) grondo@asp:~$ flux jobs -o '{id.f58:>12} {user.local_uri:<32} {user.remote_uri:<32}'
JOBID USER.LOCAL_URI USER.REMOTE_URI
ƒPpRR8kWB local:///tmp/flux-GTL4N2/0/local ssh://asp//tmp/flux-GTL4N2/0/local
ƒ(s=4,d=0,builddir) grondo@asp:~$ FLUX_URI=$(flux jobs -no {user.local_uri} ƒPpRR8kWB) flux jobs
JOBID USER NAME ST NTASKS NNODES RUNTIME RANKS
ƒXPkH51 grondo sleep PD 1 - - -
ƒRSKAQf grondo sleep R 1 1 48.81s 0
ƒKNx6cw grondo sleep R 1 1 49.05s 0
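For reference, the remote URI in the script above is just the local URI with its scheme replaced by `ssh://<hostname>/`. A minimal sketch of that transformation as a standalone function follows; the name `remote_uri_from_local` is hypothetical and not part of flux, and the sample URI and hostname are taken from the session above.

```shell
#!/bin/bash
# Hypothetical helper (not a flux command): build an ssh:// URI from a
# local:// URI and a hostname, mirroring the sed expression in the script.
remote_uri_from_local() {
    local local_uri=$1 host=$2
    # Strip everything up to and including the last "://", leaving the socket path.
    local path=${local_uri##*://}
    printf 'ssh://%s/%s\n' "$host" "$path"
}

# Sample values from the session above.
remote_uri_from_local "local:///tmp/flux-GTL4N2/0/local" asp
# prints: ssh://asp//tmp/flux-GTL4N2/0/local
```

The double slash after the hostname is expected, since the socket path itself begins with `/`.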
@grondo:
Neat idea!
From today's mini hackathon with the COVID user, I felt it will go a long way if we have a high level command that can walk the entire instance hierarchy and print out job information hierarchically.
Maybe this trick can be extended to make such hierarchical walk automated...
> I felt it will go a long way if we have a high level command that can walk the entire instance hierarchy and print out job information hierarchically.
I wonder if there should be a different command that lists job information hierarchically? If an instance hierarchy is required, then presumably there are too many jobs for a single instance, so recursive job listing should not be encouraged. If there is a small enough number of jobs that recursive listing will be scalable (and listing all jobs is a requirement), then perhaps a hierarchy of instances wasn't required.
For the cases where a hierarchy is required to handle a large number of jobs, perhaps a more appropriate tool would display aggregate job information instead of listing jobs recursively? (e.g. count of running, pending, completed, failed..).
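As a rough illustration of the aggregate view, the sketch below counts jobs per state from `flux jobs` output read on stdin. The function name `count_job_states` is hypothetical, and the parsing assumes the default header and column order shown earlier in this thread (state in the fourth column) and job names without spaces.

```shell
#!/bin/bash
# Hypothetical helper (not a flux command): summarize job states from
# `flux jobs` default output on stdin. Assumes a header line followed by
# rows whose fourth whitespace-separated column is the state (ST).
count_job_states() {
    awk 'NR > 1 { count[$4]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Sample input copied from the listing earlier in this thread.
count_job_states <<'EOF'
JOBID USER NAME ST NTASKS NNODES RUNTIME RANKS
ƒXPkH51 grondo sleep PD 1 - - -
ƒRSKAQf grondo sleep R 1 1 48.81s 0
ƒKNx6cw grondo sleep R 1 1 49.05s 0
EOF
# prints:
# PD 1
# R 2
```

In practice the input would come from running `flux jobs` against each child instance, with the per-instance counts summed. A real tool would more likely request the state field explicitly through an output format than rely on column positions.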
> I wonder if there should be a different command that lists job information hierarchically?

Yes, for large scale cases, we want this in a different command.
> If there is a small enough number of jobs that recursive listing will be scalable (and listing all jobs is a requirement), then perhaps a hierarchy of instances wasn't required.

Yes. But I was thinking of this mostly as a usability concern. Currently flux mini batch creates a Flux instance (for good reasons), but this is implicit and users would not know it. This can be confusing, although one can argue users can be trained.
From my recent mini hackathon, it was clear that the following was what the user wanted to see right after we ran the first-level flux jobs, but we found there wasn't an easy way. We ended up having our script write FLUX_URI to a file and used flux proxy against it.
ƒ(s=4,d=0,builddir) grondo@asp:~$ flux mini batch -n2 --wrap 'flux mini submit sleep 120; flux mini submit sleep 120; flux mini submit sleep 100; flux queue drain'
ƒ(s=4,d=0,builddir) grondo@asp:~$ FLUX_URI=$(flux jobs -no {user.local_uri} ƒPpRR8kWB) flux jobs
JOBID USER NAME ST NTASKS NNODES RUNTIME RANKS
ƒXPkH51 grondo sleep PD 1 - - -
ƒRSKAQf grondo sleep R 1 1 48.81s 0
ƒKNx6cw grondo sleep R 1 1 49.05s 0
The new syntax to get this is definitely far better than what we did in our recent mini hackathon! From the perspective of regular batch users with no CS background, though, I was thinking something along the lines of flux proxy <JOBID> flux jobs would be simpler. This can be a basis for a user script to customize what they want to see hierarchically.
Of course, this won't work if the JOBID isn't a flux instance. So flux proxy should be able to handle this gracefully.
Related: Do we want to mark a Flux instance job as a special case for flux listing tools? There seem to be certain additional things, like recursive queries, that you want to be able to do on such jobs. One can argue users can easily find that by querying the name field, though. Maybe a wrapper command that takes the JOBID and tells whether it is a Flux instance or not would make scripting easier.
> I was thinking along the line of flux proxy <JOBID> flux jobs would be simpler.
This is not a bad idea! In combination with a solution to #2298, flux proxy could use the "guest exec" support to launch its shell instead of ssh, which would drop the need for passwordless ssh/rsh support to nodes, the requirement for a PAM plugin for access (#2533), and even the need for notifying the parent of the "remote" URI (since only the local uri would be required in this scenario).
flux proxy does require a URI as its argument though, so the usage might have to be flux proxy jobid://<jobid>, which is a bit unfortunate. However, perhaps we could add a porcelain command to wrap flux proxy in this case to hide that from users.
Note also that flux proxy is going to be pretty heavyweight for running single commands. I wonder, if we had the flux exec --jobid support described in #2298, whether the shell exec plugin could somehow optionally set the correct FLUX_URI when it has spawned a child instance of Flux (the shell plugin could "watch" for the child instance to register its local_uri in the parent kvs namespace). Then flux exec --jobid=JOBID flux jobs would work as expected. (Again, perhaps a job-specific porcelain command would be warranted here.)
> Do we want to mark a Flux instance job as a special case for flux listing tools? There seem to be certain additional things, like recursive queries, that you want to be able to do on such jobs. One can argue users can easily find that by querying the name field, though. Maybe a wrapper command that takes the JOBID and tells whether it is a Flux instance or not would make scripting easier.
This can be done currently by checking flux job info JOBID guest.flux.local_uri. Alternatively, if we changed local_uri from a kvs value to an annotation, then any job that has the uri annotation is a child instance. (It would actually be nice to use an ephemeral annotation, if that were supported, because there is no use in storing the local and remote URIs in the kvs after the job has exited.)
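A minimal sketch of how the check above could back a small wrapper is shown below. Both function names (`job_is_instance`, `job_proxy`) are hypothetical and not existing flux commands. The sketch assumes `flux job info JOBID guest.flux.local_uri` exits non-zero when the key is absent, and that the stored value may come back as a JSON-quoted string.

```shell
#!/bin/bash
# Hypothetical helpers; neither is an existing flux command.

# Succeed if JOBID has registered a local URI, i.e. it looks like a Flux instance.
job_is_instance() {
    flux job info "$1" guest.flux.local_uri >/dev/null 2>&1
}

# Run a command inside the instance behind JOBID, e.g.: job_proxy JOBID flux jobs
job_proxy() {
    local jobid=$1 uri
    shift
    if ! uri=$(flux job info "$jobid" guest.flux.local_uri 2>/dev/null); then
        echo "job_proxy: $jobid does not appear to be a Flux instance" >&2
        return 1
    fi
    # Assumption: the value may be a JSON-quoted string; strip surrounding
    # double quotes if present.
    uri=${uri#\"}
    uri=${uri%\"}
    flux proxy "$uri" "$@"
}
```

Because this uses the local URI, it only applies where that socket is reachable, such as on the node hosting the child instance. It also keeps the cost of flux proxy for single commands that was noted above.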