I'm working on an experimental plugin for Slurm to launch a Flux instance along with every Slurm job. To facilitate adding the correct FLUX_URI to the environment of a user's job (and all job steps), it would be nice to be able to use a predetermined FLUX_URI based on the Slurm jobid.
This can be accomplished via the local-uri broker attribute. However, this inconveniently requires that the local-uri (which is a directory) be pre-created on every node where a broker will run. This means the plugin must first pdsh or srun a mkdir across all nodes of a job, then launch the Flux instance with srun flux start ....
Ideally, flux-broker would create missing components of local-uri when this broker attribute is set on the command line, as it does when a random local-uri/rundir path is chosen internally. It may be that there is a good reason this can't be done easily in the broker (e.g. race conditions on mkdir), and if it is a problem we can look at other methods to use a predetermined URI.
Allowing flux-broker to choose a random path for local-uri will probably not work because our plugin will not be able to easily query the chosen FLUX_URI to propagate to the job tasks. Furthermore, the FLUX_URI would be different on every node, complicating the task of transparently allowing access to Flux for all job tasks.
It looks like local-uri is currently set to broker.rundir by default, unless the local-uri broker attribute is explicitly overridden on the command line.
If broker.rundir attr is set, then it is validated to be a directory with write permission. If not set by the user, the default is to create a tmpdir with mkdtemp(3), and to schedule the directory for removal.
In either case, if local-uri broker attribute is not overridden by the user, the local-uri == broker.rundir.
Note one side-effect here: if a custom local-uri is chosen, the directory has to be pre-created and it will not be auto-removed at instance shutdown.
The eventual solution I'd prefer here is to allow broker.rundir to be set on the commandline and request that the directory be created if it does not exist and schedule the directory for auto-cleanup. If the directory was created by at least one broker it will be auto-removed at instance shutdown, otherwise the existing behavior is preserved.
I'll go ahead and try to make the above trivial change.
Yup, as discussed, that sounds good. Just mkdir() the rundir, and if that succeeds, add it to the cleanup stack; if it fails with EEXIST, ignore the error and proceed. In both cases allow the stat, S_ISDIR, and S_IRWXU checks to take place.
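The mkdir-then-validate logic described above could look roughly like the following. This is only a sketch: rundir_ensure() is a hypothetical helper name, not the broker's actual create_rundir(), and the choice of errno values for the failed checks is an assumption. The caller would push the directory onto the cleanup stack only when *created comes back true.

```c
#define _POSIX_C_SOURCE 200809L /* for mkdir(2) and stat(2) declarations */
#include <errno.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper (not the broker's real code).
 * Try to create 'path'.  EEXIST is not an error: an existing directory
 * is used as-is.  Either way the path is then validated to be a
 * directory with owner rwx permission.
 * Returns 0 on success, -1 on failure with errno set.
 * '*created' is true only if this call made the directory, so the
 * caller can decide whether to schedule it for removal.
 */
int rundir_ensure (const char *path, bool *created)
{
    struct stat sb;

    *created = false;
    if (mkdir (path, 0700) == 0)
        *created = true;
    else if (errno != EEXIST)
        return -1;
    if (stat (path, &sb) < 0)
        return -1;
    if (!S_ISDIR (sb.st_mode)) {
        errno = ENOTDIR; /* assumption: errno choice is illustrative */
        return -1;
    }
    if ((sb.st_mode & S_IRWXU) != S_IRWXU) {
        errno = EPERM; /* assumption: errno choice is illustrative */
        return -1;
    }
    return 0;
}
```

Because mkdir() either succeeds for exactly one caller or fails with EEXIST for the rest, several brokers racing on the same path would all proceed, and only the one that actually created the directory would register it for cleanup.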
Oops. Of course, this can't work with multiple brokers per node, since each broker's connector-local module wants to create a socket in ${broker.rundir}/local. In fact, setting broker.rundir is how flux-start seems to support multiple ranks per node:
$ src/cmd/flux start -v -s 4
flux-start: 0: flux-broker --setattr=broker.rundir=/tmp/flux-29871-037dgz/0 --setattr=tbon.endpoint=ipc://%B/req
flux-start: 1: flux-broker --setattr=broker.rundir=/tmp/flux-29871-037dgz/1 --setattr=tbon.endpoint=ipc://%B/req
flux-start: 2: flux-broker --setattr=broker.rundir=/tmp/flux-29871-037dgz/2 --setattr=tbon.endpoint=ipc://%B/req
flux-start: 3: flux-broker --setattr=broker.rundir=/tmp/flux-29871-037dgz/3 --setattr=tbon.endpoint=ipc://%B/req
Overriding broker.rundir in flux-start selfpmi mode breaks the per-rank directories and causes a hang.
I don't see a great solution here.
Some ideas, none of which seem great:
FLUX_RANK also be set, or different FLUX_URI per rank?)
broker.rundir as is already done for endpoints. Then we could start flux with something like flux start -o -S,broker.rundir=/tmp/xxzzyy/%R to get per-rank directories (when necessary), but still know beforehand what the FLUX_URI for each rank will be...
Sorry, no better ideas yet.
I hate to add a new attribute, but what if we added an optional instance.rundir that is presumed shared by an entire session (per node, of course). If set, then broker.rundir is set to ${instance.rundir}/${rank}. The instance rundir follows the logic set out above, i.e. it is created and cleaned up if it doesn't exist during broker startup. If instance.rundir isn't set then it could either be ignored, in which case broker.rundir behaves as it does now, or a random instance.rundir could be created with mkdtemp(3) (though that doesn't seem quite right since it won't really be shared by the flux instance).
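To make the proposed layout concrete, here is a minimal sketch of deriving a per-rank broker.rundir from a shared instance.rundir. The function name rundir_for_rank() is hypothetical, and instance.rundir is only the attribute proposed above, not an existing one.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: build "${instance.rundir}/${rank}" in 'buf'.
 * Returns 0 on success, -1 if formatting fails or the result does
 * not fit in 'len' bytes (including the terminating NUL).
 */
int rundir_for_rank (const char *instance_rundir,
                     unsigned int rank,
                     char *buf,
                     size_t len)
{
    int n = snprintf (buf, len, "%s/%u", instance_rundir, rank);

    if (n < 0 || (size_t)n >= len)
        return -1;
    return 0;
}
```

With instance.rundir set to /tmp/xxzzyy, rank 3 would get /tmp/xxzzyy/3, so a launcher that knows the instance directory and a task's rank could compute the matching path ahead of time.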
Perhaps the instance.rundir attr name is a bit confusing, since the directory is really only shared per node. Maybe broker.datadir or something?
Oh, blah. The broker doesn't have its own rank before it needs to create rundir.
Well, maybe needs is too strong a word. Currently rundir is created before booting with PMI, but I'm not sure that is strictly necessary. I'll try moving create_rundir() to after we have the broker rank available from PMI.
Ok, @garlick set me straight. The broker gets its rank early in PMI boot, but as part of PMI, brokers need to exchange endpoints, and endpoint creation may require that broker.rundir be set (if %B is used in the tbon.endpoint string).
So, the next approach to try is to create the broker.rundir during PMI boot after the broker has its rank, but before the TBON endpoint is created.
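The ordering constraint can be illustrated with a small sketch of %B substitution. This is not the broker's actual expansion code; endpoint_expand() is a hypothetical name, and the only thing it is meant to show is that a template containing %B cannot be resolved until the rundir path is known.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: copy 'tmpl' to 'buf', replacing each "%B"
 * with 'rundir'.
 * Returns 0 on success, or -1 if the result does not fit in 'len'
 * bytes, or if the template uses %B while 'rundir' is still NULL
 * (i.e. the rundir has not been created/set yet).
 */
int endpoint_expand (const char *tmpl,
                     const char *rundir,
                     char *buf,
                     size_t len)
{
    size_t used = 0;

    if (len == 0)
        return -1;
    while (*tmpl) {
        if (tmpl[0] == '%' && tmpl[1] == 'B') {
            size_t n;

            if (!rundir)
                return -1; /* endpoint depends on rundir */
            n = strlen (rundir);
            if (used + n >= len)
                return -1;
            memcpy (buf + used, rundir, n);
            used += n;
            tmpl += 2;
        }
        else {
            if (used + 1 >= len)
                return -1;
            buf[used++] = *tmpl++;
        }
    }
    buf[used] = '\0';
    return 0;
}
```

Under this sketch, the boot sequence would be: obtain the rank from PMI, create and set broker.rundir, then expand tbon.endpoint (for example ipc://%B/req) before the endpoints are exchanged.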