Flux-core: Feature: create MPI tasks for a job step based on the number of allocated nodes

Created on 16 Jan 2020 · 13 Comments · Source: flux-framework/flux-core

It would be nice if I could pass an argument to flux mini run that tells Flux to create MPI tasks based on the number of nodes the job step will run on, something like flux mini run --total-cores=100 --tasks-per-node=2. The intention is that the job step is allocated 100 cores, and two MPI tasks are created on each node where cores are allocated.

My use case is as follows:
From within a single large allocation, I want to run multiple batch scripts (really, any script that launches applications via flux mini) in parallel. I was planning on giving each batch script its own sub-allocation by creating a nested Flux instance. However, it seems that the flux-broker launched to create a sub-allocation requires exactly one task per node; otherwise, strange shared-memory issues result when the batch script's applications try to use MPI.

I guess this kind of argument to flux mini might also be useful for applications that rely on thread-parallelism rather than message-passing, but I mainly just want to use it for creating nested Flux instances.

Oh yeah, and Stephen mentioned that a more general feature might be coming (in jobspec 2.0, I think) related to specifying jobs based on generalized 'resources,' in which case I could declare that I wanted one task per 'resource,' and specify that by 'resource' I mean 'node'; that sounds perfect.

jameshcorbett

All 13 comments

Thanks - it's super useful to get real use cases as issues at this stage!

Can somebody post the V1 jobspec for this, and then we can discuss the tooling? Sounds like @SteVwonder may have some ideas on this, but a stopgap could be to push the jobspec through flux job submit (as JSON).

garlick on 16 Jan 2020

The non-minimal flux-run tool proposed by @trws here might be able to do what you want; however, to @garlick's point, I'm not sure jobspec V1 is capable of expressing this kind of "per resource" request.

grondo on 16 Jan 2020

👍1

I actually think this use case is pretty important in the short term.

Every "batch" job (i.e. every job) in the system instance will be launching a subinstance, and thus the tasks section for all these jobs will need to make use of per_resource or we'll need some other kludge to force the job shell to start one flux-start per allocated node.

Maybe we want to consider tweaking the jobspec V1 to add per_resource, and make the associated changes to the job-shell sooner rather than later?

grondo on 16 Jan 2020

@grondo:

@jameshcorbett posted this issue at my request after our UQP/Flux meeting yesterday. While he may have a less effective workaround, I thought it made sense to capture this real-world use case.

I believe this goes beyond what jobspec V1 supports. @SteVwonder and I talked about this, but we weren't sure if flux-sched could support it. So it would be good to flesh out some details, both for the specification and for the scheduler internals.

dongahn on 16 Jan 2020

> if flux-sched could support that. So it would be good to flesh out some details both for specification and scheduler internals.

I thought scheduler largely ignored the tasks section? Isn't this just a requirement for the job shell to be able to concretize the number and distribution of tasks after allocation?

(Apologies if I've misunderstood the request)

grondo on 16 Jan 2020
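
To illustrate what "concretizing the number and distribution of tasks after allocation" could mean for the example in the original request (100 cores, two tasks per node), here is a minimal Python sketch. It is a toy model, not the job shell's actual code: the function name `concretize_tasks`, the node names, and the shape of the `allocation` mapping are all assumptions made for illustration.

```python
def concretize_tasks(allocation, tasks_per_node):
    """Assign task ranks once the allocation is known.

    allocation: hypothetical mapping of node name -> number of allocated cores.
    Returns a list of (rank, node) pairs, with `tasks_per_node` tasks on every
    node that received at least one core.
    """
    tasks = []
    rank = 0
    for node in sorted(allocation):
        if allocation[node] <= 0:
            continue  # no cores on this node, so no tasks are placed here
        for _ in range(tasks_per_node):
            tasks.append((rank, node))
            rank += 1
    return tasks


# Hypothetical result of scheduling a 100-core request across three nodes.
allocation = {"node0": 40, "node1": 40, "node2": 20}
tasks = concretize_tasks(allocation, tasks_per_node=2)
```

The point of the sketch is that the task count (six here) is not known when the job is submitted; it only falls out once the scheduler has decided how many nodes the cores land on.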

> I thought scheduler largely ignored the tasks section? Isn't this just a requirement for the job shell to be able to concretize the number and distribution of tasks after allocation?

Yes, the scheduler ignores that section. However, suppose the resource section looks like the following (a pseudo spec, from @SteVwonder):

- node: 1+
    - slot: 1
        - core: 200

The scheduler may not be able to make use of that spec. The slot would first match a real resource vertex, say, a node, and then the scheduler would find that "there is no node which has 200 cores underneath it."

And @SteVwonder and I weren't even sure whether a spec like this complies with canonical jobspec.

My proposal is to have some concrete job specs like this and reason about them in terms of Jobspec V1.1 and our scheduler.

dongahn on 16 Jan 2020
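
The matching problem described above can be sketched with a toy model in Python. This is not flux-sched's matching algorithm; the cluster layout (five nodes with 40 cores each) and both function names are hypothetical, chosen only to show why a 200-core slot nested under a single node finds no match while a flat 200-core request can be satisfied.

```python
# Hypothetical cluster: node name -> number of free cores.
cluster = {"node0": 40, "node1": 40, "node2": 40, "node3": 40, "node4": 40}


def match_slot_under_node(cluster, cores_per_slot):
    """Return the nodes that could host one slot of `cores_per_slot` cores.

    With the slot nested under a node, all of the slot's cores must come
    from that one node.
    """
    return [name for name, free in cluster.items() if free >= cores_per_slot]


def match_cores_anywhere(cluster, total_cores):
    """Greedily gather `total_cores` cores across nodes.

    Returns {node: cores_taken}, or None if the cluster is too small.
    """
    picked = {}
    remaining = total_cores
    for name, free in cluster.items():
        if remaining == 0:
            break
        take = min(free, remaining)
        if take:
            picked[name] = take
            remaining -= take
    return picked if remaining == 0 else None


nested = match_slot_under_node(cluster, 200)  # no single node has 200 cores
flat = match_cores_anywhere(cluster, 200)     # cores spread over all five nodes
```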

Ah you are right. The straightforward approach would be to leave off the node resource and just request 200 cores with no slot defined:

 - core: 200

With the tasks section defining per_resource: node.

At least the resource section seems to be valid V1 jobspec (at least I don't see slot as a requirement, though I could have missed it)

Since this is going to be a common case, we should strive to make the common cases simple and the uncommon cases possible.

grondo on 16 Jan 2020

> Ah you are right. The straightforward approach would be to leave off the node resource and just request 200 cores with no slot defined:

I have to test this, but I think this still won't match in flux-sched. I think this is doable by changing how a jobspec is interpreted by our scheduler, though. If this would be the jobspec needed to support this, we need to open up an issue at flux-sched and look into this.

> At least the resource section seems to be valid V1 jobspec (at least I don't see slot as a requirement, though I could have missed it)

We actually discussed this and decided to make slot mandatory (I will try to find the discussion thread). This can change, of course. But as is, flux-sched won't like a resource section without a slot. If this is needed, we need to open up an issue at flux-sched.

dongahn on 16 Jan 2020

Oh, I had forgotten, this would be the way to do it with a slot obviously:

 - slot: 200
   - core: 1

However, the tasks section would then tell the job shell to ignore slots by specifying per_resource.

grondo on 16 Jan 2020

👍1
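
As a rough sketch of the slot-based request discussed above, the following Python builds a jobspec dictionary and serializes it to JSON, in the spirit of the earlier suggestion to push the jobspec through flux job submit as JSON. The overall layout (version, resources, tasks, attributes) follows my understanding of the canonical jobspec form and should be checked against the RFC. The `per_resource` entry under `count` is hypothetical: it is the field proposed in this thread, its exact shape is an assumption, and it is not part of jobspec V1. The helper name `make_jobspec`, the `flux start` command, and the duration value are placeholders.

```python
import json


def make_jobspec(total_cores, command, per_resource_type="node", tasks_per_resource=1):
    """Build a jobspec-like dict: `total_cores` single-core slots, plus a
    HYPOTHETICAL per_resource task count (proposed in this thread, not in V1)."""
    return {
        "version": 1,
        "resources": [
            {
                "type": "slot",
                "count": total_cores,
                "label": "task",
                "with": [{"type": "core", "count": 1}],
            }
        ],
        "tasks": [
            {
                "command": command,
                "slot": "task",
                # HYPOTHETICAL field and shape: one task per allocated node,
                # regardless of how many slots landed on that node.
                "count": {
                    "per_resource": {
                        "type": per_resource_type,
                        "count": tasks_per_resource,
                    }
                },
            }
        ],
        # Placeholder duration in seconds.
        "attributes": {"system": {"duration": 3600}},
    }


jobspec = make_jobspec(200, ["flux", "start"])
jobspec_json = json.dumps(jobspec, indent=2)
```

Under this sketch, the resources section stays something the scheduler can match today (200 one-core slots), and only the tasks section carries the new semantics that the job shell would need to interpret.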

@grondo: Maybe I should test flux-sched a bit and propose the path of least resistance. I don't think it is difficult at all to change it so that it can handle a slot-less resource section. Maybe that's easier than dealing with:

 - slot: 200
   - core: 1

dongahn on 16 Jan 2020

Yeah, I have a Jobspec V2 RFC draft lying around somewhere that I need to dig up and post. It is intended to handle all of the new functionality exposed by flux run, and it also covers this use case. As already mentioned, the big change is the addition of per_resource to the tasks section.

SteVwonder on 16 Jan 2020

👍1

I'm coming to this late, but yeah, I would expect this to be the 200-core request with no node. It occurs to me that I don't think we can easily express "200 cores on some number of nodes with some properties": there is no way to say "I want a total of # resources at this level regardless of expansion" or similar, at least not without explicitly specifying edges. Might be worth a thought, but for now, yeah, you'd have to ask for 200 cores, then use per-resource to get there.

trws on 17 Jan 2020

The only real use-case I had for this (of nesting flux instances) would be covered by #2962.

jameshcorbett on 3 Jun 2020
