Flux-core: traditional batch job submission interface

Created on 28 May 2020  ·  30 Comments  ·  Source: flux-framework/flux-core

I thought there was an open issue on this already, but I could not find it. Please close this as a duplicate if someone finds a previous open issue.

A traditional RM "batch" job in Flux can be thought of as a flux-start of a new instance with the "batch script" as the initial program of the new instance. E.g., currently a more traditional looking batch job on 4 nodes would be submitted via

$ flux mini submit -N4 flux start /path/to/batch/script

We'll want an interface that at least makes the flux start implicit, allowing a user to just specify total resource requirement + their batch script.
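
As a rough illustration of making the flux start implicit, a front end could build the command line shown above from just a node count and a script path. This is a minimal sketch, not an existing Flux tool: `batch_argv` is a hypothetical helper name, and it only constructs the argument vector for the `flux mini submit -N4 flux start ...` form already used in this issue.

```python
import shlex


def batch_argv(nnodes, script, *script_args):
    """Build the argv for a batch-style submission (hypothetical helper).

    Mirrors the manual form from this issue:
        flux mini submit -N<nnodes> flux start <script> [args...]
    """
    if nnodes < 1:
        raise ValueError("nnodes must be at least 1")
    return [
        "flux", "mini", "submit", f"-N{nnodes}",
        "flux", "start", script, *script_args,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without Flux.
    print(shlex.join(batch_argv(4, "/path/to/batch/script")))
```

The user supplies only the resource request and the script; the `flux start` wrapping becomes an implementation detail of the tool.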

Some other items of note (off the top of my head)

  • We may need support for "per resource" task slot assignment for this to work. By default, we'll want to allow the user to request any resources, but then run flux-start only once per node or execution target. Perhaps there is some other hack we could use here (like a "batch script" attribute on the jobspec) to make progress without a revision in jobspec.

  • There is not a very clear distinction between batch and parallel jobs in Flux, as opposed to more traditional RMs like Slurm. I had thought Slurm partitions could be configured to allow only batch job submission (i.e. sbatch, not srun), though I can't find evidence of that now. If there is some similar requirement, we will have to be able to differentiate batch jobs in job-ingest.

  • Slurm creates a copy of batch scripts as part of the job record when using sbatch (at least as far as I remember). This means you can submit a script, edit it and submit again and have two different batch scripts submitted. I'm not sure if this is a requirement, but we might think about how a Flux batch submission tool might support this. (i.e. include a batch script in jobspec itself)

All 30 comments

Maybe we can start out with a new flux-mini command: flux mini batch, which takes similar resource specification arguments to other flux-mini commands, but always runs a single flux-start per node by default. I'm not sure if it would be easiest to extend jobspec to add the per-resource: task slot specification, or to have some stopgap way to tell the job shell to start only a single task.

We could then experiment with other batch job functionality, like copying scripts into jobspec.

@grondo:

Thank you for opening up the ticket.

Just throwing an alternative to flux mini batch:

We can add a new option to flux mini submit and flux job submit which starts up a new nested instance to run the provided command within the nested instance.

flux mini submit --nest myscript.sh

submit is typically the command name used to submit a batch job, and introducing another command to submit a batch job can be confusing IMHO. Maybe a slightly better idea, at the risk of breaking backward compatibility, is to make the --nest option (or similar) the default, so that running a parallel application directly in the current instance requires an option:

flux mini submit --unnest my_app.

but always runs a single flux-start per node by default.

BTW, having this capability within flux-core will be extremely useful to users, making the life of advanced workflow users a lot easier.

Given the aforementioned organization of work, it would be helpful to invoke Flux within a framework indicated as follows:
{flux_path}/flux_wrapper [single/multi] task_file_name
Much like the Unix parallel command, Flux becomes a simple but powerful on-demand, resource controller, providing a simple interface. The user simply assembles the list of tasks, and makes a single call to have them executed. It is lightweight from the execution and preparation perspective - the simplicity of the ASCII tasks file makes them readily generatable from a scripting perspective.
The main extension that is desired from Flux is the ability to nest or recursively call instances of Flux. In other words, a particular task would be a call to Flux.

We don't think we want the user to play with the -N -n -c -g trick that I used in flux-tree to ensure that one broker runs per node to enable this.

submit is typically the command name used to submit a batch job, and introducing another command to submit a batch job can be confusing IMHO.

The command to submit a batch job with Slurm is sbatch.

Maybe a slightly better idea, at the risk of breaking backward compatibility, is to make the --nest option (or similar) the default, so that running a parallel application directly in the current instance requires an option:

I think for flux-mini it makes more sense to keep the current behavior of flux-mini submit. Perhaps for flux-submit, the non-minimal frontend tool, the default should be to create a "batch" job.

I worry that the concept of nest/nonest will be too confusing for the typical user, and there will be too many cases of accidentally running batch script as a parallel job (which wreaks havoc since every flux run in the batch script will be submitted to system instance instead of a batch job instance), or accidental runs of "nested" instances when a simple parallel job was desired.

I think we learned the lesson early in Slurm that combining "parallel job" and "batch job" launch into a single command was a bad idea, and thus separated srun and sbatch. In Flux at least we can get subcommands easily, so we should not fall into the same traps as before.

We don't think we want the user to play with the -N -n -c -g trick that I used in flux-tree to ensure that one broker runs per node to enable this.

Yeah, IIRC (and I'm probably forgetting something), the best we came up with here was a per-resource: key I think (so called late-binding of tasks). Strangely, I cannot find our discussion of this. Probably fumbling the search feature today.

An idea that wouldn't require a bump in the jobspec version would be to add a flag in the attributes section which would tell the job-shell to only launch a single task, overriding any task slot count in _R_. This would be kind of a kludge, but would move us forward much more quickly I think. The flag could also be used to differentiate "batch" jobs from other jobs, if that ends up being necessary.

Here's another difference between batch job (nested instance) submission and normal job submission: the utility will need an easy way to pass options down to flux-start or flux broker.

This would more easily be supported via a specific batch job submission command rather than having to check for a set of commandline flags that only apply if --nest is used.

BTW, I think we could have two commands, a flux batch (or something) that gives a more traditional batch job submission interface, and a flux nest (or something) that is oriented more to users that are aware they are nesting instances and not just launching a more traditional batch request.

@grondo:

Given your Slurm development experience, your judgment should take precedence over mine.

BTW, I think we could have two commands, a flux batch (or something) that gives a more traditional batch job submission interface, and a flux nest (or something) that is oriented more to users that are aware they are nesting instances and not just launching a more traditional batch request.

IMHO, there is a benefit of providing a single command with consistent nesting behavior; rather than multiple commands. flux mini batch or ultimately flux batch works for me.

My current workflow scheduling investigation is "unification of batch and workflow scheduling". Over time, I have become more or less convinced that the main reason we have so many ad hoc workflow schedulers is the lack of this approach. The Flux-UQP co-design team is in the process of quantifying the benefits of this unique approach.

We want users to use their batch scripts with no or minor modifications at every level of the scheduling hierarchy (as system-level batch jobs if this is, for example, a 3D multi-year campaign; as nested batch jobs if users need to run them in high-throughput mode, say for parameter studies). Making that as seamless as possible is the key to pulling this off. As flux mini batch will invoke a new flux instance with appropriate properties (one instance per node), this is pretty close to where we want to be. BTW, testing this was one of the reasons behind my suggestion to bring the UQP team to fluke... But as @garlick said, this can be done later too.

I don't feel like I've accepted that spawning a hierarchy of instances _automatically_ for one user's batch submission is now, or will ever be a good idea. Perhaps I need convincing? But not here - let's keep this focused on the gap in providing a traditional batch interface.

IMHO we need to walk before we can run anyway.

IMHO we need to walk before we can run anyway.

Heh, I read this at first as "we need to walk before we can run away"

I don't feel like I've accepted that spawning a hierarchy of instances automatically for one user's batch submission is now, or will ever be a good idea. Perhaps I need convincing?

Yeah, we probably should set up some study plans: a medium-sized system with a single-level system instance only vs. automatic creation of instances, and see how the system instance fares with respect to I/O and KVS...

We want users to use their batch scripts with no or minor modifications at every level of the scheduling hierarchy.

I agree completely!

I think I was only pointing out that it might be a mistake from a UX perspective to add some option to flux mini submit that tries to turn the command into two commands; one that launches its argument in parallel and one that runs its argument once as the initial program of a new instance. We have found in the past that maintaining a utility such as this is a bit of a pain, and the interface itself ends up being confusing for users.

I think I was only pointing out that it might be a mistake from a UX perspective to add some option to flux mini submit that tries to turn the command into two commands; one that launches its argument in parallel and one that runs its argument once as the initial program of a new instance. We have found in the past that maintaining a utility such as this is a bit of a pain, and the interface itself ends up being confusing for users.

Yes. I am convinced! :-)

Yeah, we probably should set up some study plans: a medium-sized system with a single-level system instance only vs. automatic creation of instances, and see how the system instance fares with respect to I/O and KVS...

To be clear - I'm not arguing against starting batch jobs in a new instance. I'm all in on that one.

To be clear - I'm not arguing against starting batch jobs in a new instance. I'm all in on that one.

I'm sorry. I misunderstood you then.

I don't feel like I've accepted that spawning a hierarchy of instances automatically for one user's batch submission is now, or will ever be a good idea.

Could you elaborate on what you mean by "spawning a hierarchy of instances automatically for one user's batch submission" for me?

Oh, did you mean one _broker_ per node here?

As flux mini batch will invoke a new flux instance with appropriate properties (one instance per node)

I assumed you were talking about spawning a hierarchy of instances per batch job, like flux tree.

Sorry for misunderstanding!

I assumed you were talking about spawning a hierarchy of instances per batch job, like flux tree.

Ah...flux-tree is one utility that a workflow can use to gain the proper level of parallelism in scheduling, but not essential to the unified model.

Oh, did you mean one broker per node here?

Yes. I think this is essential, along with the interface consistency whereby users can use the same submit command (flux mini batch as proposed) at every level.

Summarizing what I think is unanimous, here's a proposal with requirements:

  • Addition of a new, batch-specific command (let's call it flux mini batch for now)
  • flux mini batch runs a script provided on the command line as the initial program of a new instance
  • flux mini batch submits a jobspec which results in one flux-broker run per node by default
  • flux mini batch should provide a way to select resources similar to other flux-mini commands, but without the use of -n, --ntasks
  • flux mini batch will provide a way to pass broker options
  • TBD: flux mini batch may support copying the submitted batch script into jobspec so that subsequent changes to script on disk won't affect job
  • TBD: how to specify in jobspec that only one command per node is to be launched by job-shell
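
To make the two TBD items concrete, here is one possible shape for the jobspec a batch command might generate. It is only a sketch under stated assumptions: the overall layout follows my understanding of the V1 jobspec (resources, tasks, attributes, version), while the `batch` attribute and its `script` and `single_task_per_node` keys are hypothetical names invented for illustration, not something the job shell understands today. The `start_args` parameter is likewise a hypothetical pass-through for broker options.

```python
import json


def batch_jobspec(nnodes, script_path, script_text, start_args=()):
    """Return a jobspec-like dict for a batch job (illustrative only)."""
    return {
        "version": 1,
        "resources": [
            {
                "type": "node",
                "count": nnodes,
                "with": [
                    {
                        # Simplification: one single-core slot per node. A real
                        # tool would let the user request richer resources.
                        "type": "slot",
                        "count": 1,
                        "label": "task",
                        "with": [{"type": "core", "count": 1}],
                    }
                ],
            }
        ],
        "tasks": [
            {
                # The batch script becomes the initial program of a new instance.
                "command": ["flux", "start", *start_args, script_path],
                "slot": "task",
                "count": {"per_slot": 1},
            }
        ],
        "attributes": {
            "system": {
                # Hypothetical section: marks the job as a batch job, carries a
                # copy of the script, and asks for one task per node.
                "batch": {
                    "script": script_text,
                    "single_task_per_node": True,
                }
            }
        },
    }


if __name__ == "__main__":
    spec = batch_jobspec(4, "/path/to/batch/script", "#!/bin/sh\nflux mini run hostname\n")
    print(json.dumps(spec, indent=2))
```

Embedding the script text in the jobspec means later edits to the file on disk would not affect a job that is already queued, which is the behavior described for sbatch earlier in this issue.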

flux mini batch runs a script provided on the command line as the initial program of a new instance

Do we want to support "implicit script" whereby commands are specified through command line? flux mini batch flux run my job?

flux mini batch should provide a way to select resources similar to other flux-mini commands, but without the use of -n, --ntasks

Can this still be supported, with flux mini batch just passing the value through to the script? If the script doesn't specify -n to flux mini submit or flux mini run, the pass-through value will be used instead. I think this has some benefit at the workflow level (Naval Nuclear Lab case...)

Otherwise the proposal LGTM. Thanks @grondo.

@grondo for what it is worth, I really like this proposal. As someone who regularly creates nested instances via flux mini run ... flux start ... I think it would be a significant improvement. It also covers the reason I brought up #2647, which I think has some of the per-resource discussion you were looking for.

Do we want to support "implicit script" whereby commands are specified through command line? flux mini batch flux run my job?

Yeah, sbatch has a --wrap option where the "batch script" becomes

#!/bin/sh
flux run myjob

Maybe something similar?
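
A wrap-style option could be sketched roughly as below. `wrap_command` is a hypothetical helper, not part of any Flux tool; it simply turns a command given on the command line into the two-line script shown above, quoting arguments so they survive the shell.

```python
import shlex


def wrap_command(argv, shell="/bin/sh"):
    """Turn a command line into a minimal batch script (hypothetical helper)."""
    if not argv:
        raise ValueError("no command given to wrap")
    # shlex.join quotes each argument so spaces and metacharacters are preserved.
    return f"#!{shell}\n{shlex.join(argv)}\n"


if __name__ == "__main__":
    print(wrap_command(["flux", "run", "myjob"]), end="")
```

The generated text could then be handled exactly like a user-provided script, so the rest of the batch path would not need a special case.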

Can this still be supported, with flux mini batch just passing the value through to the script? If the script doesn't specify -n to flux mini submit or flux mini run, the pass-through value will be used instead. I think this has some benefit at the workflow level (Naval Nuclear Lab case...)

I was thinking to avoid this kind of inheritance since it can get fussy.
Traditional resource managers require this kludge because they don't provide a nice interface to query the resource set available and thus size jobs appropriately.

However, you do point out a good requirement for the batch script interface that I forgot above.

A lot of batch scripts are reused for different allocation sizes and have the equivalent of what you are suggesting above (flux run without specifying ntasks/nnodes), relying on inheritance from the environment.

I wonder if there is some cleaner way we can support this requirement though?

It also covers the reason I brought up #2647, which I think has some of the per-resource discussion you were looking for.

Ah, thanks for your insight @jameshcorbett!

Maybe we would be able to add a --per-resource option or similar to flux mini run and flux mini submit if we get it kind of working for the flux mini batch case.. Hmm.

@grondo if flux batch existed, I wouldn't have any reason to ask for a --per-resource option to flux mini run. It would still be welcome, of course, though.

A lot of batch scripts are reused for different allocation sizes and have the equivalent of what you are suggesting above (flux run without specifying ntasks/nnodes), relying on inheritance from the environment.

I wonder if there is some cleaner way we can support this requirement though?

It's not exactly the same thing, but I liked this proposal and my guess is that it would cover most of the use-cases.

Maybe something similar?

Sounds good to me. Don't need this right away though.

However, you do point out a good requirement for the batch script interface that I forgot above.

A lot of batch scripts are reused for different allocation sizes and have the equivalent of what you are suggesting above (flux run without specifying ntasks/nnodes), relying on inheritance from the environment.

I wonder if there is some cleaner way we can support this requirement though?

Yes, let's think about this a bit more then.
The case I was thinking about was more for a workflow tool wanting to compose its "execution recipe" through nesting (NNL case) using this same interface. When we have a more clever proposal, let's see how it fares for both the traditional case and the workflow composition case.

I can see why inheritance can be confusing.

But isn't that also the case for all other options? Do we want those resource shape options inherited more than one level?

Do we want those resource shape options inherited more than one level?

I would not think so, though I admit I might not be seeing all the use cases.

I think a batch script/workflow program/other flux use case should not assume it will be at any given level in the tree, and instead we should offer a rich discovery interface that allows users to determine the size and parameters of the instance in which they are running. Assuming that, when nesting, things never "get bigger the further you go in", then parameters that applied at the outer levels won't apply at the inner, and thus automatically inheriting via many environment variables seems like a dangerous approach.

Yeah, IIRC (and I'm probably forgetting something), the best we came up with here was a per-resource: key I think (so called late-binding of tasks). Strangely, I cannot find our discussion of this. Probably fumbling the search feature today.

FWIW, I think this is the discussion you were looking for: https://github.com/flux-framework/rfc/issues/150

Echoing everyone else: I think flux mini batch will be a nice usability improvement :)

@jameshcorbett: A PR is up with a prototype flux mini batch interface if you'd like to comment: #2962

Please also see this request from Naval Nuclear Lab. I will re-forward the user's write-up to our mailing list.

@dongahn, I went back and re-read the use case from Naval Nuclear Lab.
As I understand it, this use case could easily be covered by something like @trws' original capacitor. Since the described input file would need an interpreter anyway, the implementation could look for recursive invocations of itself, and wrap the command in flux mini alloc instead of flux mini run.

Here's a silly proof of concept flux slurp interpreter that reads task files of the form:

NCORES SINGLENODE COMMAND [ARGS...]

Each line is translated to flux mini run -n NCORES COMMAND [ARGS...], unless COMMAND is flux slurp, in which case flux mini alloc -n NCORES flux slurp ARGS... is used. A SINGLENODE value of 1 adds -N1.

I think this satisfies the use case, and shows the usefulness of unified flux-mini utilities.

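#!/usr/bin/env python3

import asyncio
import fileinput
import sys


def command_from_args(ncores, singlenode, *args):
    """Translate one task-file line into a flux mini command line."""
    command = ["flux", "mini"]
    # A recursive "flux slurp" task gets its own allocation (nested instance).
    if args[:2] == ("flux", "slurp"):
        command.append("alloc")
    else:
        command.append("run")
    # Fields arrive as strings from line.split(), so compare against "1".
    if singlenode == "1":
        command.append("-N1")
    command.extend([f"-n{ncores}", *args])
    return command


async def runcmd(command):
    """Run one command to completion and return its exit status."""
    proc = await asyncio.create_subprocess_exec(
        *command,
        stdin=asyncio.subprocess.DEVNULL
    )
    print("started {}".format(" ".join(command)))
    result = await proc.wait()
    print("completed {}".format(" ".join(command)))
    return result


def slurp():
    """Read task files named on the command line (or stdin) into coroutines."""
    tasks = []
    for line in fileinput.input(sys.argv[1:]):
        # Drop trailing comments and whitespace, then skip blank lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        tasks.append(runcmd(command_from_args(*line.split())))
    return tasks


async def main():
    # Launch every task concurrently and wait for all of them to finish.
    await asyncio.gather(*slurp())


asyncio.run(main())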

ƒ(s=4,d=0,builddir) grondo@asp:~/git/flux-core.git$ cat a.slurp
6 0 echo a: job 1
4 0 sleep 3
4 0 echo a: job 2
4 0 echo a: job 3
6 0 flux slurp b.slurp
4 1 echo a: job 4
ƒ(s=4,d=0,builddir) grondo@asp:~/git/flux-core.git$ cat b.slurp
4 0 echo b: job 1
1 0 echo b: job 2
2 0 echo b: job 3
ƒ(s=4,d=0,builddir) grondo@asp:~/git/flux-core.git$ flux slurp a.slurp
started flux mini run -n4 echo a: job 4
started flux mini run -n4 sleep 3
started flux mini run -n6 echo a: job 1
started flux mini run -n4 echo a: job 3
started flux mini run -n4 echo a: job 2
started flux mini alloc -n6 flux slurp b.slurp
a: job 1
a: job 1
a: job 1
a: job 1
a: job 1
a: job 1
a: job 4
a: job 4
a: job 4
a: job 4
completed flux mini run -n4 echo a: job 4
completed flux mini run -n6 echo a: job 1
a: job 2
a: job 2
a: job 2
a: job 2
completed flux mini run -n4 echo a: job 2
a: job 3
a: job 3
a: job 3
a: job 3
completed flux mini run -n4 echo a: job 3
b: job 3
b: job 3
b: job 1
b: job 1
b: job 1
b: job 1
b: job 2
started flux mini run -n4 echo b: job 1
started flux mini run -n2 echo b: job 3
started flux mini run -n1 echo b: job 2
completed flux mini run -n2 echo b: job 3
completed flux mini run -n4 echo b: job 1
completed flux mini run -n1 echo b: job 2
completed flux mini alloc -n6 flux slurp b.slurp
completed flux mini run -n4 sleep 3

@dongahn, I went back and re-read the use case from Naval Nuclear Lab.
As I understand it, this use case could easily be covered by something like @trws' original capacitor. Since the described input file would need an interpreter anyway, the implementation could look for recursive invocations of itself, and wrap the command in flux mini alloc instead of flux mini run.

Very cool, @grondo. Yeah I was thinking along this line but seeing this in real code is worth a thousand words.

Initially I was thinking of using a single command like flux mini submit for both regular tasks and nested tasks as one "unified" interface. But because of the side effect of running multiple brokers per node, I think your approach would be best.

I will look at your proof of concept and comment if I have any feedback. But since something like this will be one of the common WF use cases that only Flux's hierarchical approach can uniquely enable, I plan to add this to our workflow use cases.

Also, seamless hierarchical composition techniques are at the core of the scheduler unification between batch and workflow schedulers that @SteVwonder, @jameshcorbett, Davie Domyancic and I have been working towards. This is a BIG proof of concept!

A huge improvement for something like this would be to submit jobspec directly via the Python API and use job.wait to asynchronously wait for completion of "slurped" tasks instead of invoking subprocesses as is done in the POC.

Fixing #2346 would be a huge benefit here, since we could also (easily) watch job output eventlog from Python. (or we need to abstract that into the Python API)
