Chapel: Write a low-level slurm interface in Chapel

Created on 26 Jun 2018 · 7 comments · Source: chapel-lang/chapel

As a Chapel user on a slurm-managed HPC system, I want to allocate a set of nodes, query information about those nodes, and launch multilocale programs onto those nodes from within a Chapel program.

This task will involve implementing a simple proof-of-concept Chapel module for interfacing with slurm via the Spawn module.

The final design of this module would be determined in a follow-up task.

Commands included initially would be:

proc salloc() { .. } // Returns array of nodeIDs allocated
proc srun() { .. } // Takes nodeIDs as arguments
proc scancel() { .. }
proc squeue() { .. }
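As a rough usage sketch of these routines (the signatures are not final, and the module and program names below are invented for illustration):

```chapel
// Hypothetical usage of the proposed routines; names and signatures
// are illustrative only and would be settled in the follow-up design task.
use Slurm;  // assumed module name for this proof of concept

var nodes = salloc();                   // array of allocated nodeIDs
srun("./myMultilocaleProgram", nodes);  // launch onto the allocated nodes
squeue();                               // query queue state
scancel();                              // release the allocation
```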

This task meets the needs of a particular user, but also lays the groundwork for us to potentially implement the Chapel launchers in Chapel in the future.

All 7 comments

I assume the Chapel program using this module would need to run on the login node (unless #10042 proves otherwise).

also lays the groundwork for us to potentially implement the Chapel launchers in Chapel in the future.

I don't find this concept all that compelling. The proposal to rewrite the launchers in Python had the goal of making it simpler for sysadmins and customer sites to update launchers when they have their own wrappers around standard technologies that require customization in our launcher code. Writing the launchers in Chapel (a) keeps something that seems "script-like" in a compiled code base, which adds overhead similar to today's C code; (b) requires the sysadmins to use a language they're probably not familiar with; and (c) doesn't make use of any Chapel-specific features (parallelism, locality, etc.). While I buy the "a chance to eat our own dogfood" argument to an extent, I think there are probably better opportunities than this one in terms of fit.

I think another use case is running large experiments.

Every time I need to run an experiment, I end up writing a set of Python scripts that submit a ton of jobs, monitor them, and collect the results (and sometimes plot them). Such scripts generally end up using Python threads to do submission/monitoring/collection concurrently. Having such a module in Chapel (along with faster compile times) would probably lead me to choose Chapel over Python.

Depending upon the urgency and consensus on need for something like this, I can create the basic proof-of-concept and extend it as I find time to work on it.

Thinking about this module a bit more today, I think that the ideal would be not to support a Slurm module but to support a NodeControl module (though I'm not crazy about that name). By analogy, it's been argued by @ben-albrecht and others that users don't want to think about LAPACK routines, but about mathematical operations, hence the LinearAlgebra package. In the same way, it seems like the dream would be to be able to say "get me some nodes", "tell me about those nodes", "start this program on these nodes", etc. rather than just exposing the slurm commands as routines. Then, the routines could map down to slurm, aprun, whatever.

Of course, it makes sense to have a slurm implementation before providing the abstract implementation just as we supported LAPACK before adding LinearAlgebra, but thinking ahead...
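To make the analogy concrete, a NodeControl-style layer might expose generic verbs that each launcher backend implements (all names here are invented for illustration):

```chapel
// Hypothetical abstract interface; a slurm backend would map these
// onto salloc/srun/scancel, a Cray backend onto aprun, etc.
proc getNodes(count: int): [] string { .. }        // "get me some nodes"
proc nodeInfo(nodes: [] string) { .. }             // "tell me about those nodes"
proc launch(cmd: string, nodes: [] string) { .. }  // "start this program on these nodes"
proc releaseNodes(nodes: [] string) { .. }
```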

I assume the Chapel program using this module would need to run on the login node (unless #10042 proves otherwise).

Thinking about this and #10042, I am just not sure whether you can salloc and srun within the same application. AFAIK, salloc starts a bash session on one of the compute nodes (called the bash node/shell node or something like that) from which you can srun. I don't know off-hand whether you can do that within an application. So with the salloc-followed-by-srun workflow, I believe srun doesn't necessarily run from the login node. All that being said, the sbatch workflow is much clearer and should start from the login node.

All these can easily be tested, and I will.
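For reference, the two workflows being compared look roughly like this on the command line (standard slurm options; exact behavior, such as where salloc's shell runs, varies by site configuration):

```
# Interactive workflow: salloc grants an allocation and starts a shell;
# srun then launches job steps within that allocation.
salloc -N 4 --time=00:30:00
srun ./myProgram
exit                    # releases the allocation

# Batch workflow: sbatch is submitted from the login node.
sbatch --nodes=4 job.sh
squeue -u $USER         # monitor the job
```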

Maybe this is going OO where it's not really needed, but would it make more sense for salloc to return an object (e.g., SlurmAllocation) with methods like run and cancel?

class SlurmJobStep {
  proc cancel() { .. }
  proc state() { .. }
}

class SlurmAllocation {
  proc run(): SlurmJobStep { .. }
  proc cancel() { .. }
}

proc salloc(): SlurmAllocation { .. }
proc srun(): SlurmJobStep { .. }
proc squeue() { .. }
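Under that design, a usage sketch might look like this (purely illustrative; none of these signatures are settled):

```chapel
// Hypothetical usage of the object-based API sketched above.
var alloc = salloc();        // acquire a SlurmAllocation
var step = alloc.run();      // start a job step within it
writeln(step.state());       // query the step's status
step.cancel();               // cancel just this step
alloc.cancel();              // release the whole allocation
```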

I have implemented a low-level interface that returns spawned processes of slurm commands:

https://gist.github.com/ben-albrecht/b9fa1b7991d6fc691adf67be39e0df68

Next, I will implement the high-level interface and integrate this code into the repository in the test/ directory.
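The low-level approach can be sketched as spawning the corresponding slurm command with piped output and handing the subprocess back to the caller (a sketch of the idea, not the gist's exact API):

```chapel
use Spawn;

// Each low-level routine just spawns its slurm command with piped
// stdout and returns the subprocess for the caller to inspect.
proc squeue() {
  return spawn(["squeue"], stdout=PIPE);
}

// Caller reads the command's output and waits for it to finish.
var sub = squeue();
var line: string;
while sub.stdout.readline(line) do
  write(line);
sub.wait();
```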

