Drake: Troubleshooting Makefile parallelism for SLURM

Created on 28 Oct 2017 · 26 Comments · Source: ropensci/drake

As described here: https://github.com/wlandau-lilly/drake/issues/115

I am trying to get Makefile parallelism working using SLURM.

First, I get the error Makefile:9: *** missing separator. Stop. when running this:

library(drake)

simulate <- function(n){
  rnorm(n)
  print("simulating 3")
  Sys.sleep(20)
}

my_plan <- workplan(
  primer1 = simulate(20),
  primer2 = simulate(10),
  data1 = primer1 + 1,
  data2 = primer2 + 2,
  result = mean(c(data1, data2))
)

make(
  plan = my_plan,
  targets = c("data1", "data2"), # `primer` is built too
  parallelism = "Makefile",
  jobs = 2,
  prepend = c(
    "#!/bin/bash",
    "#SBATCH -J testing", 
    "#SBATCH -A landcare00063",
    "#SBATCH --time=1:00:00",
    "#SBATCH --cpus-per-task=1",
    "#SBATCH --begin=now",
    "#SBATCH --mem=1G", 
    "#SBATCH -C sb",
    "module load R"
  ),
  recipe_command = "srun Rscript -e 'R_RECIPE'"
)
Makefile:9: *** missing separator.  Stop.

I can't seem to find the makefile itself to see what it's actually producing. Is there a way to produce the makefile only without running it?


kendonB


All 26 comments

make(my_plan, parallelism = "Makefile", args = c("--touch", "--silent"))

I should put that one in the parallelism vignette, thanks for another spot on idea.

Unfortunately, the Makefile is not really human readable. I am using dummy timestamp files to hack Make into only submitting the necessary jobs. It's @krlmlr's idea from wlandau/parallelRemake#4.

For the purposes of this thread, did you only want a better way to troubleshoot, or do you also want a Makefile configured for SLURM? Because I want that too, but I cannot help with that at the moment.

wlandau-lilly on 28 Oct 2017

Sorry, to clarify, I'm trying to troubleshoot the error Makefile:9: *** missing separator. Stop. with the ultimate goal of configuring my makefile for submitting slurm jobs.

I've added the arguments and my example ran; but where is the makefile? The .makefile folder is empty after I run this.

kendonB on 28 Oct 2017

After a passable night's sleep, I think I know what the problem is. GNU Make thinks module load R is a recipe, and it thinks you need a tab to indent it. You may be able to fool it with something like

make(
  plan = my_plan,
  targets = c("data1", "data2"), # `primer` is built too
  parallelism = "Makefile",
  jobs = 2,
  prepend = c(
    "#!/bin/bash",
    "#SBATCH -J testing", 
    "#SBATCH -A landcare00063",
    "#SBATCH --time=1:00:00",
    "#SBATCH --cpus-per-task=1",
    "#SBATCH --begin=now",
    "#SBATCH --mem=1G", 
    "#SBATCH -C sb",
    "TMP=`module load R`"
  ),
  recipe_command = "srun Rscript -e 'R_RECIPE'"
)
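A rough way to see why Make stops at line 9: the sketch below classifies each prepend line approximately the way GNU Make reads a top-level Makefile line. The helper names (classify_line, first_bad_line) are hypothetical, the rules are a simplification of Make's real parser, and it assumes the prepend lines land at the top of the generated Makefile in the order given.

```shell
# Hypothetical helper: a simplified guess at how GNU Make treats a top-level line.
classify_line() {
  case "$1" in
    ''|'#'*) echo "comment"; return ;;
  esac
  # NAME=..., NAME:=..., NAME?=..., NAME+=... look like variable assignments.
  if printf '%s\n' "$1" | grep -Eq '^[A-Za-z_.][A-Za-z0-9_.]*[[:space:]]*[:?+]?='; then
    echo "assignment"
    return
  fi
  case "$1" in
    *:*) echo "rule" ;;
    *)   echo "missing separator" ;;
  esac
}

# Hypothetical helper: print the number of the first line Make would reject.
first_bad_line() {
  n=0
  while IFS= read -r line; do
    n=$((n + 1))
    if [ "$(classify_line "$line")" = "missing separator" ]; then
      echo "$n"
      return 0
    fi
  done
  return 1
}

prepend_original='#!/bin/bash
#SBATCH -J testing
#SBATCH -A landcare00063
#SBATCH --time=1:00:00
#SBATCH --cpus-per-task=1
#SBATCH --begin=now
#SBATCH --mem=1G
#SBATCH -C sb
module load R'

printf '%s\n' "$prepend_original" | first_bad_line   # prints 9
```

Lines starting with # are comments to Make, which is why the #SBATCH lines pass. A line such as TMP=`module load R` passes only because it looks like a variable assignment; getting past the parser does not mean the module is loaded on the compute node.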

But I think you need a special shell.sh, as described here. You can generate a starter with shell_file(). Your shell.sh should probably look something like this.

#!/bin/bash
#SBATCH -J testing
#SBATCH -A landcare00063
#SBATCH --time=1:00:00
#SBATCH --cpus-per-task=1
#SBATCH --begin=now
#SBATCH --mem=1G
#SBATCH -C sb
shift
echo "module load R; $*" | srun

And then you call make(..., prepend = "SHELL=./shell.sh").
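For context on what shift and $* do in that script: GNU Make runs each recipe as $(SHELL) $(.SHELLFLAGS) '<recipe>', and .SHELLFLAGS defaults to -c, so a custom SHELL script receives -c as its first argument and the recipe text as its second. The sketch below imitates that call with hypothetical function names (fake_shell_sh, run_like_make) and pipes into sh rather than srun, on the assumption that the receiving program reads commands from standard input.

```shell
# Hypothetical stand-in for shell.sh, with sh in place of srun.
fake_shell_sh() {
  shift                                      # drop the "-c" that Make passes first
  printf '%s\n' "GREETING=hello; $*" | sh    # prefix a setup step, then run the recipe
}

# Hypothetical stand-in for Make: call the custom shell as SHELL -c '<recipe>'.
run_like_make() {
  fake_shell_sh -c "$1"
}

run_like_make 'echo "$GREETING from the recipe"'   # prints: hello from the recipe
```

sh reads the piped text and runs it, so the setup step and the recipe share one shell. Whether srun accepts a command on standard input in the same way is a separate question that depends on the cluster.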

I have not tried this, but an alternative might be the regular shell.sh file written by shell_file():

#!/bin/bash
shift
echo "module load R; $*" | srun

with

make(
  plan = my_plan,
  targets = c("data1", "data2"), # `primer` is built too
  parallelism = "Makefile",
  jobs = 2,
  prepend = c(
    "#!/bin/bash",
    "#SBATCH -J testing", 
    "#SBATCH -A landcare00063",
    "#SBATCH --time=1:00:00",
    "#SBATCH --cpus-per-task=1",
    "#SBATCH --begin=now",
    "#SBATCH --mem=1G", 
    "#SBATCH -C sb",
    "SHELL=./shell.sh"
  ),
  recipe_command = "srun Rscript -e 'R_RECIPE'"
)

...and you really don't see a Makefile? That's odd, it should write to your working directory at the time you call make(). I don't see how you could get Makefile:9: *** missing separator. Stop. otherwise.

wlandau-lilly on 28 Oct 2017

By the way, if you get it working, I have colleagues from grad school who would really benefit. It would be a great help if you share your solution, maybe here in the parallelism vignette, maybe in an example like Makefile-cluster.

wlandau-lilly on 28 Oct 2017

Alright, we're progressing! Found the makefile; thanks!

I tried creating a shell.sh with this in it:

#!/bin/bash
#SBATCH -J testing
#SBATCH -A landcare00063
#SBATCH --time=1:00:00
#SBATCH --cpus-per-task=1
#SBATCH --begin=now
#SBATCH --mem=1G
#SBATCH -C sb
shift
echo "module load R; $*" | srun

And I get this:

make(
  plan = my_plan,
  targets = c("data1", "data2"), # `primer` is built too
  parallelism = "Makefile",
  jobs = 2,
  recipe_command = "srun Rscript -e 'R_RECIPE'", 
  prepend="SHELL=./shell.sh"
)
check 3 items: print, rnorm, Sys.sleep
import print
import rnorm
import Sys.sleep
check 1 item: simulate
import simulate
srun Rscript -e 'drake::mk(target = "primer1", cache_path = "<wd>/.drake")'
srun Rscript -e 'drake::mk(target = "primer2", cache_path = "<wd>/.drake")'
srun: fatal: No command given to execute.
srun: fatal: No command given to execute.
make: *** [<wd>/.drake/ts/3c356dca4040e3c4] Error 1
make: *** Waiting for unfinished jobs....
make: *** [<wd>/.drake/ts/b3a79b8e12e4bcd5] Error 1

kendonB on 28 Oct 2017

Maybe recipe_command = "srun bash -c Rscript -e 'R_RECIPE'"? I wish I could test it myself.
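One thing to watch with bash -c: only the single argument right after -c is treated as the command string, and any further words become $0, $1, and so on. A small sketch with sh and echo standing in for bash and Rscript (the variable names are hypothetical):

```shell
# Everything after the first word following -c becomes $0, $1, ... of that command.
unquoted_out="$(sh -c echo hello)"      # command string is just "echo"; "hello" becomes $0
quoted_out="$(sh -c 'echo hello')"      # command string is "echo hello"
arg0_out="$(sh -c 'echo "$0"' hello)"   # shows where the stray word ended up

printf 'unquoted: [%s]\n' "$unquoted_out"   # prints unquoted: []
printf 'quoted:   [%s]\n' "$quoted_out"     # prints quoted:   [hello]
printf 'arg0:     [%s]\n' "$arg0_out"       # prints arg0:     [hello]
```

By the same logic, srun bash -c Rscript -e '...' would presumably hand bash the command string Rscript alone, so the whole Rscript call likely needs to be quoted as one argument.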

FWIW, this approach dates back to this blog post. My colleagues and I were using that approach in grad school, and it was super convenient at the time. But then they told me it had apparently stopped working, and by then I had graduated and could no longer access the cluster.

wlandau-lilly on 29 Oct 2017

No shell file required, but might not work: make(..., prepend = c("SHELL=srun", ".SHELLFLAGS= -n1 -n1 bash -c")).

wlandau-lilly on 29 Oct 2017

With recipe_command = "srun bash -c Rscript -e 'R_RECIPE'", I get the same error as above.

With make(..., prepend = c("SHELL=srun", ".SHELLFLAGS= -n1 -n1 bash -c")), and adding my configuration lines to prepend, I get the missing separator error again.

This might help: https://mussolblog.wordpress.com/2013/07/17/setting-up-a-testing-slurm-cluster/

kendonB on 29 Oct 2017

That's unfortunate. If srun accepts commands, there should be a way to tell it that Rscript is one too. I would very much prefer this solution. If you can reproduce it with a single srun Rscript -e 'print(1234)', it would be great if you would push this to Stack Overflow.

Do you still have a module load R line in your prepend? I think that is what is behind the missing separator. But come to think of it, module load R won't be executed on the nodes at all if it is just a line prepended to the Makefile. In fact, all those prepend lines really belong to each individual job submission, ideally direct arguments to srun via the recipe_command. They should have no effect as part of the Makefile.

Thank you for sending the Vagrant example. Unfortunately, copying over my munge key timed out.

sudo scp /etc/munge/munge.key vagrant@server:/home/vagrant/
ssh: connect to host server port 22: Connection timed out
lost connection

With the trouble I'm having installing job schedulers, maybe learning Docker or Vagrant is the next step in all this.

wlandau-lilly on 29 Oct 2017

I can't run srun Rscript -e 'print(1234)' without wrapping it in a shell script, if that's what you were asking for.

It might help to make sure we're on the same page about how I usually do this.

I write a testing.sl file with:

#!/bin/bash
#SBATCH -J testing
#SBATCH -A landcare00063
#SBATCH --time=1:00:00
#SBATCH --cpus-per-task=1
#SBATCH --begin=now
#SBATCH --mem=1G
#SBATCH -C sb
module load R
srun Rscript workflow.R

Then submit the job using:

sbatch testing.sl

The sbatch command reads the configuration commands and submits the srun(s) to the scheduler.

kendonB on 29 Oct 2017

Can testing.sl accept arguments like an ordinary shell script? Maybe something like:

#!/bin/bash
#SBATCH -J testing
#SBATCH -A landcare00063
#SBATCH --time=1:00:00
#SBATCH --cpus-per-task=1
#SBATCH --begin=now
#SBATCH --mem=1G
#SBATCH -C sb
module load R
srun Rscript -e '$1'

with

make(
  your_plan,
  parallelism = "Makefile",
  jobs = 8,
  recipe_command = "sbatch testing.sl 'R_RECIPE'"
)

wlandau-lilly on 29 Oct 2017

On second thought, rather than deal with shell scripts with arguments, it may be better to go back to your earlier attempt with shell.sh and replace srun with sbatch (leaving recipe_command alone). You could try moving the #SBATCH parameters inline with sbatch, or at the top of the Makefile in case I was wrong about that bit.

wlandau-lilly on 29 Oct 2017

I tried your second most recent suggestion and it successfully submits jobs. However, they all failed with the following error:

Error: unexpected '$' in "$"
Execution halted
srun: error: compute-d1-020: task 0: Exited with exit code 1
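The unexpected '$' message is consistent with the single quotes around $1 in testing.sl: inside single quotes the shell does not expand $1, so R would be handed the literal characters $1 to parse. A minimal sketch of the difference (show_quoting is a hypothetical name):

```shell
show_quoting() {
  literal='$1'     # single quotes: the characters $ and 1 are passed through unchanged
  expanded="$1"    # double quotes: the shell substitutes the first argument
  printf '%s\n' "$literal" "$expanded"
}

show_quoting 'print(1234)'
# prints:
# $1
# print(1234)
```

Writing the line as srun Rscript -e "$1" would therefore be the first thing to try, assuming the recipe reaches the script as its first argument.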

All 5 jobs got submitted at once as well, so the solution didn't seem to obey the dependency rules.

I'm not sure I understand your most recent suggestion.

kendonB on 29 Oct 2017

If all 5 jobs got submitted at once, that makes me think we should always be using srun (blocking) rather than sbatch (non-blocking). And as for the $ error, that's probably a minor syntax mistake.

So maybe this?

# testing.sl
#!/bin/bash
#SBATCH -J testing
#SBATCH -A landcare00063
#SBATCH --time=1:00:00
#SBATCH --cpus-per-task=1
#SBATCH --begin=now
#SBATCH --mem=1G
#SBATCH -C sb
module load R
`$1`

# in R
make(
  your_plan,
  parallelism = "Makefile",
  jobs = 8,
  recipe_command = "srun testing.sl 'R_RECIPE'"
)

The later suggestion probably won't work anyway.

wlandau-lilly on 29 Oct 2017

It doesn't seem to register the account when running with srun.

srun testing.sl 'drake::mk(target = "primer1", cache_path = "/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.drake")'
srun testing.sl 'drake::mk(target = "primer2", cache_path = "/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.drake")'
srun: error: Invalid account used
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified
srun: error: Invalid account used
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified
make: *** [/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.drake/ts/b3a79b8e12e4bcd5] Error 1
make: *** Waiting for unfinished jobs....
make: *** [/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.drake/ts/3c356dca4040e3c4] Error

Should it not include multiple srun commands within the .sl file and run with sbatch at the terminal?
See Running Multiple Parallel Jobs Sequentially here: http://www.nersc.gov/users/computational-systems/cori/running-jobs/example-batch-scripts/

kendonB on 30 Oct 2017

Then maybe SLURM doesn't see the #SBATCH args. Maybe revert back and try different ways to quote $1? Or not quote $1 at all?

# testing.sl
#!/bin/bash
#SBATCH -J testing
#SBATCH -A landcare00063
#SBATCH --time=1:00:00
#SBATCH --cpus-per-task=1
#SBATCH --begin=now
#SBATCH --mem=1G
#SBATCH -C sb
module load R
srun Rscript -e '\"$1\"' # Maybe play around here.

# in R
make(
  your_plan,
  parallelism = "Makefile",
  jobs = 8,
  recipe_command = "sbatch testing.sl 'R_RECIPE'"
)

wlandau-lilly on 30 Oct 2017

The above runs, but again it submits all 5 jobs at once. I tried a bunch of permutations of '\"$1\"' and nothing worked. I think at this point the best way forward is for you to get a test SLURM scheduler working for yourself, as the back and forth is quite inefficient. I'm sorry I wasn't able to get this all the way to the end!

kendonB on 30 Oct 2017

Yeah, that sounds like the best plan. I really am trying:

Resources that I tried but could not get to work:

plus several forums. I still get

$ slurmd
slurmd: fatal: Frontend not configured correctly in slurm.conf.  See man slurm.conf look for frontendname.

When I do get SLURM, probably the first thing I will do is test the --wait flag for sbatch, as in recipe_command = "sbatch --wait testing.sl 'R_RECIPE'".
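Why blocking matters: Make treats a target as built the moment its recipe command returns, so a submit command that returns immediately lets dependent targets start before their inputs exist. The sketch below simulates this without SLURM; job, submit_nonblocking, submit_blocking and run_two_step are hypothetical names, with a background process standing in for a plain sbatch submission and a foreground call standing in for srun or sbatch --wait.

```shell
job() {                                     # stands in for the work a cluster job does
  sleep 1
  echo "$1 done" >> "$2"
}
submit_nonblocking() { job "$1" "$2" & }    # returns immediately, like plain sbatch
submit_blocking()    { job "$1" "$2"; }     # returns when the work is finished, like srun

run_two_step() {
  submit="$1"
  log="$(mktemp)"
  "$submit" upstream "$log"
  # Make launches the dependent recipe as soon as the upstream recipe returns.
  if grep -q "upstream done" "$log"; then
    echo "downstream ran after upstream finished"
  else
    echo "downstream started too early"
  fi
  wait                                      # let any background job finish before cleanup
  rm -f "$log"
}

run_two_step submit_nonblocking   # prints: downstream started too early
run_two_step submit_blocking      # prints: downstream ran after upstream finished
```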

wlandau-lilly on 30 Oct 2017


I really need all the help I can get to get SLURM working on Ubuntu 16.04.

wlandau-lilly on 30 Oct 2017


As I mentioned in #115, I got SLURM to run on a Debian VM. (I followed this guide, substituting in my own user name instead of slurm, and setting both the master and node names to Debian64 (hostname of the VM)). The following worked perfectly for me.

library(drake)
load_basic_example()
make(
  my_plan,
  parallelism = "Makefile",
  jobs = 2,
  prepend = c(
    "SHELL=srun",
    ".SHELLFLAGS=-N1 -n1 bash -c"
  )
)

I am simultaneously stoked that something this simple actually worked and bothered that I cannot reproduce everyone's errors. I thought it might be because I listed myself in /etc/sudoers, but then it still worked when I took myself off and tried again. It could be something different about the real cluster environment.
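To make the mechanism concrete: with SHELL and .SHELLFLAGS set this way, Make should build each recipe's command line as the shell, then the flags split on whitespace, then the recipe text as one final argument. The sketch below only prints that argument list; compose_command is a hypothetical helper, the recipe is a placeholder, and .SHELLFLAGS itself needs GNU Make 3.82 or newer as far as I know.

```shell
# Hypothetical helper: show the argv Make would hand to the operating system.
compose_command() {
  shell="$1"
  shellflags="$2"
  recipe="$3"
  # Unquoted $shellflags is split on whitespace, as Make does with .SHELLFLAGS.
  set -- "$shell" $shellflags "$recipe"
  for arg in "$@"; do
    printf '[%s]\n' "$arg"
  done
}

compose_command srun "-N1 -n1 bash -c" "Rscript -e 'print(1234)'"
# prints:
# [srun]
# [-N1]
# [-n1]
# [bash]
# [-c]
# [Rscript -e 'print(1234)']
```

Because the recipe stays a single argument after bash -c, srun launches one bash process per target and blocks until it exits.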

wlandau-lilly on 30 Oct 2017

How would one add the SBATCH configuration in the above?

kendonB on 30 Oct 2017

Command-line arguments to srun should cover it here. For example, I can still set the job name (though it's silly to have so many jobs with the same name).

make(
  my_plan,
  parallelism = "Makefile",
  jobs = 2,
  prepend = c(
    "SHELL=srun",
    ".SHELLFLAGS=-N1 -n1 -J testjob bash -c"
  )
)

squeue showed me that they're all named testjob. But anyway, I pushed it to Stack Overflow.
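For an existing batch script, the options on its #SBATCH lines can be lifted out and placed in .SHELLFLAGS the same way. The sketch below is one possible approach; sbatch_directives_to_flags is a hypothetical helper, and it assumes every option in the script is also accepted by srun, which holds for many sbatch options but not all.

```shell
# Hypothetical helper: print the options from "#SBATCH ..." lines on one line.
sbatch_directives_to_flags() {
  sed -n -e 's/[[:space:]]*$//' -e 's/^#SBATCH[[:space:]][[:space:]]*//p' "$1" |
    paste -s -d ' ' -
}

# Demo input modeled on the testing.sl file from this thread.
demo="$(mktemp)"
cat > "$demo" <<'EOF'
#!/bin/bash
#SBATCH -J testing
#SBATCH --time=1:00:00
#SBATCH --mem=1G
module load R
EOF

echo ".SHELLFLAGS=-N1 -n1 $(sbatch_directives_to_flags "$demo") bash -c"
# prints: .SHELLFLAGS=-N1 -n1 -J testing --time=1:00:00 --mem=1G bash -c
rm -f "$demo"
```

The printed line can then be passed as an element of prepend alongside "SHELL=srun". A step like module load R is not an srun option and would still need to run inside the recipe or in the environment that launches make().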

wlandau-lilly on 30 Oct 2017

@kendonB Please see the response on Stack Overflow. srun should be able to meet our needs.

wlandau-lilly on 30 Oct 2017

Not surprisingly, SLURM arrays are not an option with this approach. The new rslurm package would cover this as a separate special backend. Given the other bottlenecks from drake itself, accommodating this special case may or may not be worth the efficiency gains.

wlandau-lilly on 30 Oct 2017

@kendonB, from what you learned solving #115, do you think #117 could be solved the same way? Is it even worth the time now that you have #115? If you no longer need #117 to work, please let me know. Makefiles with srun seem to work for me, so I would prefer to either troubleshoot more with you or close the issue.

wlandau-lilly on 31 Oct 2017

I had that sb config flag on this one, so this thread would have been a separate problem. Since #115 seems to be working for me, let's just close this issue until someone says they have the same problem.

kendonB on 31 Oct 2017

