Drake: Troubleshooting Makefile parallelism for SLURM

Created on 28 Oct 2017 · 26 Comments · Source: ropensci/drake

As described here: https://github.com/wlandau-lilly/drake/issues/115

I am trying to get Makefile parallelism working using SLURM.

First, I get the error Makefile:9: *** missing separator. Stop.:

library(drake)

simulate <- function(n){
  rnorm(n)
  print("simulating 3")
  Sys.sleep(20)
}

my_plan <- workplan(
  primer1 = simulate(20),
  primer2 = simulate(10),
  data1 = primer1 + 1,
  data2 = primer2 + 2,
  result = mean(c(data1, data2))
)

make(
  plan = my_plan,
  targets = c("data1", "data2"), # `primer` is built too
  parallelism = "Makefile",
  jobs = 2,
  prepend = c(
    "#!/bin/bash",
    "#SBATCH -J testing", 
    "#SBATCH -A landcare00063",
    "#SBATCH --time=1:00:00",
    "#SBATCH --cpus-per-task=1",
    "#SBATCH --begin=now",
    "#SBATCH --mem=1G", 
    "#SBATCH -C sb",
    "module load R"
  ),
  recipe_command = "srun Rscript -e 'R_RECIPE'"
)
Makefile:9: *** missing separator.  Stop.

I can't seem to find the makefile itself to see what it's actually producing. Is there a way to produce the makefile only without running it?

make(my_plan, parallelism = "Makefile", args = c("--touch", "--silent"))

I should put that one in the parallelism vignette, thanks for another spot on idea.

Unfortunately, the Makefile is not really human readable. I am using dummy timestamp files to hack Make into only submitting the necessary jobs. It's @krlmlr's idea from wlandau/parallelRemake#4.

For the purposes of this thread, did you only want a better way to troubleshoot, or do you also want a Makefile configured for SLURM? Because I want that too, but I cannot help with that at the moment.

Sorry, to clarify, I'm trying to troubleshoot the error Makefile:9: *** missing separator. Stop. with the ultimate goal of configuring my makefile for submitting slurm jobs.

I've added the arguments and my example ran; but where is the makefile? The .makefile folder is empty after I run this.

After a passable night's sleep, I think I know what the problem is. GNU Make thinks module load R is a recipe, and it thinks you need a tab to indent it. You may be able to fool it with something like

make(
  plan = my_plan,
  targets = c("data1", "data2"), # `primer` is built too
  parallelism = "Makefile",
  jobs = 2,
  prepend = c(
    "#!/bin/bash",
    "#SBATCH -J testing", 
    "#SBATCH -A landcare00063",
    "#SBATCH --time=1:00:00",
    "#SBATCH --cpus-per-task=1",
    "#SBATCH --begin=now",
    "#SBATCH --mem=1G", 
    "#SBATCH -C sb",
    "TMP=`module load R`"
  ),
  recipe_command = "srun Rscript -e 'R_RECIPE'"
)

But I think you need a special shell.sh, as described here. You can generate a starter with shell_file(). Your shell.sh should probably look something like this.

#!/bin/bash
#SBATCH -J testing
#SBATCH -A landcare00063
#SBATCH --time=1:00:00
#SBATCH --cpus-per-task=1
#SBATCH --begin=now
#SBATCH --mem=1G
#SBATCH -C sb
shift
echo "module load R; $*" | srun

And then you call make(..., prepend = "SHELL=./shell.sh").

I have not tried this, but an alternative might be the regular shell.sh file written by shell_file():

#!/bin/bash
shift
echo "module load R; $*" | srun

with

make(
  plan = my_plan,
  targets = c("data1", "data2"), # `primer` is built too
  parallelism = "Makefile",
  jobs = 2,
  prepend = c(
    "#!/bin/bash",
    "#SBATCH -J testing", 
    "#SBATCH -A landcare00063",
    "#SBATCH --time=1:00:00",
    "#SBATCH --cpus-per-task=1",
    "#SBATCH --begin=now",
    "#SBATCH --mem=1G", 
    "#SBATCH -C sb",
    "SHELL=./shell.sh"
  ),
  recipe_command = "srun Rscript -e 'R_RECIPE'"
)

...and you really don't see a Makefile? That's odd, it should write to your working directory at the time you call make(). I don't see how you could get Makefile:9: *** missing separator. Stop. otherwise.
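Once the Makefile has been located, the line named in the error can be inspected directly. The following is a minimal sketch, not part of drake: show_makefile_line and starts_with_tab are hypothetical helper names, and the demo file is a made-up stand-in for the generated Makefile. GNU Make reports "missing separator" for a line that is neither a rule, an assignment, nor a tab-indented recipe, which is how a bare module load R line looks to it.

```shell
# Print line N of a file with whitespace made visible (tabs appear as \t,
# and the end of the line is marked with $).
show_makefile_line() {
  sed -n "${2}l" "$1"
}

# Succeed only if line N starts with a literal tab, as recipe lines must.
starts_with_tab() {
  line=$(sed -n "${2}p" "$1")
  case "$line" in
    "$(printf '\t')"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Made-up stand-in for a generated Makefile with a prepended shell command.
demo=$(mktemp)
printf '#!/bin/bash\nmodule load R\nall:\n\techo ok\n' > "$demo"

show_makefile_line "$demo" 2   # prints: module load R$
show_makefile_line "$demo" 4   # prints: \techo ok$
starts_with_tab "$demo" 2 || echo "line 2 is not a recipe line"
```

For the error in this thread, the call would be show_makefile_line Makefile 9 in the working directory where make() was run.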

By the way, if you get it working, I have colleagues from grad school who would really benefit. It would be a great help if you share your solution, maybe here in the parallelism vignette, maybe in an example like Makefile-cluster.

Alright, we're progressing! Found the makefile; thanks!

I tried creating a shell.sh with this in it:

#!/bin/bash
#SBATCH -J testing
#SBATCH -A landcare00063
#SBATCH --time=1:00:00
#SBATCH --cpus-per-task=1
#SBATCH --begin=now
#SBATCH --mem=1G
#SBATCH -C sb
shift
echo "module load R; $*" | srun

And I get this:

make(
  plan = my_plan,
  targets = c("data1", "data2"), # `primer` is built too
  parallelism = "Makefile",
  jobs = 2,
  recipe_command = "srun Rscript -e 'R_RECIPE'", 
  prepend="SHELL=./shell.sh"
)
check 3 items: print, rnorm, Sys.sleep
import print
import rnorm
import Sys.sleep
check 1 item: simulate
import simulate
srun Rscript -e 'drake::mk(target = "primer1", cache_path = "<wd>/.drake")'
srun Rscript -e 'drake::mk(target = "primer2", cache_path = "<wd>/.drake")'
srun: fatal: No command given to execute.
srun: fatal: No command given to execute.
make: *** [<wd>/.drake/ts/3c356dca4040e3c4] Error 1
make: *** Waiting for unfinished jobs....
make: *** [<wd>/.drake/ts/b3a79b8e12e4bcd5] Error 1

Maybe recipe_command = "srun bash -c Rscript -e 'R_RECIPE'"? I wish I could test it myself.

FWIW, this approach dates back to this blog post. My colleagues and I were using that approach in grad school, and it was super convenient at the time. But then they told me it had apparently stopped working, and by then I had graduated and could no longer access the cluster.

No shell file required, but might not work: make(..., prepend = c("SHELL=srun", ".SHELLFLAGS= -n1 -n1 bash -c")).

With recipe_command = "srun bash -c Rscript -e 'R_RECIPE'", I get the same error as above.

With make(..., prepend = c("SHELL=srun", ".SHELLFLAGS= -n1 -n1 bash -c")), and adding my configuration lines to prepend, I get the missing separator error again.

This might help: https://mussolblog.wordpress.com/2013/07/17/setting-up-a-testing-slurm-cluster/

That's unfortunate. If srun accepts commands, there should be a way to tell it that Rscript is one too. I would very much prefer this solution. If you can reproduce it with a single srun Rscript -e 'print(1234)', it would be great if you would push this to Stack Overflow.

Do you still have a module load R line in your prepend? I think that is what is behind the missing separator. But come to think of it, module load R won't be executed on the nodes at all if it is just a line prepended to the Makefile. In fact, all those prepend lines really belong to each individual job submission, ideally direct arguments to srun via the recipe_command. They should have no effect as part of the Makefile.
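As a rough illustration of moving the #SBATCH settings onto the srun command line, the sketch below collects the options from a batch script into one string. It is an assumption-laden sketch, untested on a real cluster: sbatch_to_flags is a hypothetical helper, the script contents are shortened from the thread, and whether a given site accepts every option on srun would need checking.

```shell
# Hypothetical helper: gather the options from "#SBATCH ..." lines into one line.
sbatch_to_flags() {
  awk '/^#SBATCH[ \t]/ {
         sub(/^#SBATCH[ \t]+/, "")   # drop the directive prefix
         sub(/[ \t]+$/, "")          # drop trailing whitespace
         out = out (out == "" ? "" : " ") $0
       }
       END { print out }' "$1"
}

# Shortened copy of the batch script from this thread, written to a temp file.
script=$(mktemp)
cat > "$script" <<'EOF'
#!/bin/bash
#SBATCH -J testing
#SBATCH -A landcare00063
#SBATCH --time=1:00:00
module load R
EOF

flags=$(sbatch_to_flags "$script")
echo "srun $flags Rscript -e 'R_RECIPE'"
# prints: srun -J testing -A landcare00063 --time=1:00:00 Rscript -e 'R_RECIPE'
```

The printed string has the shape that recipe_command expects. This sketch does not address module load R, which would still have to run inside each job.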

Thank you for sending the Vagrant example. Unfortunately, copying over my munge key timed out.

sudo scp /etc/munge/munge.key vagrant@server:/home/vagrant/
ssh: connect to host server port 22: Connection timed out
lost connection

With the trouble I'm having installing job schedulers, maybe learning Docker or Vagrant is the next step in all this.

I can't run srun Rscript -e 'print(1234)' without wrapping it in a shell script, if that's what you were asking for.

It might help to make sure we're on the same page about how I usually do this.

I write a testing.sl file with:

#!/bin/bash
#SBATCH -J testing
#SBATCH -A landcare00063
#SBATCH --time=1:00:00
#SBATCH --cpus-per-task=1
#SBATCH --begin=now
#SBATCH --mem=1G
#SBATCH -C sb
module load R
srun Rscript workflow.R

Then submit the job using:

sbatch testing.sl

The sbatch command reads the configuration commands and submits the srun(s) to the scheduler.

Can testing.sl accept arguments like an ordinary shell script? Maybe something like:

#!/bin/bash
#SBATCH -J testing
#SBATCH -A landcare00063
#SBATCH --time=1:00:00
#SBATCH --cpus-per-task=1
#SBATCH --begin=now
#SBATCH --mem=1G
#SBATCH -C sb
module load R
srun Rscript -e '$1'

with

make(
  your_plan,
  parallelism = "Makefile",
  jobs = 8,
  recipe_command = "sbatch testing.sl 'R_RECIPE'"
)

On second thought, rather than deal with shell scripts with arguments, it may be better to go back to your earlier attempt with shell.sh and replace srun with sbatch (leaving recipe_command alone). You could try moving the #SBATCH parameters inline with sbatch, or at the top of the Makefile in case I was wrong about that bit.

I tried your second most recent suggestion and it successfully submits jobs. However, they all failed with the following error:

Error: unexpected '$' in "$"
Execution halted
srun: error: compute-d1-020: task 0: Exited with exit code 1

All 5 jobs got submitted at once as well, so the solution didn't seem to obey the dependency rules.

I'm not sure I understand your most recent suggestion.
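On the $ error reported above: it is consistent with the shell handing the literal characters $1 to Rscript, because single quotes suppress parameter expansion. The sketch below shows the difference without needing R or SLURM. The names show_literal and show_value are hypothetical, and the argument is only an example string.

```shell
# Single quotes: the shell passes the two characters $ and 1 unchanged.
show_literal() {
  printf '%s\n' '$1'
}

# Double quotes: the shell substitutes the first argument, keeping any
# spaces and inner double quotes intact as a single word.
show_value() {
  printf '%s\n' "$1"
}

show_literal 'drake::mk(target = "primer1")'   # prints: $1
show_value   'drake::mk(target = "primer1")'   # prints: drake::mk(target = "primer1")
```

By the same reasoning, a script line such as srun Rscript -e "$1" would pass the recipe through as one argument, whereas '$1' would ask R to parse a bare $1.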

If all 5 jobs got submitted at once, that makes me think we should always be using srun (blocking) rather than sbatch (non-blocking). And as for the $ error, that's probably a minor syntax mistake.

So maybe this?

# testing.sl
#!/bin/bash
#SBATCH -J testing
#SBATCH -A landcare00063
#SBATCH --time=1:00:00
#SBATCH --cpus-per-task=1
#SBATCH --begin=now
#SBATCH --mem=1G
#SBATCH -C sb
module load R
`$1`
# in R
make(
  your_plan,
  parallelism = "Makefile",
  jobs = 8,
  recipe_command = "srun testing.sl 'R_RECIPE'"
)

The latter suggestion probably won't work anyway.
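The blocking versus non-blocking distinction raised above can be simulated without a cluster. In this sketch, job is a hypothetical stand-in for a submitted task, a background process plays the role of sbatch (returns immediately), and a foreground call plays the role of srun (returns when the work is done). Make only respects dependencies if each recipe command behaves like the second case.

```shell
log=$(mktemp)

# Hypothetical stand-in for a cluster job that takes a moment to finish.
job() {
  sleep 1
  echo "$1 finished" >> "$log"
}

# Non-blocking submission, like sbatch: control returns at once,
# so the downstream step starts before its dependency is done.
job primer1 &
echo "data1 started" >> "$log"
wait

# Blocking submission, like srun: the downstream step waits its turn.
job primer2
echo "data2 started" >> "$log"

cat "$log"
# prints:
# data1 started
# primer1 finished
# primer2 finished
# data2 started
```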

It doesn't seem to register the account when running with srun.

srun testing.sl 'drake::mk(target = "primer1", cache_path = "/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.drake")'
srun testing.sl 'drake::mk(target = "primer2", cache_path = "/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.drake")'
srun: error: Invalid account used
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified
srun: error: Invalid account used
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified
make: *** [/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.drake/ts/b3a79b8e12e4bcd5] Error 1
make: *** Waiting for unfinished jobs....
make: *** [/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.drake/ts/3c356dca4040e3c4] Error 

Should it not include multiple srun commands within the .sl file and run with sbatch at the terminal?
See Running Multiple Parallel Jobs Sequentially here: http://www.nersc.gov/users/computational-systems/cori/running-jobs/example-batch-scripts/

Then maybe SLURM doesn't see the #SBATCH args. Maybe revert back and try different ways to quote $1? Or not quote $1 at all?

# testing.sl
#!/bin/bash
#SBATCH -J testing
#SBATCH -A landcare00063
#SBATCH --time=1:00:00
#SBATCH --cpus-per-task=1
#SBATCH --begin=now
#SBATCH --mem=1G
#SBATCH -C sb
module load R
srun Rscript -e '\"$1\"' # Maybe play around here.
# in R
make(
  your_plan,
  parallelism = "Makefile",
  jobs = 8,
  recipe_command = "sbatch testing.sl 'R_RECIPE'"
)

The above runs, but it submits all 5 jobs at once again. I tried a bunch of permutations of '\"$1\"' and nothing worked. I think at this point the best way forward is for you to get a test SLURM scheduler working for yourself, as the back and forth is quite inefficient. I'm sorry I wasn't able to get this all the way to the end!

I really need all the help I can get to get SLURM working on Ubuntu 16.04.

As I mentioned in #115, I got SLURM to run on a Debian VM. (I followed this guide, substituting in my own user name instead of slurm, and setting both the master and node names to Debian64 (hostname of the VM)). The following worked perfectly for me.

library(drake)
load_basic_example()
make(
  my_plan,
  parallelism = "Makefile",
  jobs = 2,
  prepend = c(
    "SHELL=srun",
    ".SHELLFLAGS=-N1 -n1 bash -c"
  )
)

I am simultaneously stoked that something this simple actually worked and bothered that I cannot reproduce everyone's errors. I thought it might be because I listed myself in /etc/sudoers, but then it still worked when I took myself off and tried again. It could be something different about the real cluster environment.
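For readers wondering why those two prepend lines are enough: GNU Make runs each recipe line by calling the program in SHELL with the words in .SHELLFLAGS followed by the recipe line as one argument. The sketch below imitates that call shape. fake_srun is a hypothetical stand-in so the example runs without SLURM, and it only handles options written as single words such as -N1.

```shell
# Hypothetical stand-in for srun so the sketch runs without a cluster:
# skip leading single-word options such as -N1 and -n1, then run the rest.
fake_srun() {
  while [ $# -gt 0 ]; do
    case "$1" in
      -*) shift ;;
      *)  break ;;
    esac
  done
  "$@"
}

# With SHELL=srun and .SHELLFLAGS=-N1 -n1 bash -c, GNU Make launches each
# recipe line roughly as: srun -N1 -n1 bash -c '<recipe line>'
recipe='echo "built target primer1"'
fake_srun -N1 -n1 bash -c "$recipe"
# prints: built target primer1
```

Because the launcher returns only when the recipe has finished, Make can keep its usual dependency ordering while the work itself runs elsewhere.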

How would one add the SBATCH configuration in the above?

Command-line arguments to srun should cover it here. For example, I can still set the job name (though it's silly to have so many jobs with the same name).

make(
  my_plan,
  parallelism = "Makefile",
  jobs = 2,
  prepend = c(
    "SHELL=srun",
    ".SHELLFLAGS=-N1 -n1 -J testjob bash -c"
  )
)

squeue showed me that they're all named testjob. But anyway, I pushed it to Stack Overflow.

@kendonB Please see the response on Stack Overflow. srun should be able to meet our needs.

Not surprisingly, SLURM arrays are not an option with this approach. The new rslurm package would cover this as a separate special backend. Given the other bottlenecks from drake itself, accommodating this special case may or may not be worth the efficiency gains.

@kendonB, from what you learned solving #115, do you think #117 could be solved the same way? Is it even worth the time now that you have #115? If you no longer need #117 to work, please let me know. Makefiles with srun seem to work for me, so I would prefer to either troubleshoot more with you or close the issue.

I had that sb config flag on this one, so this thread would have been a separate problem. Since #115 seems to be working for me, let's just close this issue until someone says they have the same problem.
