Drake: Collect quick high-performance computing examples for drake

Created on 24 Oct 2017 · 20 Comments · Source: ropensci/drake

Related to wlandau-lilly/drake#42 and HenrikBengtsson/future.batchtools#9. I want to collect quickstart examples in inst/examples to help users deploy to SLURM, TORQUE, the Sun/Univa Grid Engine, Docker, etc. I do not have access to many of these systems (though perhaps with Docker I might), so I would greatly appreciate community help.

The examples should be like the Makefile-cluster example. The user should be able to run drake::example_drake("YOUR_EXAMPLE") to generate a small set of supporting files, then go into the folder and run the example quickly and easily.

To-do list for this issue:

  • [ ] [Docker-psock example](https://github.com/wlandau-lilly/drake/tree/master/inst/examples/Docker-psock). See HenrikBengtsson/future#174.
  • [x] [Makefile-cluster example](https://github.com/wlandau-lilly/drake/tree/master/inst/examples/Makefile-cluster). I have used this one in several projects, and I am confident in its current state.
  • [x] [Sun/Univa Grid Engine example](https://github.com/wlandau-lilly/drake/tree/master/inst/examples/sge). I just got it to work.
  • [x] [SLURM example](https://github.com/wlandau-lilly/drake/tree/master/inst/examples/slurm). See HenrikBengtsson/future.batchtools#11.
  • [ ] [TORQUE example](https://github.com/wlandau-lilly/drake/tree/master/inst/examples/torque). Depending on how the installation goes, I may be able to handle this one. More remains to be seen.
  • [ ] Dockerfiles for the Docker-psock, SGE, SLURM, and TORQUE examples, maybe more.

Community help requested

I am requesting help from the community on this one because I am having a hard time getting TORQUE and Docker to work. If you can confirm that the TORQUE example works, please let me know. If you can share Dockerfiles for the examples, especially the ones that use SLURM, TORQUE, or SGE, that would be fantastic.

All 20 comments

I found https://github.com/mllg/batchtools/tree/master/inst/templates, so I will consider this issue solved if I add a Docker example and get some of these cluster examples working. I tried the sge example on the Grid Engine, but I am not experienced enough with this approach to know what is going wrong. @mllg and @HenrikBengtsson, I would greatly appreciate your advice.

# install_github("wlandau-lilly/drake") # This functionality is not yet on CRAN
drake::example_drake("sge")
setwd("sge")
source("run.R")
## Loading required package: future
## 
## Attaching package: 'drake'
## 
## The following object is masked from 'package:future.batchtools':
## 
##     status
## 
## The following object is masked from 'package:future':
## 
##     plan
## 
## check 9 items: 'report.Rmd', knit, summary, suppressWarnings, coefficients, d...
## import 'report.Rmd'
## import knit
## import summary
## import suppressWarnings
## import coefficients
## import data.frame
## import rpois
## import stats::rnorm
## import lm
## check 3 items: simulate, reg1, reg2
## import simulate
## import reg1
## import reg2
## check 9 items: 'report.Rmd', knit, summary, suppressWarnings, coefficients, d...
## Error in cat(job.name) : object 'job.name' not found
## Error in cfBrewTemplate(reg, template, jc) :
##   Error brewing template: Error in cat(job.name) : object 'job.name' not found

I can give it a try tomorrow on my HPCs.

Shooting from the hip: See my torque.tmpl file, in particular line 9.

I've tested it on a slurm cluster and got it running after the following tweaks:

  1. I had to use a custom template. Unfortunately there is no "default" template which works on all systems. The "simple" or default templates should run on standard installations though.
  2. You had a typo in the template file name ('bachtools')

Shooting from the hip: See my torque.tmpl file, in particular line 9.

That's a workaround for older versions of batchtools. job.name should always exist in recent installations of batchtools (https://github.com/mllg/batchtools/blob/master/R/JobCollection.R#L52).

@wlandau-lilly Can I get a session info?

Thank you, @mllg! I just fixed the typo you mentioned. Here is my sessionInfo(). Unfortunately, I will not have direct access to SLURM until I get up and running with Docker (hopefully relatively soon).

R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.9 (Santiago)

Matrix products: default
BLAS: .../R-3.4.0/lib64/R/lib/libRblas.so
LAPACK: .../R-3.4.0/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US       LC_NUMERIC=C         LC_TIME=en_US
 [4] LC_COLLATE=en_US     LC_MONETARY=en_US    LC_MESSAGES=en_US
 [7] LC_PAPER=en_US       LC_NAME=C            LC_ADDRESS=C
[10] LC_TELEPHONE=C       LC_MEASUREMENT=en_US LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.4.0

I put those ellipses in manually rather than give the full paths.

@mllg Also, since there is no one-size-fits-all configuration file, I think it would be enough to get each example working in a Docker image. Do you happen to have any recommendations of simple/stable images on Docker Hub?

@HenrikBengtsson I like the simplicity of your TORQUE file.

I now think we have enough examples in drake (though I am always open to more, up to the 5MB limit for the size of a package). Solving this issue is just a matter of including the right Docker image files to reproduce them for users.

Do you happen to have any recommendations of simple/stable images on Docker Hub?

I used these images to test the ins and outs of some schedulers:

agaveapi/slurm
agaveapi/gridengine
vimalkvn/docker-openlava-lsa

I did not find any images with R installations though.

Here is my sessionInfo()
[...]

Which version of batchtools is installed?
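
One way to answer this from the R console is sketched below. It uses only `requireNamespace()` and `utils::packageVersion()`. The helper name `batchtools_is_recent()` is hypothetical, and the 0.9.6 threshold is simply the version reported as an upgrade in this thread, not an official minimum.

```r
# Hypothetical helper: TRUE if batchtools is installed and at least `min`.
# The default threshold is the version mentioned in this thread, not an
# official requirement.
batchtools_is_recent <- function(min = "0.9.6") {
  requireNamespace("batchtools", quietly = TRUE) &&
    utils::packageVersion("batchtools") >= min
}

if (batchtools_is_recent()) {
  message("batchtools ",
          as.character(utils::packageVersion("batchtools")),
          " is installed.")
} else {
  message("batchtools is missing or older than 0.9.6; consider upgrading.")
}
```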

@mllg good call. I was using 0.9.3, so I just upgraded to 0.9.6. Now I get a different error that seems related to the SGE configuration.

# install_github("wlandau-lilly/drake") # This functionality is not yet on CRAN
drake::example_drake("sge")
setwd("sge")
source("run.R")
## Loading required package: future
## 
## Attaching package: 'drake'
## 
## The following object is masked from 'package:future.batchtools':
## 
##     status
## 
## The following object is masked from 'package:future':
## 
##     plan
## 
## check 9 items: 'report.Rmd', knit, summary, suppressWarnings, coefficients, d...
## import 'report.Rmd'
## import knit
## import summary
## import suppressWarnings
## import coefficients
## import data.frame
## import rpois
## import stats::rnorm
## import lm
## check 3 items: simulate, reg1, reg2
## import simulate
## import reg1
## import reg2
## check 9 items: 'report.Rmd', knit, summary, suppressWarnings, coefficients, d...
## Error in OSError("Listing of jobs failed", res) :
##   Listing of jobs failed (exit code 127);
## cmd: 'squeue -h -o %i -u $USER -t R,S,CG -r'
## output:
## command not found
## Calls: make ... unique -> <Anonymous> -> listJobs -> OSError -> stopf
## Execution halted
## Error in OSError("Listing of jobs failed", res) :
##   Listing of jobs failed (exit code 127);
## cmd: 'squeue -h -o %i -u $USER -t R,S,CG -r'
## output:
## command not found

Highly relevant: mllg/batchtools#148

As per @wlandau-lilly's request, I described my experience here: https://github.com/wlandau-lilly/drake/issues/115

However, I'm now stuck, so, unfortunately, I don't have much to add. I will have some time to troubleshoot more later so might have more to contribute in the coming weeks.

@mllg I got the SGE example to work! The problem was that I was using batchtools_slurm instead of batchtools_sge by mistake.
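
The earlier `squeue` failure is what one would expect when the SLURM backend is selected on a machine that only has Grid Engine tools. A minimal sketch of a guard against that mix-up follows. The helper `detect_schedulers()` and its command-to-scheduler mapping are assumptions made for illustration (a heuristic, since installations differ from site to site), and `sge.tmpl` is a placeholder template name. `future::plan()` and `future.batchtools::batchtools_sge` are the real entry points.

```r
# Heuristic, hypothetical helper: guess which schedulers are available by
# looking for one characteristic command-line tool per scheduler on the PATH.
detect_schedulers <- function() {
  tools <- c(slurm = "squeue", sge = "qconf", torque = "pbsnodes")
  paths <- Sys.which(tools)
  names(tools)[nzchar(paths)]
}

available <- detect_schedulers()
message("Schedulers found: ",
        if (length(available)) paste(available, collapse = ", ") else "none")

# Only pick the SGE backend when SGE tools are actually present.
# "sge.tmpl" is a placeholder for your own batchtools template file.
if ("sge" %in% available &&
    requireNamespace("future.batchtools", quietly = TRUE) &&
    file.exists("sge.tmpl")) {
  future::plan(future.batchtools::batchtools_sge, template = "sge.tmpl")
}
```

On a machine with no scheduler installed, the script only prints a message and leaves the current future plan untouched.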

See the revision to the top of this thread for a new to-do list.

By the way, I managed to install TORQUE locally, but I am still having trouble. For the current TORQUE example, the jobs get submitted now, but they hang indefinitely in the "completed" or "exiting" stages.

$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
41.localhost              ...1eb53a193c2ca wlandau         00:00:00 E batch          
42.localhost              ...f823468009a27 wlandau         00:00:00 E batch 

@mllg and @HenrikBengtsson, anything I could try on Ubuntu 16.04 to fix this? Also, does the template file look okay? I need to make sure all file I/O happens in the project's root directory.
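
For anyone who wants to spot such jobs from R rather than by eye, here is a small sketch that parses rows shaped like the qstat listing above and reports jobs whose state column is E (exiting) or C (completed). The function `parse_qstat()` is hypothetical, the sample rows are copied from the listing with whitespace condensed, and the parser assumes the state is the second-to-last field, which may not hold for other qstat formats.

```r
# Sample rows copied from the qstat listing in this thread
# (whitespace condensed; job names are truncated in the original output).
qstat_rows <- c(
  "41.localhost ...1eb53a193c2ca wlandau 00:00:00 E batch",
  "42.localhost ...f823468009a27 wlandau 00:00:00 E batch"
)

# Hypothetical parser: split each row on whitespace and take the first field
# as the job id and the second-to-last field as the one-letter job state.
parse_qstat <- function(rows) {
  fields <- strsplit(trimws(rows), "[[:space:]]+")
  data.frame(
    job_id = vapply(fields, function(x) x[1], character(1)),
    state = vapply(fields, function(x) x[length(x) - 1], character(1)),
    stringsAsFactors = FALSE
  )
}

jobs <- parse_qstat(qstat_rows)

# Jobs sitting in the exiting (E) or completed (C) state.
stuck <- jobs$job_id[jobs$state %in% c("E", "C")]
print(stuck)
```

On a live system the rows could come from something like `system2("qstat", stdout = TRUE)` with the two header lines dropped, though that has not been tested here.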

And I am still struggling to find a way to run SLURM. I tried to run docker pull agaveapi/slurm; docker run agaveapi/slurm, but I am getting errors:

/usr/lib/python2.6/site-packages/supervisor/options.py:295: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c" argument specifying an absolute path to a configuration file for improved security.
  'Supervisord is running as root and it is searching '
2017-10-29 15:27:45,436 CRIT Supervisor running as root (no user in config file)
2017-10-29 15:27:45,437 INFO supervisord started with pid 1
2017-10-29 15:27:46,439 INFO spawned: 'slurmd' with pid 9
2017-10-29 15:27:46,441 INFO spawned: 'sshd' with pid 10
2017-10-29 15:27:46,443 INFO spawned: 'munge' with pid 11
2017-10-29 15:27:46,443 INFO spawned: 'slurmctld' with pid 12
2017-10-29 15:27:46,452 INFO exited: munge (exit status 0; not expected)
2017-10-29 15:27:46,452 CRIT reaped unknown pid 13)
2017-10-29 15:27:46,530 INFO gave up: munge entered FATAL state, too many start retries too quickly
2017-10-29 15:27:46,531 INFO exited: slurmd (exit status 1; not expected)
2017-10-29 15:27:46,535 INFO gave up: slurmd entered FATAL state, too many start retries too quickly
2017-10-29 15:27:46,536 INFO exited: slurmctld (exit status 0; not expected)
2017-10-29 15:27:47,537 INFO success: sshd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-10-29 15:27:47,537 INFO gave up: slurmctld entered FATAL state, too many start retries too quickly

And I am struggling to install SLURM locally on Ubuntu 16.04. Given how little I can do on my own, I think I will slow the pace of my work on this thread, #115, and #117, and publish the next CRAN release in November with a caveat that these examples are under development.

I added cautionary notes and asked for community help in the caution vignette, parallelism vignette, example README, and the issue template. With these notes, I feel better about releasing the next CRAN update despite HPC trouble.

Closing this issue because:

  • The TORQUE example is just a matter of the right configuration (and a working installation, which I do not quite have), which is really up to the user. The SLURM and SGE examples worked for me and others, and that should be enough.
  • In the Docker-psock example, the problem is that igraph will not install. I plan to comment on a much-repeated thread of the rigraph issue tracker about this.
  • Curating Dockerfiles, especially for high-performance computing plus R, really belongs in batchtools or Rocker. See mllg/batchtools#148.

Will reopen if someone thinks I have not done enough, but I think we are covered. See the caveats in the caution and parallelism vignettes.

I do plan to learn Docker and install SLURM and TORQUE properly, but outside the scope of the development of drake. I did not expect it to be this hard, so the fact that it was pushing the scope of drake did not matter until recently.

I completely agree. Setting up these HPC systems is hard, and testing is even harder because the ins and outs change with installed versions and configuration settings. Henrik's future package provides an excellent front end to all important parallelization backends, so you only need to make sure that drake works locally and in a mode with spawned processes (e.g., socket mode) to ensure that all variables are exported properly (although I'm sure that Henrik already has plenty of unit tests for that).
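
A minimal sketch of that kind of local check, using only the future package and no scheduler: it runs a tiny expression in a background R session (future's `multisession` plan, which uses socket connections) and confirms that a global variable defined in the main session reaches the worker. The variable name `offset` is made up for the example, and this does not exercise drake itself.

```r
# Sketch: run a tiny task in a separate R process and confirm that a global
# variable defined here reaches the worker. Requires the future package.
if (requireNamespace("future", quietly = TRUE)) {
  old_plan <- future::plan(future::multisession, workers = 2)

  offset <- 10                     # global the worker must receive
  f <- future::future(offset + 1)  # evaluated in a background R session
  result <- future::value(f)

  future::plan(old_plan)           # restore the previous plan
  stopifnot(identical(result, 11))
}
```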

I'll look into some more testing using docker images.
