drake's code of conduct.remotes::install_github("ropensci/drake")) and mention the SHA-1 hash of the Git commit you install.Running: ropensci/drake@ae3a72f
I have a particularly wicked problem. I am building 100 targets that take around 90 minutes each to build. The basic structure of the code is:
.fst in.fstI can easily shorten the targets so that they run quickly. When I do that, I have no problems. However, when I run the full version,
make(myplan, verbose = 3, jobs_preprocess = 8,
jobs = 2, parallelism = "clustermq",
log_progress = TRUE,
lazy_load = "promise",
console_log_file = NULL,
memory_strategy = "lookahead",
caching = "worker",
template = list(
log_file = "make.log",
memory = 5000,
walltime = 250
),
keep_going = TRUE
)
#> β ¦Warning in private$fill_options(...) :
#> Add 'CMQ_AUTH={{ auth }}' to template to enable socket authentication
#> target nzdi_df
#> target nzdi_df_2
2019-09-10 11:47:35.742293 | Master: tcp://mahuika01:6020
2019-09-10 11:47:35.803966 | WORKER_UP to: tcp://mahuika01:6020
2019-09-10 11:47:36.410054 | > DO_SETUP (0.548s wait)
2019-09-10 11:47:36.410445 | token from msg: set_common_data_token
2019-09-10 11:47:36.551556 | > DO_CALL (0.000s wait)
.ββ Attaching packages βββββββββββββββββββββββββββββββββββββββ tidyverse 1.2.1 ββ
β ggplot2 3.2.0 β purrr 0.3.2
β tibble 2.1.3 β dplyr 0.8.3
β tidyr 0.8.3 β stringr 1.4.0
β readr 1.3.1 β forcats 0.4.0
ββ Conflicts ββββββββββββββββββββββββββββββββββββββββββ tidyverse_conflicts() ββ
β tidyr::expand() masks drake::expand()
β dplyr::filter() masks stats::filter()
β tidyr::gather() masks drake::gather()
β dplyr::lag() masks stats::lag()
Attaching package: βlubridateβ
The following object is masked from βpackage:baseβ:
date
[1] 5.952027e-05
[1] 0.0001190405
[1] 0.0001785608
[1] 0.0002380811
[1] 0.0002976013
[1] 0.0003571216
[1] 0.0004166419
[1] 0.0004761621
[1] 0.0005356824
...
[1] 0.9997024
[1] 0.9997619
[1] 0.9998214
[1] 0.999881
[1] 0.9999405
[1] 1 # This is my hacky progress indicator completing
[1] "Writing myfile1.fst completed."
..2019-09-10 13:17:18.947444 | eval'd: drake::cmq_buildtargetmetadepslayoutconfig
Error in clustermq:::worker("tcp://mahuika01:6020", verbose = TRUE) :
Timeout reached, terminating
Execution halted
and
[1] 0.9997024
[1] 0.9997619
[1] 0.9998214
[1] 0.999881
[1] 0.9999405
[1] 1
[1] "Writing myfile2.fst completed."
..2019-09-10 13:23:26.546925 | eval'd: drake::cmq_buildtargetmetadepslayoutconfig
Error in clustermq:::worker("tcp://mahuika01:6020", verbose = TRUE) :
Timeout reached, terminating
Execution halted
So it looks like the workers stop being able to communicate with the master at the end.
Can you give me some tips for troubleshooting this problem?
When I close the master R session, reopen and remake I do see that the targets that completed are up-to-date, thankfully.
Do you have a traceback? Also, if you are using fst via target(format = "fst"), does the problem still occur if you drop the format?
I'm just manually saving the fst file in the code. How might I get a traceback? As I said, the master process hangs.
Ah, so you got the timeout error message after killing the master manually? Could I take a look at make.log? (Or did you show it in the last two output blocks?) A console log file would also help.
I remember you also had an issue at some point that only appeared with a large number of workers. I wonder if it could be related.
The first chunk is the master process log (i.e. the console). The second and third chunks are the ends of the make.log1 and make.log2 files respectively (there is no make.log). Would you also like the starts of those logs?
The timeout error message above is only on the workers. They appeared about 10 minutes after the target completed on each and before I killed the master.
The console log file has become far more verbose than stdout. (I have gone back and forth about changing the argument name.) And it would help to view the full contents of the make.log* files.
...at least for a couple representative workers.
You mentioned you are manually reading from and writing to fst files within targets. Roughly how much data is traveling over ZeroMQ sockets? If your targets always return NULL, sounds like a really small amount of data.
The timeout message also makes me think timeout may have something to do with it. Maybe a timeout is reached but the master continues for some reason? Could be simple to test if you scale down the targets and set a low walltime.
Basically zero data is travelling through ZeroMQ sockets, you are right. There is one small data.frame that gets sent from the global environment as well as some functions.
I can confirm that no SLURM timeout gets reached as I could see the jobs still running in squeue before I killed the master and after I saw the clustermq timeout.
I will run it again with the console log on.
Could be simple to test if you scale down the targets and set a low walltime.
Can you elaborate on what you mean here? As I said above, the targets run fine when I make them smaller.
Here's what I see in the console log. I will get back to you if it adds anything once the targets are done.
make(myplan, verbose = 3, jobs_preprocess = 8,
jobs = 2, parallelism = "clustermq",
log_progress = TRUE,
lazy_load = "promise",
console_log_file = "console.log",
memory_strategy = "lookahead",
caching = "worker",
template = list(
log_file = "make.log",
memory = 5000,
walltime = 250
),
keep_going = TRUE
)
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.068895 +1200 GMT | | begin make()
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.069984 +1200 GMT | | begin drake_config()
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.073166 +1200 GMT | | cache /scale_wlg_persistent/filesets/project/landcare00063/projects_ac/update_vcsn_fst/.drake
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.109006 +1200 GMT | | analyze environment
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.141813 +1200 GMT | | get knitr hash
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.270152 +1200 GMT | | set knitr files
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.314558 +1200 GMT | | end drake_config()
mahuika01.mahuika.nesi.org.nz | 44649 | 2019-09-10 15:30:21.378302 +1200 GMT | elevation_vcsn_crosswalk | import
mahuika01.mahuika.nesi.org.nz | 44650 | 2019-09-10 15:30:21.379007 +1200 GMT | calculate_pet | import
mahuika01.mahuika.nesi.org.nz | 44652 | 2019-09-10 15:30:21.381144 +1200 GMT | dplyr::left_join | import
mahuika01.mahuika.nesi.org.nz | 44654 | 2019-09-10 15:30:21.383709 +1200 GMT | dplyr::summarise | import
mahuika01.mahuika.nesi.org.nz | 44656 | 2019-09-10 15:30:21.385890 +1200 GMT | dplyr::group_by | import
mahuika01.mahuika.nesi.org.nz | 44658 | 2019-09-10 15:30:21.386636 +1200 GMT | dplyr::mutate | import
mahuika01.mahuika.nesi.org.nz | 44650 | 2019-09-10 15:30:21.387500 +1200 GMT | dplyr::select | import
mahuika01.mahuika.nesi.org.nz | 44649 | 2019-09-10 15:30:21.388382 +1200 GMT | lubridate::month | import
mahuika01.mahuika.nesi.org.nz | 44660 | 2019-09-10 15:30:21.388785 +1200 GMT | dplyr::filter | import
mahuika01.mahuika.nesi.org.nz | 44652 | 2019-09-10 15:30:21.389468 +1200 GMT | dplyr::ungroup | import
mahuika01.mahuika.nesi.org.nz | 44661 | 2019-09-10 15:30:21.389912 +1200 GMT | lubridate::day | import
mahuika01.mahuika.nesi.org.nz | 44654 | 2019-09-10 15:30:21.393757 +1200 GMT | lubridate::yday | import
mahuika01.mahuika.nesi.org.nz | 44650 | 2019-09-10 15:30:21.395248 +1200 GMT | dplyr::pull | import
mahuika01.mahuika.nesi.org.nz | 44656 | 2019-09-10 15:30:21.396586 +1200 GMT | lubridate::as_date | import
mahuika01.mahuika.nesi.org.nz | 44649 | 2019-09-10 15:30:21.398477 +1200 GMT | tidyr::nest | import
mahuika01.mahuika.nesi.org.nz | 44652 | 2019-09-10 15:30:21.399343 +1200 GMT | dplyr::if_else | import
mahuika01.mahuika.nesi.org.nz | 44661 | 2019-09-10 15:30:21.399500 +1200 GMT | purrr::pmap | import
mahuika01.mahuika.nesi.org.nz | 44660 | 2019-09-10 15:30:21.400681 +1200 GMT | tidyr::unnest | import
mahuika01.mahuika.nesi.org.nz | 44654 | 2019-09-10 15:30:21.402168 +1200 GMT | tibble::as_tibble | import
mahuika01.mahuika.nesi.org.nz | 44658 | 2019-09-10 15:30:21.427605 +1200 GMT | fst::write_fst | import
mahuika01.mahuika.nesi.org.nz | 44656 | 2019-09-10 15:30:21.435928 +1200 GMT | fst::read_fst | import
mahuika01.mahuika.nesi.org.nz | 44673 | 2019-09-10 15:30:21.454173 +1200 GMT | get_spi60_for_single_day | import
mahuika01.mahuika.nesi.org.nz | 44674 | 2019-09-10 15:30:21.455229 +1200 GMT | add_nzdi_smd | import
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.463604 +1200 GMT | | begin outdated()
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.466201 +1200 GMT | | find more outdated targets
mahuika01.mahuika.nesi.org.nz | 44677 | 2019-09-10 15:30:21.589191 +1200 GMT | nzdi_df_3 | trigger command
mahuika01.mahuika.nesi.org.nz | 44682 | 2019-09-10 15:30:21.591450 +1200 GMT | nzdi_df_8 | trigger command
mahuika01.mahuika.nesi.org.nz | 44682 | 2019-09-10 15:30:21.596024 +1200 GMT | nzdi_df_56 | trigger command
mahuika01.mahuika.nesi.org.nz | 44678 | 2019-09-10 15:30:21.599369 +1200 GMT | nzdi_df_4 | trigger command
mahuika01.mahuika.nesi.org.nz | 44678 | 2019-09-10 15:30:21.603084 +1200 GMT | nzdi_df_12 | trigger command
mahuika01.mahuika.nesi.org.nz | 44678 | 2019-09-10 15:30:21.606262 +1200 GMT | nzdi_df_20 | trigger command
mahuika01.mahuika.nesi.org.nz | 44678 | 2019-09-10 15:30:21.609014 +1200 GMT | nzdi_df_28 | trigger command
mahuika01.mahuika.nesi.org.nz | 44678 | 2019-09-10 15:30:21.611794 +1200 GMT | nzdi_df_44 | trigger command
mahuika01.mahuika.nesi.org.nz | 44678 | 2019-09-10 15:30:21.614619 +1200 GMT | nzdi_df_52 | trigger command
mahuika01.mahuika.nesi.org.nz | 44678 | 2019-09-10 15:30:21.617398 +1200 GMT | nzdi_df_60 | trigger command
mahuika01.mahuika.nesi.org.nz | 44680 | 2019-09-10 15:30:21.618584 +1200 GMT | nzdi_df_6 | trigger command
mahuika01.mahuika.nesi.org.nz | 44681 | 2019-09-10 15:30:21.619565 +1200 GMT | nzdi_df_7 | trigger command
mahuika01.mahuika.nesi.org.nz | 44678 | 2019-09-10 15:30:21.619905 +1200 GMT | nzdi_df_68 | trigger command
mahuika01.mahuika.nesi.org.nz | 44680 | 2019-09-10 15:30:21.622573 +1200 GMT | nzdi_df_14 | trigger command
mahuika01.mahuika.nesi.org.nz | 44679 | 2019-09-10 15:30:21.623844 +1200 GMT | nzdi_df_5 | trigger command
mahuika01.mahuika.nesi.org.nz | 44681 | 2019-09-10 15:30:21.624164 +1200 GMT | nzdi_df_31 | trigger command
mahuika01.mahuika.nesi.org.nz | 44680 | 2019-09-10 15:30:21.626303 +1200 GMT | nzdi_df_54 | trigger command
mahuika01.mahuika.nesi.org.nz | 44679 | 2019-09-10 15:30:21.627523 +1200 GMT | nzdi_df_13 | trigger command
mahuika01.mahuika.nesi.org.nz | 44681 | 2019-09-10 15:30:21.627572 +1200 GMT | nzdi_df_47 | trigger command
mahuika01.mahuika.nesi.org.nz | 44680 | 2019-09-10 15:30:21.629130 +1200 GMT | nzdi_df_70 | trigger command
mahuika01.mahuika.nesi.org.nz | 44681 | 2019-09-10 15:30:21.630316 +1200 GMT | nzdi_df_55 | trigger command
mahuika01.mahuika.nesi.org.nz | 44679 | 2019-09-10 15:30:21.630700 +1200 GMT | nzdi_df_29 | trigger command
mahuika01.mahuika.nesi.org.nz | 44676 | 2019-09-10 15:30:21.631803 +1200 GMT | nzdi_df_2 | skip
mahuika01.mahuika.nesi.org.nz | 44681 | 2019-09-10 15:30:21.633431 +1200 GMT | nzdi_df_87 | trigger command
mahuika01.mahuika.nesi.org.nz | 44679 | 2019-09-10 15:30:21.633591 +1200 GMT | nzdi_df_53 | trigger command
mahuika01.mahuika.nesi.org.nz | 44676 | 2019-09-10 15:30:21.635862 +1200 GMT | nzdi_df_26 | trigger command
mahuika01.mahuika.nesi.org.nz | 44679 | 2019-09-10 15:30:21.636186 +1200 GMT | nzdi_df_61 | trigger command
mahuika01.mahuika.nesi.org.nz | 44681 | 2019-09-10 15:30:21.636323 +1200 GMT | nzdi_df_95 | trigger command
mahuika01.mahuika.nesi.org.nz | 44676 | 2019-09-10 15:30:21.638838 +1200 GMT | nzdi_df_34 | trigger command
mahuika01.mahuika.nesi.org.nz | 44679 | 2019-09-10 15:30:21.639058 +1200 GMT | nzdi_df_77 | trigger command
mahuika01.mahuika.nesi.org.nz | 44679 | 2019-09-10 15:30:21.641579 +1200 GMT | nzdi_df_85 | trigger commandmahuika01.mahuika.nesi.org.nz | 44676 | 2019-09-10 15:30:21.641566 +1200 GMT | nzdi_df_42 | trigger command
mahuika01.mahuika.nesi.org.nz | 44679 | 2019-09-10 15:30:21.644107 +1200 GMT | nzdi_df_93 | trigger command
mahuika01.mahuika.nesi.org.nz | 44676 | 2019-09-10 15:30:21.644819 +1200 GMT | nzdi_df_82 | trigger command
mahuika01.mahuika.nesi.org.nz | 44676 | 2019-09-10 15:30:21.647510 +1200 GMT | nzdi_df_90 | trigger command
mahuika01.mahuika.nesi.org.nz | 44675 | 2019-09-10 15:30:21.652359 +1200 GMT | nzdi_df | skip
mahuika01.mahuika.nesi.org.nz | 44675 | 2019-09-10 15:30:21.657043 +1200 GMT | nzdi_df_17 | trigger command
mahuika01.mahuika.nesi.org.nz | 44675 | 2019-09-10 15:30:21.660567 +1200 GMT | nzdi_df_33 | trigger command
mahuika01.mahuika.nesi.org.nz | 44675 | 2019-09-10 15:30:21.663843 +1200 GMT | nzdi_df_65 | trigger command
mahuika01.mahuika.nesi.org.nz | 44675 | 2019-09-10 15:30:21.666972 +1200 GMT | nzdi_df_89 | trigger command
mahuika01.mahuika.nesi.org.nz | 44675 | 2019-09-10 15:30:21.669829 +1200 GMT | nzdi_df_97 | trigger command
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.671193 +1200 GMT | | find more outdated targets
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.673549 +1200 GMT | | find downstream outdated targets
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.679045 +1200 GMT | | end outdated()
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.679639 +1200 GMT | | isolate oudated targets
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.687265 +1200 GMT | | construct priority queue
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.723451 +1200 GMT | nzdi_df_3 | trigger command
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.937588 +1200 GMT | | setting common data
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.951055 +1200 GMT | | begin scheduling targets
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:30.605054 +1200 GMT | | sending common data
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:30.641634 +1200 GMT | | sending common data
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:30.927170 +1200 GMT | nzdi_df_3 | trigger command
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:30.929628 +1200 GMT | nzdi_df_3 | target nzdi_df_3
wbn205 | 19557 | 2019-09-10 15:30:30.972838 +1200 GMT | nzdi_df_3 | build on an hpc worker
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:31.295992 +1200 GMT | nzdi_df_4 | trigger command
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:31.299542 +1200 GMT | nzdi_df_4 | target nzdi_df_4
wbn209 | 119536 | 2019-09-10 15:30:31.395678 +1200 GMT | nzdi_df_4 | build on an hpc worker
wbn205 | 19557 | 2019-09-10 16:52:09.370822 +1200 GMT | nzdi_df_3 | store
wbn205 | 19557 | 2019-09-10 16:52:09.587016 +1200 GMT | nzdi_df_3 | time 4896.886s (~1.36 hours) | 4872.732s (~1.35 hours) (exec | total)
wbn209 | 119536 | 2019-09-10 17:08:17.383939 +1200 GMT | nzdi_df_4 | store
wbn209 | 119536 | 2019-09-10 17:08:17.558529 +1200 GMT | nzdi_df_4 | time 5863.054s (~1.63 hours) | 5840.764s (~1.62 hours) (exec | total)
See above, @wlandau, the nzdi_df_3 target finished 6 minutes ago and the master still hasn't sent the next target for the worker to make.
I feel like I recall that the clustermq worker has a 10 minute timeout after being done and not receiving anything from the master.
Is there an exponential backoff on the master that is just backing off way too much?
:wave: Hey @kendonB...
Letting you know, @wlandau is currently OOO until Thursday, September 12th 2019. :heart:
I have updated the console.log file above. The worker logs are essentially the same as the earlier one above
Looks as though clustermq is timing out, as you suggested. Maybe increase the timeout in your template file as recommended at https://github.com/mschubert/clustermq/issues/146#issuecomment-508564778. Maybe I should set timeout = Inf as the default in all of drake's templates...
#!/bin/sh
# From https://github.com/mschubert/clustermq/wiki/SLURM
#SBATCH --job-name={{ job_name }} # job name
#SBATCH --partition=default # partition
#SBATCH --output={{ log_file | /dev/null }} # you can add .%a for array index
#SBATCH --error={{ log_file | /dev/null }} # log file
#SBATCH --mem-per-cpu={{ memory | 4096 }} # memory
#SBATCH --array=1-{{ n_jobs }} # job array
# module load R # Uncomment if R is an environment module.
ulimit -v $(( 1024 * {{ memory | 4096 }} ))
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}", timeout = {{ timeout | Inf }})'
cc @mschubert
Do you think that cmq is timing out on the master as well as the workers? If not, it's surprising that 10 minutes can go by with the worker not being able to tell the master that it's ready for more work. If the master is timing out, shouldn't there be a message to that effect? Perhaps the console log could also include periodic (e.g. every 5 minutes) messages saying that an idle master / worker is still alive as expected.
There is a worker timeout in case the master crashes, and the master timeout in case workers crash.
By default, the master timeout is Inf and the worker timeout 10 minutes. Both are configurable, so depending on what you do you may want to set them.
In any case, socket timeouts should go away in favor of disconnect monitors in v0.9, but that is not possible with rzmq and hence depends on replacing the bindings.
edit: Reading your original post, it's strange that the master does not receive the completion message from the worker. I routinely ran multi-hour worker calls, so that should in itself work. Can you reproduce with cmq alone?
Here is my slurm template for reference / clues. I'm guessing I wasn't able to get the R call to work on my system but using Rscript is probably not the culprit right?
#!/bin/sh
#SBATCH --job-name={{ job_name }}
#SBATCH --output={{ log_file | /dev/null }}%a # you can add .%a for array index
#SBATCH --mem-per-cpu={{ memory | 4096 }}
#SBATCH --array=1-{{ n_jobs }}
#SBATCH --account=landcare00063
#SBATCH --time={{ walltime }}
#SBATCH --partition={{ partition | large}}
source ~/.bashrc
source /etc/profile
## Export value of DEBUGME environemnt var to slave
## export DEBUGME=<%= Sys.getenv("DEBUGME") %>
ulimit -v $(( 1024 * {{ memory | 4096 }} ))
Rscript -e 'clustermq:::worker("{{ master }}", verbose = TRUE)'
## R --no-save --no-restore -e 'ulimit::memory_limit({{ memory | 4096 }}); clustermq:::worker("{{ master }}")'
@mschubert, I haven't tried to see if the problem persists using a pure clustermq solution. Any code/pseudo-code to help me do this is welcome but I'm sure I can figure it out from the docs.
clustermq::worker(timeout = Inf) in the template seems like a stopgap. But maybe the drake's main event loop should handle the worker errors directly. Right now we have
Maybe it should be
cmq_master <- function(config) {
on.exit(config$workers$finalize())
log_msg("begin scheduling targets", config = config)
while (config$counter$remaining > 0) {
msg <- config$workers$receive_data()
cmq_conclude_build(msg = msg, config = config)
if (identical(msg$id, "WORKER_ERROR")) {
stop("worker failed", call. = FALSE)
}
if (!identical(msg$token, "set_common_data_token")) {
log_msg("sending common data", config = config)
config$workers$send_common_data()
} else if (!config$queue$empty()) {
cmq_next_target(config)
} else {
config$workers$send_shutdown_worker()
}
}
if (config$workers$cleanup()) {
on.exit()
}
}
Just so I'm clear about my understanding of what's going on:
Inf timeout by default so should still be there constantly listening. The master unfortunately does not see when the workers are done. In the above case, there are only two workers so it shouldn't take anywhere near 10 minutes to poll two workers to see that they're done?So, I don't think adding an infinite timeout to the workers is going to solve anything here. It looks like there's something else stopping the communication happening in the first place.
I agree with @kendonB here. If you give me a reproducible example with cmq alone I can help debug
Good point, I do not know why it would take 10 whole minutes for a worker to contact the master after it finishes a target. (Certainly the 90 minutes it takes to build a target is longer than 10, but the 10-minute countdown starts after the target finishes, right?)
Yes, I see the timeout happen about 10 minutes after I see the ..2019-09-10 13:23:26.546925 | eval'd: drake::cmq_buildtargetmetadepslayoutconfig at the end of the worker log
Just stepping it through, it's getting stuck here at the beginning:
https://github.com/ropensci/rzmq/blob/bcc4c92f4653181720ea0183ecdaa73215cbad40/R/rzmq.R#L120
in the call to config$workers$receive_data() immediately after the second target is submitted.
I will check back in about an hour to see if it goes past this in the drake loop.
For reference:
It does not go past that point in the R code. So it definitely gets stuck here: https://github.com/ropensci/rzmq/blob/bcc4c92f4653181720ea0183ecdaa73215cbad40/R/rzmq.R#L120
in the call to config$workers$receive_data() immediately after the second target is submitted.
Maybe something like this? Not yet sure if it can reproduce the problem after we crank up wait.
library(clustermq)
#> * Option 'clustermq.scheduler' not set, defaulting to 'LOCAL'
#> --- see: https://mschubert.github.io/clustermq/articles/userguide.html#configuration
library(rlang)
options(clustermq.scheduler = "multicore")
test_run <- function(wait = 0.1) {
workers <- workers(n_jobs = 1L)
on.exit(workers$finalize())
workers$set_common_data(
export = list(),
fun = identity,
const = list(),
rettype = list(),
pkgs = character(0),
common_seed = 0L,
token = "set_common_data_token"
)
main_loop(workers = workers, wait = wait)
}
main_loop <- function(workers, wait) {
counter <- 3L
while (counter > 0L) {
print(counter)
msg <- workers$receive_data()
if (!is.null(msg$result)) {
counter <- counter - 1L
}
if (!identical(msg$token, "set_common_data_token")) {
workers$send_common_data()
}
else if (counter > 0L) {
workers$send_call(
expr = c(Sys.sleep(wait), 123),
env = list(wait = wait)
)
} else {
workers$send_shutdown_worker()
}
}
}
test_run(wait = 0.1)
#> [1] 3
#> [1] 3
#> [1] 3
#> [1] 2
#> [1] 1
#> Warning in workers$finalize(): Unclean shutdown for PIDs: 18389
Created on 2019-09-11 by the reprex package (v0.3.0)
I have this running now with a 90 minute wait time.
Can confirm I get the same behaviour as before; as in, the worker finishes up the work and the master process hangs.
In that case, should we move this issue to clustermq? (Maybe rzmq?) @mschubert, is this something that might resolve itself when you switch to custom ZeroMQ bindings?
@kendonB, what versions of clustermq, rzmq, and ZeroMQ are you using? And when you ran https://github.com/ropensci/drake/issues/1005#issuecomment-530637368, were you using the multicore backend or SLURM?
I was using SLURM. I can also try using multicore tomorrow.
I've tried your example and both multicore and slurm finish without errors for me.
What's your sessionInfo()?
Same here. I cranked up the wait time to 90 minutes, and the master process completed normally in a reasonable 4.5 hours on both multicore and SGE. I am using ZeroMQ version 4.2.3, and my own sessionInfo() is below.
@kendonB, I am closing this issue because I believe the problem comes from your setup and not drake itself, but I will still try to help if I can.
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux
Matrix products: default
BLAS: CENSORED/R-3.6.0/lib64/R/lib/libRblas.so
LAPACK: CENSORED/R-3.6.0/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] clustermq_0.8.8
loaded via a namespace (and not attached):
[1] compiler_3.6.0 R6_2.4.0 tools_3.6.0 parallel_3.6.0 rzmq_0.9.6
I misunderstood what the script does. I shut mine down before 4.5 hours thinking I should expect it to be done in 90 minutes. I will try it again. My example was failing with 2 workers running 2 jobs (1 each) that each take around 90 minutes and return NULL. So I'll adjust the test code to have those properties.
I'm on ZeroMQ 4.2.5 and session info is:
```r
devtools::session_info()
β Session info βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
setting value
version R version 3.6.1 (2019-07-05)
os CentOS Linux 7 (Core)
system x86_64, linux-gnu
ui X11
language (EN)
collate en_NZ.UTF-8
ctype en_NZ.UTF-8
tz NZ
date 2019-09-13
β Packages βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
package * version date lib source
assertthat 0.2.1 2019-03-21 [2] CRAN (R 3.6.1)
backports 1.1.4 2019-04-10 [2] CRAN (R 3.6.1)
callr 3.3.0 2019-07-04 [2] CRAN (R 3.6.1)
cli 1.1.0 2019-03-19 [2] CRAN (R 3.6.1)
clustermq * 0.8.8 2019-06-05 [1] CRAN (R 3.6.1)
crayon 1.3.4 2017-09-16 [2] CRAN (R 3.6.1)
desc 1.2.0 2018-05-01 [2] CRAN (R 3.6.1)
devtools 2.1.0 2019-07-06 [1] CRAN (R 3.6.1)
digest 0.6.20 2019-07-04 [2] CRAN (R 3.6.1)
fs 1.3.1 2019-05-06 [1] CRAN (R 3.6.1)
glue 1.3.1 2019-03-12 [2] CRAN (R 3.6.1)
magrittr 1.5 2014-11-22 [2] CRAN (R 3.6.1)
memoise 1.1.0 2017-04-21 [2] CRAN (R 3.6.1)
pkgbuild 1.0.3 2019-03-20 [2] CRAN (R 3.6.1)
pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.6.1)
prettyunits 1.0.2 2015-07-13 [2] CRAN (R 3.6.1)
processx 3.4.0 2019-07-03 [2] CRAN (R 3.6.1)
ps 1.3.0 2018-12-21 [2] CRAN (R 3.6.1)
R6 2.4.0 2019-02-14 [2] CRAN (R 3.6.1)
Rcpp 1.0.2 2019-07-25 [1] CRAN (R 3.6.1)
remotes 2.1.0 2019-06-24 [1] CRAN (R 3.6.1)
rlang 0.4.0 2019-06-25 [2] CRAN (R 3.6.1)
rprojroot 1.3-2 2018-01-03 [2] CRAN (R 3.6.1)
rzmq * 0.9.6 2019-03-01 [1] CRAN (R 3.6.1)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.1)
testthat 2.2.1 2019-07-25 [1] CRAN (R 3.6.1)
usethis 1.5.1 2019-07-04 [1] CRAN (R 3.6.1)
withr 2.1.2 2018-03-15 [2] CRAN (R 3.6.1)```
Deleted a comment as I got test processes mixed up. Double checking now that it actually does work in the minimal clustermq example.
I guess I should test again with 2 workers.
OK apologies. I have now run this exact script:
library(clustermq)
#> * Option 'clustermq.scheduler' not set, defaulting to 'LOCAL'
#> --- see: https://mschubert.github.io/clustermq/articles/userguide.html#configuration
library(rlang)
options(clustermq.scheduler = "slurm",
clustermq.template = "slurm_clustermq.tmpl")
test_run <- function(wait = 0.1) {
workers <- workers(n_jobs = 2L, template =
list(
log_file = "make.log",
memory = 500,
walltime = 250
))
on.exit(workers$finalize())
workers$set_common_data(
export = list(),
fun = identity,
const = list(),
rettype = list(),
pkgs = character(0),
common_seed = 0L,
token = "set_common_data_token"
)
main_loop(workers = workers, wait = wait)
}
main_loop <- function(workers, wait) {
counter <- 2L
while (counter > 0L) {
print(counter)
msg <- workers$receive_data()
if (!is.null(msg$result)) {
counter <- counter - 1L
}
if (!identical(msg$token, "set_common_data_token")) {
workers$send_common_data()
}
else if (counter > 0L) {
workers$send_call(
expr = {
print("I'm working.")
c(Sys.sleep(wait), 123)
NULL
},
env = list(wait = wait)
)
} else {
workers$send_shutdown_worker()
}
}
}
test_run(wait = 3600*1.5)
The master process hangs (presumably at the workers$receive_data() call) and the two worker logs are:
> clustermq:::worker("tcp://mahuika01:7649")
2019-09-13 12:58:54.413862 | Master: tcp://mahuika01:7649
2019-09-13 12:58:54.422901 | WORKER_UP to: tcp://mahuika01:7649
2019-09-13 12:58:54.424276 | > DO_SETUP (0.000s wait)
2019-09-13 12:58:54.424635 | token from msg: set_common_data_token
2019-09-13 12:58:54.449676 | > DO_CALL (0.000s wait)
[1] "I'm working."
2019-09-13 14:28:54.486711 | eval'd: {print("I'm working.")c(Sys.sleep(wait), 123)NULL
Error in clustermq:::worker("tcp://mahuika01:7649") :
Timeout reached, terminating
Execution halted
and
> clustermq:::worker("tcp://mahuika01:7649")
2019-09-13 12:58:55.203638 | Master: tcp://mahuika01:7649
2019-09-13 12:58:55.217463 | WORKER_UP to: tcp://mahuika01:7649
2019-09-13 12:58:55.218325 | > DO_SETUP (0.000s wait)
2019-09-13 12:58:55.218681 | token from msg: set_common_data_token
2019-09-13 12:58:55.226476 | > DO_CALL (0.001s wait)
[1] "I'm working."
2019-09-13 14:28:55.327345 | eval'd: {print("I'm working.")c(Sys.sleep(wait), 123)NULL
Error in clustermq:::worker("tcp://mahuika01:7649") :
Timeout reached, terminating
Execution halted
and on the master I see:
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
My template file is:
#!/bin/sh
#SBATCH --job-name={{ job_name }}
#SBATCH --output={{ log_file | /dev/null }}%a # you can add .%a for array index
#SBATCH --mem-per-cpu={{ memory | 4096 }}
#SBATCH --array=1-{{ n_jobs }}
#SBATCH --account=landcare00063
#SBATCH --time={{ walltime }}
#SBATCH --partition={{ partition | large}}
source ~/.bashrc
source /etc/profile
## Export value of DEBUGME environemnt var to slave
## export DEBUGME=<%= Sys.getenv("DEBUGME") %>
ulimit -v $(( 1024 * {{ memory | 4096 }} ))
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'
My session info is:
> devtools::session_info()
β Session info βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
setting value
version R version 3.6.1 (2019-07-05)
os CentOS Linux 7 (Core)
system x86_64, linux-gnu
ui X11
language (EN)
collate en_NZ.UTF-8
ctype en_NZ.UTF-8
tz NZ
date 2019-09-13
β Packages βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
package * version date lib source
assertthat 0.2.1 2019-03-21 [2] CRAN (R 3.6.1)
backports 1.1.4 2019-04-10 [2] CRAN (R 3.6.1)
callr 3.3.0 2019-07-04 [2] CRAN (R 3.6.1)
cli 1.1.0 2019-03-19 [2] CRAN (R 3.6.1)
clustermq * 0.8.8 2019-06-05 [1] CRAN (R 3.6.1)
crayon 1.3.4 2017-09-16 [2] CRAN (R 3.6.1)
desc 1.2.0 2018-05-01 [2] CRAN (R 3.6.1)
devtools 2.1.0 2019-07-06 [1] CRAN (R 3.6.1)
digest 0.6.20 2019-07-04 [2] CRAN (R 3.6.1)
fs 1.3.1 2019-05-06 [1] CRAN (R 3.6.1)
glue 1.3.1 2019-03-12 [2] CRAN (R 3.6.1)
magrittr 1.5 2014-11-22 [2] CRAN (R 3.6.1)
memoise 1.1.0 2017-04-21 [2] CRAN (R 3.6.1)
pkgbuild 1.0.3 2019-03-20 [2] CRAN (R 3.6.1)
pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.6.1)
prettyunits 1.0.2 2015-07-13 [2] CRAN (R 3.6.1)
processx 3.4.0 2019-07-03 [2] CRAN (R 3.6.1)
ps 1.3.0 2018-12-21 [2] CRAN (R 3.6.1)
R6 2.4.0 2019-02-14 [2] CRAN (R 3.6.1)
Rcpp 1.0.2 2019-07-25 [1] CRAN (R 3.6.1)
remotes 2.1.0 2019-06-24 [1] CRAN (R 3.6.1)
rlang * 0.4.0 2019-06-25 [2] CRAN (R 3.6.1)
rprojroot 1.3-2 2018-01-03 [2] CRAN (R 3.6.1)
rzmq 0.9.6 2019-03-01 [1] CRAN (R 3.6.1)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.1)
testthat 2.2.1 2019-07-25 [1] CRAN (R 3.6.1)
usethis 1.5.1 2019-07-04 [1] CRAN (R 3.6.1)
withr 2.1.2 2018-03-15 [2] CRAN (R 3.6.1)
And my ZeroMQ version is:
> rzmq::zmq.version()
[1] "4.2.5"
@wlandau I'm happy to move this off the drake github and onto the clustermq github if you like. @mschubert what other troubleshooting steps could I do since it does seem environmental if this code is working for both of you.
@kendonB, thank you for the thorough follow-up. I agree to move this to clustermq.