Drake: clustermq run works for short-running targets but fails with long-running targets

Created on 10 Sep 2019 · 36Comments · Source: ropensci/drake

Prework

[x] Read and abide by drake's code of conduct.
[x] Search for duplicates among the existing issues, both open and closed.
[x] Advanced users: verify that the bug still persists in the current development version (i.e. remotes::install_github("ropensci/drake")) and mention the SHA-1 hash of the Git commit you install.

Description

I have a particularly wicked problem. I am building 100 targets that take around 90 minutes each to build. The basic structure of the code is:

Read a .fst in
Process for 90 minutes in memory
Write a similarly sized .fst
Return NULL

I can easily shorten the targets so that they run quickly. When I do that, I have no problems. However, when I run the full version,

The master hangs, as in I see just:

make(myplan, verbose = 3, jobs_preprocess = 8, 
     jobs = 2, parallelism = "clustermq",
     log_progress = TRUE,
     lazy_load = "promise",
     console_log_file = NULL, 
     memory_strategy = "lookahead", 
     caching = "worker",
     template = list(
       log_file = "make.log", 
       memory = 5000,
       walltime = 250
     ),
     keep_going = TRUE
)
#> ⠦Warning in private$fill_options(...) :
#>   Add 'CMQ_AUTH={{ auth }}' to template to enable socket authentication
#> target nzdi_df
#> target nzdi_df_2

I can see the targets complete in the logs. As in, I see:

2019-09-10 11:47:35.742293 | Master: tcp://mahuika01:6020
2019-09-10 11:47:35.803966 | WORKER_UP to: tcp://mahuika01:6020
2019-09-10 11:47:36.410054 | > DO_SETUP (0.548s wait)
2019-09-10 11:47:36.410445 | token from msg: set_common_data_token
2019-09-10 11:47:36.551556 | > DO_CALL (0.000s wait)

.── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.0     ✔ purrr   0.3.2
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   0.8.3     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ tidyr::expand() masks drake::expand()
✖ dplyr::filter() masks stats::filter()
✖ tidyr::gather() masks drake::gather()
✖ dplyr::lag()    masks stats::lag()

Attaching package: ‘lubridate’

The following object is masked from ‘package:base’:

    date

[1] 5.952027e-05
[1] 0.0001190405
[1] 0.0001785608
[1] 0.0002380811
[1] 0.0002976013
[1] 0.0003571216
[1] 0.0004166419
[1] 0.0004761621
[1] 0.0005356824
...
[1] 0.9997024
[1] 0.9997619
[1] 0.9998214
[1] 0.999881
[1] 0.9999405
[1] 1 # This is my hacky progress indicator completing
[1] "Writing myfile1.fst completed."
..2019-09-10 13:17:18.947444 | eval'd: drake::cmq_buildtargetmetadepslayoutconfig
Error in clustermq:::worker("tcp://mahuika01:6020", verbose = TRUE) :
  Timeout reached, terminating
Execution halted

and

[1] 0.9997024
[1] 0.9997619
[1] 0.9998214
[1] 0.999881
[1] 0.9999405
[1] 1
[1] "Writing myfile2.fst completed."
..2019-09-10 13:23:26.546925 | eval'd: drake::cmq_buildtargetmetadepslayoutconfig
Error in clustermq:::worker("tcp://mahuika01:6020", verbose = TRUE) :
  Timeout reached, terminating
Execution halted

So it looks like the workers stop being able to communicate with the master at the end.

Can you give me some tips for troubleshooting this problem?

When I close the master R session, reopen and remake I do see that the targets that completed are up-to-date, thankfully.

use case

Source

kendonB

All 36 comments

Do you have a traceback? Also, if you are using fst via target(format = "fst"), does the problem still occur if you drop the format?

wlandau on 10 Sep 2019

I'm just manually saving the fst file in the code. How might I get a traceback? As I said, the master process hangs.

kendonB on 10 Sep 2019

Ah, so you got the timeout error message after killing the master manually? Could I take a look at make.log? (Or did you show it in the last two output blocks?) A console log file would also help.

I remember you also had an issue at some point that only appeared with a large number of workers. I wonder if it could be related.

wlandau on 10 Sep 2019

The first chunk is the master process log (i.e. the console). The second and third chunks are the ends of the make.log1 and make.log2 files respectively (there is no make.log). Would you also like the starts of those logs?

The timeout error message above is only on the workers. They appeared about 10 minutes after the target completed on each and before I killed the master.

kendonB on 10 Sep 2019

The console log file has become far more verbose than stdout. (I have gone back and forth about changing the argument name.) And it would help to view the full contents of the make.log* files.

wlandau on 10 Sep 2019

...at least for a couple representative workers.

wlandau on 10 Sep 2019

You mentioned you are manually reading from and writing to fst files within targets. Roughly how much data is traveling over ZeroMQ sockets? If your targets always return NULL, sounds like a really small amount of data.

The timeout message also makes me think timeout may have something to do with it. Maybe a timeout is reached but the master continues for some reason? Could be simple to test if you scale down the targets and set a low walltime.

wlandau on 10 Sep 2019

Basically zero data is travelling through ZeroMQ sockets, you are right. There is one small data.frame that gets sent from the global environment as well as some functions.

I can confirm that no SLURM timeout gets reached as I could see the jobs still running in squeue before I killed the master and after I saw the clustermq timeout.

I will run it again with the console log on.

Could be simple to test if you scale down the targets and set a low walltime.

Can you elaborate on what you mean here? As I said above, the targets run fine when I make them smaller.

kendonB on 10 Sep 2019

Here's what I see in the console log. I will get back to you if it adds anything once the targets are done.

make(myplan, verbose = 3, jobs_preprocess = 8, 
     jobs = 2, parallelism = "clustermq",
     log_progress = TRUE,
     lazy_load = "promise",
     console_log_file = "console.log", 
     memory_strategy = "lookahead", 
     caching = "worker",
     template = list(
       log_file = "make.log", 
       memory = 5000,
       walltime = 250
     ),
     keep_going = TRUE
)

mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.068895 +1200 GMT |  | begin make()
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.069984 +1200 GMT |  | begin drake_config()
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.073166 +1200 GMT |  | cache /scale_wlg_persistent/filesets/project/landcare00063/projects_ac/update_vcsn_fst/.drake
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.109006 +1200 GMT |  | analyze environment
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.141813 +1200 GMT |  | get knitr hash
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.270152 +1200 GMT |  | set knitr files
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.314558 +1200 GMT |  | end drake_config()
mahuika01.mahuika.nesi.org.nz | 44649 | 2019-09-10 15:30:21.378302 +1200 GMT | elevation_vcsn_crosswalk | import
mahuika01.mahuika.nesi.org.nz | 44650 | 2019-09-10 15:30:21.379007 +1200 GMT | calculate_pet | import
mahuika01.mahuika.nesi.org.nz | 44652 | 2019-09-10 15:30:21.381144 +1200 GMT | dplyr::left_join | import
mahuika01.mahuika.nesi.org.nz | 44654 | 2019-09-10 15:30:21.383709 +1200 GMT | dplyr::summarise | import
mahuika01.mahuika.nesi.org.nz | 44656 | 2019-09-10 15:30:21.385890 +1200 GMT | dplyr::group_by | import
mahuika01.mahuika.nesi.org.nz | 44658 | 2019-09-10 15:30:21.386636 +1200 GMT | dplyr::mutate | import
mahuika01.mahuika.nesi.org.nz | 44650 | 2019-09-10 15:30:21.387500 +1200 GMT | dplyr::select | import
mahuika01.mahuika.nesi.org.nz | 44649 | 2019-09-10 15:30:21.388382 +1200 GMT | lubridate::month | import
mahuika01.mahuika.nesi.org.nz | 44660 | 2019-09-10 15:30:21.388785 +1200 GMT | dplyr::filter | import
mahuika01.mahuika.nesi.org.nz | 44652 | 2019-09-10 15:30:21.389468 +1200 GMT | dplyr::ungroup | import
mahuika01.mahuika.nesi.org.nz | 44661 | 2019-09-10 15:30:21.389912 +1200 GMT | lubridate::day | import
mahuika01.mahuika.nesi.org.nz | 44654 | 2019-09-10 15:30:21.393757 +1200 GMT | lubridate::yday | import
mahuika01.mahuika.nesi.org.nz | 44650 | 2019-09-10 15:30:21.395248 +1200 GMT | dplyr::pull | import
mahuika01.mahuika.nesi.org.nz | 44656 | 2019-09-10 15:30:21.396586 +1200 GMT | lubridate::as_date | import
mahuika01.mahuika.nesi.org.nz | 44649 | 2019-09-10 15:30:21.398477 +1200 GMT | tidyr::nest | import
mahuika01.mahuika.nesi.org.nz | 44652 | 2019-09-10 15:30:21.399343 +1200 GMT | dplyr::if_else | import
mahuika01.mahuika.nesi.org.nz | 44661 | 2019-09-10 15:30:21.399500 +1200 GMT | purrr::pmap | import
mahuika01.mahuika.nesi.org.nz | 44660 | 2019-09-10 15:30:21.400681 +1200 GMT | tidyr::unnest | import
mahuika01.mahuika.nesi.org.nz | 44654 | 2019-09-10 15:30:21.402168 +1200 GMT | tibble::as_tibble | import
mahuika01.mahuika.nesi.org.nz | 44658 | 2019-09-10 15:30:21.427605 +1200 GMT | fst::write_fst | import
mahuika01.mahuika.nesi.org.nz | 44656 | 2019-09-10 15:30:21.435928 +1200 GMT | fst::read_fst | import
mahuika01.mahuika.nesi.org.nz | 44673 | 2019-09-10 15:30:21.454173 +1200 GMT | get_spi60_for_single_day | import
mahuika01.mahuika.nesi.org.nz | 44674 | 2019-09-10 15:30:21.455229 +1200 GMT | add_nzdi_smd | import
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.463604 +1200 GMT |  | begin outdated()
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.466201 +1200 GMT |  | find more outdated targets
mahuika01.mahuika.nesi.org.nz | 44677 | 2019-09-10 15:30:21.589191 +1200 GMT | nzdi_df_3 | trigger command
mahuika01.mahuika.nesi.org.nz | 44682 | 2019-09-10 15:30:21.591450 +1200 GMT | nzdi_df_8 | trigger command
mahuika01.mahuika.nesi.org.nz | 44682 | 2019-09-10 15:30:21.596024 +1200 GMT | nzdi_df_56 | trigger command
mahuika01.mahuika.nesi.org.nz | 44678 | 2019-09-10 15:30:21.599369 +1200 GMT | nzdi_df_4 | trigger command
mahuika01.mahuika.nesi.org.nz | 44678 | 2019-09-10 15:30:21.603084 +1200 GMT | nzdi_df_12 | trigger command
mahuika01.mahuika.nesi.org.nz | 44678 | 2019-09-10 15:30:21.606262 +1200 GMT | nzdi_df_20 | trigger command
mahuika01.mahuika.nesi.org.nz | 44678 | 2019-09-10 15:30:21.609014 +1200 GMT | nzdi_df_28 | trigger command
mahuika01.mahuika.nesi.org.nz | 44678 | 2019-09-10 15:30:21.611794 +1200 GMT | nzdi_df_44 | trigger command
mahuika01.mahuika.nesi.org.nz | 44678 | 2019-09-10 15:30:21.614619 +1200 GMT | nzdi_df_52 | trigger command
mahuika01.mahuika.nesi.org.nz | 44678 | 2019-09-10 15:30:21.617398 +1200 GMT | nzdi_df_60 | trigger command
mahuika01.mahuika.nesi.org.nz | 44680 | 2019-09-10 15:30:21.618584 +1200 GMT | nzdi_df_6 | trigger command
mahuika01.mahuika.nesi.org.nz | 44681 | 2019-09-10 15:30:21.619565 +1200 GMT | nzdi_df_7 | trigger command
mahuika01.mahuika.nesi.org.nz | 44678 | 2019-09-10 15:30:21.619905 +1200 GMT | nzdi_df_68 | trigger command
mahuika01.mahuika.nesi.org.nz | 44680 | 2019-09-10 15:30:21.622573 +1200 GMT | nzdi_df_14 | trigger command
mahuika01.mahuika.nesi.org.nz | 44679 | 2019-09-10 15:30:21.623844 +1200 GMT | nzdi_df_5 | trigger command
mahuika01.mahuika.nesi.org.nz | 44681 | 2019-09-10 15:30:21.624164 +1200 GMT | nzdi_df_31 | trigger command
mahuika01.mahuika.nesi.org.nz | 44680 | 2019-09-10 15:30:21.626303 +1200 GMT | nzdi_df_54 | trigger command
mahuika01.mahuika.nesi.org.nz | 44679 | 2019-09-10 15:30:21.627523 +1200 GMT | nzdi_df_13 | trigger command
mahuika01.mahuika.nesi.org.nz | 44681 | 2019-09-10 15:30:21.627572 +1200 GMT | nzdi_df_47 | trigger command
mahuika01.mahuika.nesi.org.nz | 44680 | 2019-09-10 15:30:21.629130 +1200 GMT | nzdi_df_70 | trigger command
mahuika01.mahuika.nesi.org.nz | 44681 | 2019-09-10 15:30:21.630316 +1200 GMT | nzdi_df_55 | trigger command
mahuika01.mahuika.nesi.org.nz | 44679 | 2019-09-10 15:30:21.630700 +1200 GMT | nzdi_df_29 | trigger command
mahuika01.mahuika.nesi.org.nz | 44676 | 2019-09-10 15:30:21.631803 +1200 GMT | nzdi_df_2 | skip
mahuika01.mahuika.nesi.org.nz | 44681 | 2019-09-10 15:30:21.633431 +1200 GMT | nzdi_df_87 | trigger command
mahuika01.mahuika.nesi.org.nz | 44679 | 2019-09-10 15:30:21.633591 +1200 GMT | nzdi_df_53 | trigger command
mahuika01.mahuika.nesi.org.nz | 44676 | 2019-09-10 15:30:21.635862 +1200 GMT | nzdi_df_26 | trigger command
mahuika01.mahuika.nesi.org.nz | 44679 | 2019-09-10 15:30:21.636186 +1200 GMT | nzdi_df_61 | trigger command
mahuika01.mahuika.nesi.org.nz | 44681 | 2019-09-10 15:30:21.636323 +1200 GMT | nzdi_df_95 | trigger command
mahuika01.mahuika.nesi.org.nz | 44676 | 2019-09-10 15:30:21.638838 +1200 GMT | nzdi_df_34 | trigger command
mahuika01.mahuika.nesi.org.nz | 44679 | 2019-09-10 15:30:21.639058 +1200 GMT | nzdi_df_77 | trigger command
mahuika01.mahuika.nesi.org.nz | 44679 | 2019-09-10 15:30:21.641579 +1200 GMT | nzdi_df_85 | trigger commandmahuika01.mahuika.nesi.org.nz | 44676 | 2019-09-10 15:30:21.641566 +1200 GMT | nzdi_df_42 | trigger command

mahuika01.mahuika.nesi.org.nz | 44679 | 2019-09-10 15:30:21.644107 +1200 GMT | nzdi_df_93 | trigger command
mahuika01.mahuika.nesi.org.nz | 44676 | 2019-09-10 15:30:21.644819 +1200 GMT | nzdi_df_82 | trigger command
mahuika01.mahuika.nesi.org.nz | 44676 | 2019-09-10 15:30:21.647510 +1200 GMT | nzdi_df_90 | trigger command
mahuika01.mahuika.nesi.org.nz | 44675 | 2019-09-10 15:30:21.652359 +1200 GMT | nzdi_df | skip
mahuika01.mahuika.nesi.org.nz | 44675 | 2019-09-10 15:30:21.657043 +1200 GMT | nzdi_df_17 | trigger command
mahuika01.mahuika.nesi.org.nz | 44675 | 2019-09-10 15:30:21.660567 +1200 GMT | nzdi_df_33 | trigger command
mahuika01.mahuika.nesi.org.nz | 44675 | 2019-09-10 15:30:21.663843 +1200 GMT | nzdi_df_65 | trigger command
mahuika01.mahuika.nesi.org.nz | 44675 | 2019-09-10 15:30:21.666972 +1200 GMT | nzdi_df_89 | trigger command
mahuika01.mahuika.nesi.org.nz | 44675 | 2019-09-10 15:30:21.669829 +1200 GMT | nzdi_df_97 | trigger command
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.671193 +1200 GMT |  | find more outdated targets
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.673549 +1200 GMT |  | find downstream outdated targets
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.679045 +1200 GMT |  | end outdated()
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.679639 +1200 GMT |  | isolate oudated targets
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.687265 +1200 GMT |  | construct priority queue
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.723451 +1200 GMT | nzdi_df_3 | trigger command
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.937588 +1200 GMT |  | setting common data
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:21.951055 +1200 GMT |  | begin scheduling targets
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:30.605054 +1200 GMT |  | sending common data
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:30.641634 +1200 GMT |  | sending common data
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:30.927170 +1200 GMT | nzdi_df_3 | trigger command
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:30.929628 +1200 GMT | nzdi_df_3 | target nzdi_df_3
wbn205 | 19557 | 2019-09-10 15:30:30.972838 +1200 GMT | nzdi_df_3 | build on an hpc worker
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:31.295992 +1200 GMT | nzdi_df_4 | trigger command
mahuika01.mahuika.nesi.org.nz | 42466 | 2019-09-10 15:30:31.299542 +1200 GMT | nzdi_df_4 | target nzdi_df_4
wbn209 | 119536 | 2019-09-10 15:30:31.395678 +1200 GMT | nzdi_df_4 | build on an hpc worker
wbn205 | 19557 | 2019-09-10 16:52:09.370822 +1200 GMT | nzdi_df_3 | store
wbn205 | 19557 | 2019-09-10 16:52:09.587016 +1200 GMT | nzdi_df_3 | time  4896.886s (~1.36 hours) | 4872.732s (~1.35 hours)  (exec | total)
wbn209 | 119536 | 2019-09-10 17:08:17.383939 +1200 GMT | nzdi_df_4 | store
wbn209 | 119536 | 2019-09-10 17:08:17.558529 +1200 GMT | nzdi_df_4 | time  5863.054s (~1.63 hours) | 5840.764s (~1.62 hours)  (exec | total)

kendonB on 10 Sep 2019

See above, @wlandau, the nzdi_df_3 target finished 6 minutes ago and the master still hasn't sent the next target for the worker to make.

I feel like I recall that the clustermq worker has a 10 minute timeout after being done and not receiving anything from the master.

Is there an exponential backoff on the master that is just backing off way too much?

kendonB on 10 Sep 2019

:wave: Hey @kendonB...

Letting you know, @wlandau is currently OOO until Thursday, September 12th 2019. :heart:

ooo[bot] on 10 Sep 2019

🎉1

I have updated the console.log file above. The worker logs are essentially the same as the earlier one above

kendonB on 10 Sep 2019

Looks as though clustermq is timing out, as you suggested. Maybe increase the timeout in your template file as recommended at https://github.com/mschubert/clustermq/issues/146#issuecomment-508564778. Maybe I should set timeout = Inf as the default in all of drake's templates...

#!/bin/sh
# From https://github.com/mschubert/clustermq/wiki/SLURM
#SBATCH --job-name={{ job_name }}           # job name
#SBATCH --partition=default                 # partition
#SBATCH --output={{ log_file | /dev/null }} # you can add .%a for array index
#SBATCH --error={{ log_file | /dev/null }}  # log file
#SBATCH --mem-per-cpu={{ memory | 4096 }}   # memory
#SBATCH --array=1-{{ n_jobs }}              # job array
# module load R                             # Uncomment if R is an environment module.
ulimit -v $(( 1024 * {{ memory | 4096 }} ))
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}", timeout = {{ timeout | Inf }})'

cc @mschubert

wlandau on 10 Sep 2019

Do you think that cmq is timing out on the master as well as the workers? If not, it's surprising that 10 minutes can go by with the worker not being able to tell the master that it's ready for more work. If the master is timing out, shouldn't there be a message to that effect? Perhaps the console log could also include periodic (e.g. every 5 minutes) messages saying that an idle master / worker is still alive as expected.

kendonB on 10 Sep 2019

There is a worker timeout in case the master crashes, and the master timeout in case workers crash.

By default, the master timeout is Inf and the worker timeout 10 minutes. Both are configurable, so depending on what you do you may want to set them.

In any case, socket timeouts should go away in favor of disconnect monitors in v0.9, but that is not possible with rzmq and hence depends on replacing the bindings.

edit: Reading your original post, it's strange that the master does not receive the completion message from the worker. I routinely ran multi-hour worker calls, so that should in itself work. Can you reproduce with cmq alone?

mschubert on 10 Sep 2019

Here is my slurm template for reference / clues. I'm guessing I wasn't able to get the R call to work on my system but using Rscript is probably not the culprit right?

#!/bin/sh
#SBATCH --job-name={{ job_name }}
#SBATCH --output={{ log_file | /dev/null }}%a # you can add .%a for array index
#SBATCH --mem-per-cpu={{ memory | 4096 }}
#SBATCH --array=1-{{ n_jobs }}
#SBATCH --account=landcare00063
#SBATCH --time={{ walltime }}
#SBATCH --partition={{ partition | large}}
source ~/.bashrc
source /etc/profile
## Export value of DEBUGME environemnt var to slave
## export DEBUGME=<%= Sys.getenv("DEBUGME") %>

ulimit -v $(( 1024 * {{ memory | 4096 }} ))
Rscript -e 'clustermq:::worker("{{ master }}", verbose = TRUE)'
## R --no-save --no-restore -e 'ulimit::memory_limit({{ memory | 4096 }}); clustermq:::worker("{{ master }}")'

@mschubert, I haven't tried to see if the problem persists using a pure clustermq solution. Any code/pseudo-code to help me do this is welcome but I'm sure I can figure it out from the docs.

kendonB on 10 Sep 2019

clustermq::worker(timeout = Inf) in the template seems like a stopgap. But maybe the drake's main event loop should handle the worker errors directly. Right now we have

https://github.com/ropensci/drake/blob/a6f7fa7a00af97dc872ab20154cfd5659a9f81f4/R/backend_clustermq.R#L71-L89

Maybe it should be

cmq_master <- function(config) { 
   on.exit(config$workers$finalize()) 
   log_msg("begin scheduling targets", config = config) 
   while (config$counter$remaining > 0) { 
     msg <- config$workers$receive_data() 
     cmq_conclude_build(msg = msg, config = config)
     if (identical(msg$id, "WORKER_ERROR")) {
       stop("worker failed", call. = FALSE)
     }
     if (!identical(msg$token, "set_common_data_token")) { 
       log_msg("sending common data", config = config) 
       config$workers$send_common_data()
     } else if (!config$queue$empty()) { 
       cmq_next_target(config) 
     } else { 
       config$workers$send_shutdown_worker() 
     } 
   } 
   if (config$workers$cleanup()) { 
     on.exit() 
   } 
 }

wlandau on 11 Sep 2019

Just so I'm clear about my understanding of what's going on:

The workers finish their work after around 90 minutes then fail to make contact with the master. After 10 minutes of no contact those workers time out.
The master has Inf timeout by default so should still be there constantly listening. The master unfortunately does not see when the workers are done. In the above case, there are only two workers so it shouldn't take anywhere near 10 minutes to poll two workers to see that they're done?

So, I don't think adding an infinite timeout to the workers is going to solve anything here. It looks like there's something else stopping the communication happening in the first place.

kendonB on 11 Sep 2019

👍1

I agree with @kendonB here. If you give me a reproducible example with cmq alone I can help debug

mschubert on 11 Sep 2019

👍1

Good point, I do not know why it would take 10 whole minutes for a worker to contact the master after it finishes a target. (Certainly the 90 minutes it takes to build a target is longer than 10, but the 10-minute countdown starts after the target finishes, right?)

wlandau on 11 Sep 2019

Yes, I see the timeout happen about 10 minutes after I see the ..2019-09-10 13:23:26.546925 | eval'd: drake::cmq_buildtargetmetadepslayoutconfig at the end of the worker log

kendonB on 11 Sep 2019

Just stepping it through, it's getting stuck here at the beginning:

https://github.com/ropensci/rzmq/blob/bcc4c92f4653181720ea0183ecdaa73215cbad40/R/rzmq.R#L120

in the call to config$workers$receive_data() immediately after the second target is submitted.

I will check back in about an hour to see if it goes past this in the drake loop.

For reference:

https://github.com/ropensci/rzmq/blob/bcc4c92f4653181720ea0183ecdaa73215cbad40/src/interface.cpp#L243-L346

https://github.com/ropensci/drake/blob/a6f7fa7a00af97dc872ab20154cfd5659a9f81f4/R/backend_clustermq.R#L75

kendonB on 11 Sep 2019

It does not go past that point in the R code. So it definitely gets stuck here: https://github.com/ropensci/rzmq/blob/bcc4c92f4653181720ea0183ecdaa73215cbad40/R/rzmq.R#L120

in the call to config$workers$receive_data() immediately after the second target is submitted.

kendonB on 11 Sep 2019

👍1

Maybe something like this? Not yet sure if it can reproduce the problem after we crank up wait.

library(clustermq)
#> * Option 'clustermq.scheduler' not set, defaulting to 'LOCAL'
#> --- see: https://mschubert.github.io/clustermq/articles/userguide.html#configuration
library(rlang)
options(clustermq.scheduler = "multicore")

test_run <- function(wait = 0.1) {
  workers <- workers(n_jobs = 1L)
  on.exit(workers$finalize())
  workers$set_common_data(
    export = list(),
    fun = identity,
    const = list(),
    rettype = list(),
    pkgs = character(0),
    common_seed = 0L,
    token = "set_common_data_token"
  )
  main_loop(workers = workers, wait = wait)
}

main_loop <- function(workers, wait) {
  counter <- 3L
  while (counter > 0L) {
    print(counter)
    msg <- workers$receive_data()
    if (!is.null(msg$result)) {
      counter <- counter - 1L
    }
    if (!identical(msg$token, "set_common_data_token")) {
      workers$send_common_data()
    }
    else if (counter > 0L) {
      workers$send_call(
        expr = c(Sys.sleep(wait), 123),
        env = list(wait = wait)
      )
    } else {
      workers$send_shutdown_worker()
    }
  }
}

test_run(wait = 0.1)
#> [1] 3
#> [1] 3
#> [1] 3
#> [1] 2
#> [1] 1
#> Warning in workers$finalize(): Unclean shutdown for PIDs: 18389

^{Created on 2019-09-11 by the reprex package (v0.3.0)}

wlandau on 12 Sep 2019

I have this running now with a 90 minute wait time.

kendonB on 12 Sep 2019

Can confirm I get the same behaviour as before; as in, the worker finishes up the work and the master process hangs.

kendonB on 12 Sep 2019

In that case, should we move this issue to clustermq? (Maybe rzmq?) @mschubert, is this something that might resolve itself when you switch to custom ZeroMQ bindings?

@kendonB, what versions of clustermq, rzmq, and ZeroMQ are you using? And when you ran https://github.com/ropensci/drake/issues/1005#issuecomment-530637368, were you using the multicore backend or SLURM?

wlandau on 12 Sep 2019

I was using SLURM. I can also try using multicore tomorrow.

kendonB on 12 Sep 2019

I've tried your example and both multicore and slurm finish without errors for me.

What's your sessionInfo()?

mschubert on 12 Sep 2019

Same here. I cranked up the wait time to 90 minutes, and the master process completed normally in a reasonable 4.5 hours on both multicore and SGE. I am using ZeroMQ version 4.2.3, and my own sessionInfo() is below.

@kendonB, I am closing this issue because I believe the problem comes from your setup and not drake itself, but I will still try to help if I can.

R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS:   CENSORED/R-3.6.0/lib64/R/lib/libRblas.so
LAPACK: CENSORED/R-3.6.0/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] clustermq_0.8.8

loaded via a namespace (and not attached):
[1] compiler_3.6.0 R6_2.4.0       tools_3.6.0    parallel_3.6.0 rzmq_0.9.6

wlandau on 12 Sep 2019

I misunderstood what the script does. I shut mine down before 4.5 hours thinking I should expect it to be done in 90 minutes. I will try it again. My example was failing with 2 workers running 2 jobs (1 each) that each take around 90 minutes and return NULL. So I'll adjust the test code to have those properties.

kendonB on 12 Sep 2019

I'm on ZeroMQ 4.2.5 and session info is:

```r

devtools::session_info()
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 3.6.1 (2019-07-05)
os CentOS Linux 7 (Core)
system x86_64, linux-gnu
ui X11
language (EN)
collate en_NZ.UTF-8
ctype en_NZ.UTF-8
tz NZ
date 2019-09-13

─ Packages ───────────────────────────────────────────────────────────────────
package * version date lib source
assertthat 0.2.1 2019-03-21 [2] CRAN (R 3.6.1)
backports 1.1.4 2019-04-10 [2] CRAN (R 3.6.1)
callr 3.3.0 2019-07-04 [2] CRAN (R 3.6.1)
cli 1.1.0 2019-03-19 [2] CRAN (R 3.6.1)
clustermq * 0.8.8 2019-06-05 [1] CRAN (R 3.6.1)
crayon 1.3.4 2017-09-16 [2] CRAN (R 3.6.1)
desc 1.2.0 2018-05-01 [2] CRAN (R 3.6.1)
devtools 2.1.0 2019-07-06 [1] CRAN (R 3.6.1)
digest 0.6.20 2019-07-04 [2] CRAN (R 3.6.1)
fs 1.3.1 2019-05-06 [1] CRAN (R 3.6.1)
glue 1.3.1 2019-03-12 [2] CRAN (R 3.6.1)
magrittr 1.5 2014-11-22 [2] CRAN (R 3.6.1)
memoise 1.1.0 2017-04-21 [2] CRAN (R 3.6.1)
pkgbuild 1.0.3 2019-03-20 [2] CRAN (R 3.6.1)
pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.6.1)
prettyunits 1.0.2 2015-07-13 [2] CRAN (R 3.6.1)
processx 3.4.0 2019-07-03 [2] CRAN (R 3.6.1)
ps 1.3.0 2018-12-21 [2] CRAN (R 3.6.1)
R6 2.4.0 2019-02-14 [2] CRAN (R 3.6.1)
Rcpp 1.0.2 2019-07-25 [1] CRAN (R 3.6.1)
remotes 2.1.0 2019-06-24 [1] CRAN (R 3.6.1)
rlang 0.4.0 2019-06-25 [2] CRAN (R 3.6.1)
rprojroot 1.3-2 2018-01-03 [2] CRAN (R 3.6.1)
rzmq * 0.9.6 2019-03-01 [1] CRAN (R 3.6.1)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.1)
testthat 2.2.1 2019-07-25 [1] CRAN (R 3.6.1)
usethis 1.5.1 2019-07-04 [1] CRAN (R 3.6.1)
withr 2.1.2 2018-03-15 [2] CRAN (R 3.6.1)```

kendonB on 12 Sep 2019

Deleted a comment as I got test processes mixed up. Double checking now that it actually does work in the minimal clustermq example.

kendonB on 13 Sep 2019

I guess I should test again with 2 workers.

wlandau on 13 Sep 2019

OK apologies. I have now run this exact script:

library(clustermq)
#> * Option 'clustermq.scheduler' not set, defaulting to 'LOCAL'
#> --- see: https://mschubert.github.io/clustermq/articles/userguide.html#configuration
library(rlang)
options(clustermq.scheduler = "slurm", 
        clustermq.template = "slurm_clustermq.tmpl")

test_run <- function(wait = 0.1) {
  workers <- workers(n_jobs = 2L, template = 
                       list(
                         log_file = "make.log", 
                         memory = 500,
                         walltime = 250
                       ))
  on.exit(workers$finalize())
  workers$set_common_data(
    export = list(),
    fun = identity,
    const = list(),
    rettype = list(),
    pkgs = character(0),
    common_seed = 0L,
    token = "set_common_data_token"
  )
  main_loop(workers = workers, wait = wait)
}

main_loop <- function(workers, wait) {
  counter <- 2L
  while (counter > 0L) {
    print(counter)
    msg <- workers$receive_data()
    if (!is.null(msg$result)) {
      counter <- counter - 1L
    }
    if (!identical(msg$token, "set_common_data_token")) {
      workers$send_common_data()
    }
    else if (counter > 0L) {
      workers$send_call(
        expr = {
          print("I'm working.")
          c(Sys.sleep(wait), 123)
          NULL
        },
        env = list(wait = wait)
      )
    } else {
      workers$send_shutdown_worker()
    }
  }
}

test_run(wait = 3600*1.5)

The master process hangs (presumably at the workers$receive_data() call) and the two worker logs are:

> clustermq:::worker("tcp://mahuika01:7649")
2019-09-13 12:58:54.413862 | Master: tcp://mahuika01:7649
2019-09-13 12:58:54.422901 | WORKER_UP to: tcp://mahuika01:7649
2019-09-13 12:58:54.424276 | > DO_SETUP (0.000s wait)
2019-09-13 12:58:54.424635 | token from msg: set_common_data_token
2019-09-13 12:58:54.449676 | > DO_CALL (0.000s wait)
[1] "I'm working."
2019-09-13 14:28:54.486711 | eval'd: {print("I'm working.")c(Sys.sleep(wait), 123)NULL
Error in clustermq:::worker("tcp://mahuika01:7649") :
  Timeout reached, terminating
Execution halted

and

> clustermq:::worker("tcp://mahuika01:7649")
2019-09-13 12:58:55.203638 | Master: tcp://mahuika01:7649
2019-09-13 12:58:55.217463 | WORKER_UP to: tcp://mahuika01:7649
2019-09-13 12:58:55.218325 | > DO_SETUP (0.000s wait)
2019-09-13 12:58:55.218681 | token from msg: set_common_data_token
2019-09-13 12:58:55.226476 | > DO_CALL (0.001s wait)
[1] "I'm working."
2019-09-13 14:28:55.327345 | eval'd: {print("I'm working.")c(Sys.sleep(wait), 123)NULL
Error in clustermq:::worker("tcp://mahuika01:7649") :
  Timeout reached, terminating
Execution halted

and on the master I see:

[1] 2
[1] 2
[1] 2
[1] 2
[1] 2

My template file is:

#!/bin/sh
#SBATCH --job-name={{ job_name }}
#SBATCH --output={{ log_file | /dev/null }}%a # you can add .%a for array index
#SBATCH --mem-per-cpu={{ memory | 4096 }}
#SBATCH --array=1-{{ n_jobs }}
#SBATCH --account=landcare00063
#SBATCH --time={{ walltime }}
#SBATCH --partition={{ partition | large}}
source ~/.bashrc
source /etc/profile
## Export value of DEBUGME environemnt var to slave
## export DEBUGME=<%= Sys.getenv("DEBUGME") %>

ulimit -v $(( 1024 * {{ memory | 4096 }} ))
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'

My session info is:

> devtools::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 3.6.1 (2019-07-05)
 os       CentOS Linux 7 (Core)
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_NZ.UTF-8
 ctype    en_NZ.UTF-8
 tz       NZ
 date     2019-09-13

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date       lib source
 assertthat    0.2.1   2019-03-21 [2] CRAN (R 3.6.1)
 backports     1.1.4   2019-04-10 [2] CRAN (R 3.6.1)
 callr         3.3.0   2019-07-04 [2] CRAN (R 3.6.1)
 cli           1.1.0   2019-03-19 [2] CRAN (R 3.6.1)
 clustermq   * 0.8.8   2019-06-05 [1] CRAN (R 3.6.1)
 crayon        1.3.4   2017-09-16 [2] CRAN (R 3.6.1)
 desc          1.2.0   2018-05-01 [2] CRAN (R 3.6.1)
 devtools      2.1.0   2019-07-06 [1] CRAN (R 3.6.1)
 digest        0.6.20  2019-07-04 [2] CRAN (R 3.6.1)
 fs            1.3.1   2019-05-06 [1] CRAN (R 3.6.1)
 glue          1.3.1   2019-03-12 [2] CRAN (R 3.6.1)
 magrittr      1.5     2014-11-22 [2] CRAN (R 3.6.1)
 memoise       1.1.0   2017-04-21 [2] CRAN (R 3.6.1)
 pkgbuild      1.0.3   2019-03-20 [2] CRAN (R 3.6.1)
 pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.6.1)
 prettyunits   1.0.2   2015-07-13 [2] CRAN (R 3.6.1)
 processx      3.4.0   2019-07-03 [2] CRAN (R 3.6.1)
 ps            1.3.0   2018-12-21 [2] CRAN (R 3.6.1)
 R6            2.4.0   2019-02-14 [2] CRAN (R 3.6.1)
 Rcpp          1.0.2   2019-07-25 [1] CRAN (R 3.6.1)
 remotes       2.1.0   2019-06-24 [1] CRAN (R 3.6.1)
 rlang       * 0.4.0   2019-06-25 [2] CRAN (R 3.6.1)
 rprojroot     1.3-2   2018-01-03 [2] CRAN (R 3.6.1)
 rzmq          0.9.6   2019-03-01 [1] CRAN (R 3.6.1)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.1)
 testthat      2.2.1   2019-07-25 [1] CRAN (R 3.6.1)
 usethis       1.5.1   2019-07-04 [1] CRAN (R 3.6.1)
 withr         2.1.2   2018-03-15 [2] CRAN (R 3.6.1)

And my ZeroMQ version is:

> rzmq::zmq.version()
[1] "4.2.5"

@wlandau I'm happy to move this off the drake github and onto the clustermq github if you like. @mschubert what other troubleshooting steps could I do since it does seem environmental if this code is working for both of you.

kendonB on 13 Sep 2019

👀1 👍1

@kendonB, thank you for the thorough follow-up. I agree to move this to clustermq.

wlandau on 13 Sep 2019

Was this page helpful?

0 / 5 - 0 ratings

Related issues

clustermq SLURM job slow to get going

kendonB · 27Comments

Dynamic branching

wlandau · 61Comments

DSL based on dplyr-like verbs?

krlmlr · 44Comments

Consider different levels of checks

kendonB · 35Comments

Meta-make

krlmlr · 32Comments