Expect executor to wait until job complete and report success or failure. Should not later find completed output and exit status of 0 for 'failed' jobs.
What seems to happen with some long-running jobs is that the LSF executor decides the job has failed. When I go to the job directory (some time later) I then find the output is there and the exit code file says '0'.
I'm unsure if increasing exitStatusReadTimeoutMillis will help. Is this the same issue as https://github.com/nextflow-io/nextflow/issues/927?
Difficult to say, it's a fairly-non- deterministic error, and I haven't been able to reproduce it with simple 'sleep' tests. But jobs of more than a couple of minutes seem to trigger the issue.
Erroring part of .nextflow.log is:
...
Feb-23 23:13:20.369 [Task monitor] DEBUG nextflow.executor.GridTaskHandler - Failed to get exit status for process TaskHandler[jobId: 3137621; id: 1321; name: download_fastqs (1319); status: RUNNING; exit: -; error: -; workDir: <path>/work/E-MTAB-5530/danio_rerio/smartseq/72/1f67ab72a847f8e5191d78d5a5ec3f started: 1550962950317; exited: -; ] -- exitStatusReadTimeoutMillis: 500000; delta: 500045
Current queue status:
> (empty)
Content of workDir: <path>/work/E-MTAB-5530/danio_rerio/smartseq/72/1f67ab72a847f8e5191d78d5a5ec3f
> total 32
> drwxrwxr-x 2 <user <group> 4096 Feb 23 23:02 .
> drwxr-xr-x 6 <user <group> 4096 Feb 23 23:02 ..
> -rw-r--r-- 1 <user <group> 0 Feb 23 23:02 .command.begin
> -rw-r--r-- 1 <user <group> 0 Feb 23 23:02 .command.err
> -rw-r--r-- 1 <user <group> 118 Feb 23 23:03 .command.log
> -rw-r--r-- 1 <user <group> 118 Feb 23 23:03 .command.out
> -rw-r--r-- 1 <user <group> 2183 Feb 23 23:02 .command.run
> -rw-r--r-- 1 <user <group> 197 Feb 23 23:02 .command.sh
> -rw-r--r-- 1 <user <group> 3548 Feb 23 23:02 .command.stub
> -rw-r--r-- 1 <user <group> 158 Feb 23 23:13 .command.trace
Feb-23 23:13:20.370 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 3137621; id: 1321; name: download_fastqs (1319); status: COMPLETED; exit: -; error: -; workDir: <path>/work/E-MTAB-5530/danio_rerio/smartseq/72/1f67ab72a847f8e5191d78d5a5ec3f started: 1550962950317; exited: -; ]
Feb-23 23:13:20.430 [Task monitor] INFO nextflow.processor.TaskProcessor - [72/1f67ab] NOTE: Process `download_fastqs (1319)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (1)
...
Full log to be attached below.
Looking at the log it looks NF cannot retrieve the job .exitcode file.
Feb-23 23:13:20.369 [Task monitor] DEBUG nextflow.executor.GridTaskHandler - Failed to get exit status for process TaskHandler[jobId: 3137621; id: 1321; name: download_fastqs (1319); status: RUNNING; exit: -; error: -; workDir: <path>/work/E-MTAB-5530/danio_rerio/smartseq/72/1f67ab72a847f8e5191d78d5a5ec3f started: 1550962950317; exited: -; ] -- exitStatusReadTimeoutMillis: 500000; delta: 500045
Current queue status:
> (empty)
Content of workDir: <path>/work/E-MTAB-5530/danio_rerio/smartseq/72/1f67ab72a847f8e5191d78d5a5ec3f
> total 32
> drwxrwxr-x 2 <user <group> 4096 Feb 23 23:02 .
> drwxr-xr-x 6 <user <group> 4096 Feb 23 23:02 ..
> -rw-r--r-- 1 <user <group> 0 Feb 23 23:02 .command.begin
> -rw-r--r-- 1 <user <group> 0 Feb 23 23:02 .command.err
> -rw-r--r-- 1 <user <group> 118 Feb 23 23:03 .command.log
> -rw-r--r-- 1 <user <group> 118 Feb 23 23:03 .command.out
> -rw-r--r-- 1 <user <group> 2183 Feb 23 23:02 .command.run
> -rw-r--r-- 1 <user <group> 197 Feb 23 23:02 .command.sh
> -rw-r--r-- 1 <user <group> 3548 Feb 23 23:02 .command.stub
> -rw-r--r-- 1 <user <group> 158 Feb 23 23:13 .command.trace
This may happen if the shared file system is not consistent and because it uses a too aggressive caching policy.
When I go to look, that file is in fact there- though obviously it wasn't at run time. So there's some sort of delay. Is there a workaround? Is exitStatusReadTimeoutMillis it?
Indeed, it meant for that, but it looks you already set to > 8 min, not sure that increasing it will help. You can try .
Okay, thanks. I have set it to stupidly long timescales, will see what happens.
@pditommaso I also happened this issue multiple times recently and use SGE, Job was still running on SGE but NF execute qstat and fetch empty result, then nextflow think this job should have been finished and read its exitcode, but exitcode was not exists, NF will throw a ProcessException and resumeOrDie, if we specify errorStrategy retry, NF will resubmit another same job, here is the snapshot:

here is error log snippets:
Feb-26 13:08:05.726 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] getting queue status > cmd: qstat
Feb-26 13:08:05.781 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status > cmd exit: 0
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
55366 0.50000 nf-QC_(lib chenqi r 02/26/2019 09:22:57 [email protected] 1
Feb-26 13:08:05.784 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status >
job: 55366: RUNNING
Feb-26 13:08:05.785 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status > jobId `55366` active status: true
Feb-26 13:08:10.722 [Task monitor] TRACE n.processor.TaskPollingMonitor - Scheduler queue size: 1 (iteration: 2706)
Feb-26 13:08:10.725 [Task monitor] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode
Feb-26 13:08:10.726 [Task monitor] TRACE nextflow.executor.GridTaskHandler - JobId `55366` exit file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode - lastModified: null - size: null
Feb-26 13:08:10.727 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status > jobId `55366` active status: true
Feb-26 13:08:15.722 [Task monitor] TRACE n.processor.TaskPollingMonitor - Scheduler queue size: 1 (iteration: 2707)
Feb-26 13:08:15.725 [Task monitor] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode
Feb-26 13:08:15.726 [Task monitor] TRACE nextflow.executor.GridTaskHandler - JobId `55366` exit file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode - lastModified: null - size: null
Feb-26 13:08:15.726 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status > jobId `55366` active status: true
Feb-26 13:08:20.722 [Task monitor] TRACE n.processor.TaskPollingMonitor - Scheduler queue size: 1 (iteration: 2708)
Feb-26 13:08:20.725 [Task monitor] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode
Feb-26 13:08:20.726 [Task monitor] TRACE nextflow.executor.GridTaskHandler - JobId `55366` exit file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode - lastModified: null - size: null
Feb-26 13:08:20.726 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status > jobId `55366` active status: true
Feb-26 13:08:25.722 [Task monitor] TRACE n.processor.TaskPollingMonitor - Scheduler queue size: 1 (iteration: 2709)
Feb-26 13:08:25.725 [Task monitor] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode
Feb-26 13:08:25.726 [Task monitor] TRACE nextflow.executor.GridTaskHandler - JobId `55366` exit file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode - lastModified: null - size: null
Feb-26 13:08:25.727 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status > jobId `55366` active status: true
Feb-26 13:08:30.722 [Task monitor] TRACE n.processor.TaskPollingMonitor - Scheduler queue size: 1 (iteration: 2710)
Feb-26 13:08:30.725 [Task monitor] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode
Feb-26 13:08:30.726 [Task monitor] TRACE nextflow.executor.GridTaskHandler - JobId `55366` exit file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode - lastModified: null - size: null
Feb-26 13:08:30.727 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status > jobId `55366` active status: true
Feb-26 13:08:35.722 [Task monitor] TRACE n.processor.TaskPollingMonitor - Scheduler queue size: 1 (iteration: 2711)
Feb-26 13:08:35.725 [Task monitor] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode
Feb-26 13:08:35.726 [Task monitor] TRACE nextflow.executor.GridTaskHandler - JobId `55366` exit file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode - lastModified: null - size: null
Feb-26 13:08:35.727 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status > jobId `55366` active status: true
Feb-26 13:08:40.722 [Task monitor] TRACE n.processor.TaskPollingMonitor - Scheduler queue size: 1 (iteration: 2712)
Feb-26 13:08:40.725 [Task monitor] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode
Feb-26 13:08:40.725 [Task monitor] TRACE nextflow.executor.GridTaskHandler - JobId `55366` exit file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode - lastModified: null - size: null
Feb-26 13:08:40.726 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status > jobId `55366` active status: true
Feb-26 13:08:45.722 [Task monitor] TRACE n.processor.TaskPollingMonitor - Scheduler queue size: 1 (iteration: 2713)
Feb-26 13:08:45.725 [Task monitor] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode
Feb-26 13:08:45.726 [Task monitor] TRACE nextflow.executor.GridTaskHandler - JobId `55366` exit file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode - lastModified: null - size: null
Feb-26 13:08:45.727 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status > jobId `55366` active status: true
Feb-26 13:08:50.723 [Task monitor] TRACE n.processor.TaskPollingMonitor - Scheduler queue size: 1 (iteration: 2714)
Feb-26 13:08:50.726 [Task monitor] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode
Feb-26 13:08:50.727 [Task monitor] TRACE nextflow.executor.GridTaskHandler - JobId `55366` exit file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode - lastModified: null - size: null
Feb-26 13:08:50.727 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status > jobId `55366` active status: true
Feb-26 13:08:55.723 [Task monitor] TRACE n.processor.TaskPollingMonitor - Scheduler queue size: 1 (iteration: 2715)
Feb-26 13:08:55.727 [Task monitor] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode
Feb-26 13:08:55.727 [Task monitor] TRACE nextflow.executor.GridTaskHandler - JobId `55366` exit file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode - lastModified: null - size: null
Feb-26 13:08:55.728 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status > jobId `55366` active status: true
Feb-26 13:09:00.723 [Task monitor] TRACE n.processor.TaskPollingMonitor - Scheduler queue size: 1 (iteration: 2716)
Feb-26 13:09:00.726 [Task monitor] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode
Feb-26 13:09:00.727 [Task monitor] TRACE nextflow.executor.GridTaskHandler - JobId `55366` exit file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode - lastModified: null - size: null
Feb-26 13:09:00.727 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status > jobId `55366` active status: true
Feb-26 13:09:05.723 [Task monitor] TRACE n.processor.TaskPollingMonitor - Scheduler queue size: 1 (iteration: 2717)
Feb-26 13:09:05.728 [Task monitor] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode
Feb-26 13:09:05.729 [Task monitor] TRACE nextflow.executor.GridTaskHandler - JobId `55366` exit file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode - lastModified: null - size: null
Feb-26 13:09:05.731 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] getting queue status > cmd: qstat
Feb-26 13:09:05.785 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status > cmd exit: 0
Feb-26 13:09:05.787 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status >
(empty)
Feb-26 13:09:05.789 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status > map does not contain jobId: `55366`
Feb-26 13:09:05.791 [Task monitor] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode
Feb-26 13:09:05.792 [Task monitor] DEBUG nextflow.executor.GridTaskHandler - Failed to get exist status for process TaskHandler[jobId: 55366; id: 1; name: QC (lib-S033_gooalgene-A_MU-2_BHVMV2CCXY_S1_L003); status: RUNNING; exit: -; error: -; workDir: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972 started: 1551144180132; exited: -; ] -- exitStatusReadTimeoutMillis: 270000; delta: 5725168
Current queue status:
> (null)
Content of workDir: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972
null
Feb-26 13:09:05.797 [Task monitor] TRACE g.r.m.CustomMetaClassCreationHandle - Registering custom meta-class for: class java.io.FileReader
Feb-26 13:09:05.850 [Task monitor] DEBUG nextflow.executor.GridTaskHandler - clear complete job 55366 result : true
Feb-26 13:09:05.851 [Task monitor] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode
Feb-26 13:09:05.852 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 55366; id: 1; name: QC (lib-S033_gooalgene-A_MU-2_BHVMV2CCXY_S1_L003); status: COMPLETED; exit: -; error: -; workDir: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972 started: 1551144180132; exited: -; ]
Feb-26 13:09:05.853 [Task monitor] TRACE nextflow.processor.TaskProcessor - finalizing process > QC (lib-S033_gooalgene-A_MU-2_BHVMV2CCXY_S1_L003) -- id: 1; name: QC (lib-S033_gooalgene-A_MU-2_BHVMV2CCXY_S1_L003); type: SCRIPTLET; exit: -; error: null; workDir: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972
Feb-26 13:09:05.880 [Task monitor] TRACE nextflow.processor.TaskProcessor - Handling unexpected condition for
task: id: 1; name: QC (lib-S033_gooalgene-A_MU-2_BHVMV2CCXY_S1_L003); type: SCRIPTLET; exit: -; error: null; workDir: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972
error [nextflow.exception.ProcessFailedException]: Process `QC (lib-S033_gooalgene-A_MU-2_BHVMV2CCXY_S1_L003)` terminated for an unknown reason -- Likely it has been terminated by the external system
Feb-26 13:09:05.924 [Task monitor] INFO nextflow.processor.TaskProcessor - [fc/64e809] NOTE: Process `QC (lib-S033_gooalgene-A_MU-2_BHVMV2CCXY_S1_L003)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (1)
Feb-26 13:09:05.928 [Task monitor] TRACE nextflow.processor.TaskHandler - task stats: QC (lib-S033_gooalgene-A_MU-2_BHVMV2CCXY_S1_L003); start: 1551144180132; complete: 1551157745794; realtime: 13565662 [13565662];
Feb-26 13:09:05.972 [pool-2-thread-1] TRACE nextflow.processor.TaskProcessor - [QC (lib-S033_gooalgene-A_MU-2_BHVMV2CCXY_S1_L003)] Cacheable folder=/mnt/beegfs/chenqi/project/marco_fish/02survey/work/e9/17d9bf496fc81c269f6546e1f32f5a -- exists=false; try=2; shouldTryCache=false
Feb-26 13:09:05.973 [pool-2-thread-1] TRACE nextflow.processor.TaskProcessor - [QC (lib-S033_gooalgene-A_MU-2_BHVMV2CCXY_S1_L003)] actual run folder: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/e9/17d9bf496fc81c269f6546e1f32f5a
Feb-26 13:09:05.974 [pool-2-thread-1] TRACE nextflow.processor.TaskDispatcher - Scheduling process: id: 2; name: QC (lib-S033_gooalgene-A_MU-2_BHVMV2CCXY_S1_L003); type: SCRIPTLET; exit: -; error: null; workDir: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/e9/17d9bf496fc81c269f6546e1f32f5a
Feb-26 13:09:05.976 [pool-2-thread-1] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/e9/17d9bf496fc81c269f6546e1f32f5a/.exitcode
Feb-26 13:09:05.977 [pool-2-thread-1] TRACE n.processor.TaskPollingMonitor - Scheduled task > TaskHandler[jobId: null; id: 2; name: QC (lib-S033_gooalgene-A_MU-2_BHVMV2CCXY_S1_L003); status: NEW; exit: -; error: -; workDir: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/e9/17d9bf496fc81c269f6546e1f32f5a started: -; exited: -; ]
@Crabime If qstat is not returning the job status NF assumes it's terminated, and therefore kill it after a while.
@pditommaso I think there might be some problem, qstat really had value but ProcessBuilder return empty, Maybe it's a bug of ProcessBuilder
I may have spotted a problem related to the ProcessBuilder. Please have a try with the snapshot I've just uploaded using this command below:
NXF_VER=19.03.0-SNAPSHOT nextflow run .. [your options]
Hello! We are also seeing a very similar error on our slurm cluster. We could potentially try out the snapshot version you're suggesting but it would be easier for us to switch back to an older version of nextflow (no internet on the cluster). Do you have a version you think is from before this issue was introduced?
Thanks!
Sorry for hijacking this issue, but I think we're seeing the same issue at our slurm cluster. We did indeed try out the snapshot version you recommended without success. The error persists with the new version of nextflow running Sarek:
Feb-27 18:15:25.974 [Task monitor] DEBUG nextflow.executor.GridTaskHandler - Failed to get exit status for process TaskHandler[jobId: 558269; id: 16; name: MergeBams (DREAM-DREAM_N); status: RUNNING; exit: -; error: -; workDir: /lupus/pro
j/ngi2016004/private/johannes/sarek_run_dirs/test_sarek_configs_20190221/work/45/4511b27f5f994f305befddebdf4dc2 started: 1551287305704; exited: -; ] -- exitStatusReadTimeoutMillis: 270000; delta: 270011
Current queue status:
> job: 557807: DONE
> job: 557808: DONE
> job: 557809: DONE
> job: 557804: DONE
> job: 557805: DONE
> job: 557806: DONE
Content of workDir: /lupus/proj/ngi2016004/private/johannes/sarek_run_dirs/test_sarek_configs_20190221/work/45/4511b27f5f994f305befddebdf4dc2
null
Feb-27 18:15:25.975 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 558269; id: 16; name: MergeBams (DREAM-DREAM_N); status: COMPLETED; exit: -; error: -; workDir: /lupus/proj/ngi2016004/private/johannes/sarek_run_dirs/test_sarek_configs_20190221/work/45/4511b27f5f994f305befddebdf4dc2 started: 1551287305704; exited: -; ]
Feb-27 18:15:26.145 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump output of process 'MergeBams (DREAM-DREAM_N)' -- Cause: java.nio.file.NoSuchFileException: /lupus/proj/ngi2016004/private/johannes/sarek_run_dirs/test_sarek_configs_20190221/work/45/4511b27f5f994f305befddebdf4dc2/.command.out
Feb-27 18:15:26.162 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'MergeBams (DREAM-DREAM_N)' -- Cause: java.nio.file.NoSuchFileException: /lupus/proj/ngi2016004/private/johannes/sarek_run_dirs/test_sarek_configs_20190221/work/45/4511b27f5f994f305befddebdf4dc2/.command.err
Feb-27 18:15:26.243 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'MergeBams (DREAM-DREAM_N)'
Caused by:
Process `MergeBams (DREAM-DREAM_N)` terminated for an unknown reason -- Likely it has been terminated by the external system
Command executed:
samtools merge --threads 16 DREAM_N.bam DREAM_N_2.bam DREAM_N_1.bam DREAM_N_4.bam DREAM_N_5.bam DREAM_N_3.bam
Command exit status:
-
But the job finishes as it is supposed to:
$ ls -lha /lupus/proj/ngi2016004/private/johannes/sarek_run_dirs/test_sarek_configs_20190221/work/45/4511b27f5f994f305befddebdf4dc2/
total 97G
drwxrwsr-x 2 alneberg ngi2016004 4.0K Feb 27 19:19 .
drwxrwsr-x 3 alneberg ngi2016004 4.0K Feb 27 18:08 ..
-rw-rw-r-- 1 alneberg ngi2016004 0 Feb 27 18:08 .command.begin
-rw-rw-r-- 1 alneberg ngi2016004 0 Feb 27 19:13 .command.err
-rw-rw-r-- 1 alneberg ngi2016004 52 Feb 27 18:08 .command.log
-rw-rw-r-- 1 alneberg ngi2016004 0 Feb 27 19:13 .command.out
-rw-rw-r-- 1 alneberg ngi2016004 12K Feb 27 18:08 .command.run
-rw-rw-r-- 1 alneberg ngi2016004 136 Feb 27 18:08 .command.sh
-rw-r--r-- 1 alneberg ngi2016004 248 Feb 27 19:13 .command.trace
-rw-r--r-- 1 alneberg ngi2016004 97G Feb 27 19:19 DREAM_N.bam
-rw-rw-r-- 1 alneberg ngi2016004 1 Feb 27 19:19 .exitcode
and .exitcode is fine:
$ cat /lupus/proj/ngi2016004/private/johannes/sarek_run_dirs/test_sarek_configs_20190221/work/45/4511b27f5f994f305befddebdf4dc2/.exitcode
0
And more detailed info from the slurm command:
2019-02-27 19:20:02 jobid=558269 jobstate=COMPLETED username=alneberg account=ngi2016004 nodes=i171 procs=16 partition=node qos=normal jobname=nf-MergeBams_(DREAM-DREAM_N) maxmemory_in_GiB=0.0 maxmemory_node=i171 timelimit=2-00:00:00 submit_time=2019-02-27T18:08:20 start_time=2019-02-27T18:08:21 end_time=2019-02-27T19:20:02 runtime=01:11:41 margin=1-22:48:19 queuetime=00:00:01
The start time of the job is quite close to 7 minutes before the nextflow process shuts down. Also, the files .command.err and .command.out are empty with quite late modification times. Could any of these be clues to what is going on?
This is different. The issue identified above is that NF fail to retried the queue status output and reports
Current queue status:
> (empty)
Instead in you case the queue status if fetched.
Hmmm, that is indeed different. But the message before is the same:
Failed to get exit status for proces... even though the process is running...
We did see the (empty) queue status last week on the old version. Do you think the nextflow snapshot version fixed this issue?
I think the (empty) status is fixed by the 19.03.0-SNAPSHOT, waiting a confirmation from other users.
The problem you are reporting it may happen when the file system caching is too eager and NF can't see the .exitcode file. In the log file there's should be the directory listing of the failing task. If you think it's a different problem please report as a separate issue.
@pinin4fjords @Crabime Does the uploaded patch solves the problem with the empty queue status problem?
Sorry @pditommaso I've been away.
The issue occurred non-deterministically for me, so it might take a while to crop up again anyway, but I'm running with the patch now so I'll let you know.
@pditommaso So far I've not seen this error again.
Excellent.
@pditommaso we were just checking in to see if this has been reported, awesome to see that it appears to be addressed! We'll check this out as well.
@pditommaso should we grab the 19.03.0-SNAPSHOT snapshot from the GitHub master branch? I'm not seeing a release tagged: https://github.com/nextflow-io/nextflow/releases
Not completely sure it's solved, was a bit random for me anyway.
If you just run with NXF_VER as above NF should download the update itself
@cjfields snapshots are not uploaded as GH releases. You can use it as mentioned above.
@pinin4fjords include the log file if it still occurs.
@pditommaso thx, got snapshot mixed up w/ 'edge'
I'll upload march edge release next week including this patch.
We have seen this error again in v19.07.0 on LSF
Here is the log .nextflow.log and one set of .command.* and .exitcode for an error job.