Nextflow: K8s execution terminates abruptly if a pod requires more resources than are available

Created on 2 May 2018 · 24 Comments · Source: nextflow-io/nextflow

Hi Paolo,

We created the following persistent volume claim in our k8s cluster:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: testpvc
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 50Gi

We mounted a gluster volume to it and put some data inside:

ubuntu@vlad-k8s-test-k8s-master-nf-1:~$ tree /mnt/gluster/pvc-319c8c17-3ca6-11e8-89b1-fa163e31bb09
/mnt/gluster/pvc-319c8c17-3ca6-11e8-89b1-fa163e31bb09
`-- data
    |-- 22028_2#111_1.fastq.gz
    |-- 22028_2#111_2.fastq.gz
    |-- 22028_2#11_1.fastq.gz
    |-- 22028_2#11_2.fastq.gz
    |-- 22028_2#118_1.fastq.gz
    `-- 22028_2#118_2.fastq.gz

1 directory, 6 files

Then I ran the nf-core/rnaseq pipeline, which by default consumes fastq.gz files in the data folder and got this:

ubuntu@k8s-master:~/kubespray$ ./nextflow kuberun nf-core/rnaseq --genome GRCh37 -v testpvc:/mnt/gluster/
Pod started: nice-jang
N E X T F L O W  ~  version 0.28.0
Pulling nf-core/rnaseq ...
 downloaded from https://github.com/nf-core/rnaseq.git
Launching `nf-core/rnaseq` [nice-jang] - revision: 98ffbce439 [master]
===================================
 nfcore/rnaseq  ~  version 1.5dev
===================================
ERROR ~ Cannot find any reads matching: data/*{1,2}.fastq.gz
NB: Path needs to be enclosed in quotes!
NB: Path requires at least one * wildcard!
If this is single-end data, please specify --singleEnd on the command line.

 -- Check '.nextflow.log' file for details

Then I looked again in the mounted volume:

ubuntu@vlad-k8s-test-k8s-master-nf-1:~$ tree /mnt/gluster/pvc-319c8c17-3ca6-11e8-89b1-fa163e31bb09
/mnt/gluster/pvc-319c8c17-3ca6-11e8-89b1-fa163e31bb09
|-- data
|   |-- 22028_2#111_1.fastq.gz
|   |-- 22028_2#111_2.fastq.gz
|   |-- 22028_2#11_1.fastq.gz
|   |-- 22028_2#11_2.fastq.gz
|   |-- 22028_2#118_1.fastq.gz
|   `-- 22028_2#118_2.fastq.gz
|-- projects
|   `-- nf-core
|       `-- rnaseq
|           |-- assets
|           |   |-- biotypes_header.txt
|           |   |-- email_template.html
|           |   |-- email_template.txt
|           |   |-- heatmap_header.txt
|           |   |-- mdsplot_header.txt
|           |   |-- nfcore-rnaseq_logo.png
|           |   |-- NGI_logo.png
|           |   |-- SciLifeLab_logo.png
|           |   |-- sendmail_template.txt
|           |   `-- where_are_my_files.txt
|           |-- bin
|           |   |-- dupRadar.r
|           |   |-- edgeR_heatmap_MDS.r
|           |   |-- gtf2bed
|           |   |-- markdown_to_html.r
|           |   |-- merge_featurecounts.py
|           |   |-- RNA-pipeline-from-BAM.sh
|           |   `-- scrape_software_versions.py
|           |-- CHANGELOG.md
|           |-- conf
|           |   |-- aws.config
|           |   |-- base.config
|           |   |-- binac.config
|           |   |-- ccga.config
|           |   |-- cfc.config
|           |   |-- docker.config
|           |   |-- hebbe.config
|           |   |-- igenomes.config
|           |   |-- multiqc_config.yaml
|           |   |-- singularity.config
|           |   |-- uct_hex.config
|           |   |-- uppmax.config
|           |   |-- uppmax-devel.config
|           |   `-- uppmax-modules.config
|           |-- Dockerfile
|           |-- docs
|           |   |-- configuration
|           |   |   |-- adding_your_own.md
|           |   |   |-- aws.md
|           |   |   |-- c3se.md
|           |   |   |-- ccga.md
|           |   |   |-- local.md
|           |   |   |-- qbic.md
|           |   |   |-- reference_genomes.md
|           |   |   `-- uppmax.md
|           |   |-- images
|           |   |   |-- cutadapt_plot.png
|           |   |   |-- dupRadar_plot.png
|           |   |   |-- featureCounts_assignment_plot.png
|           |   |   |-- featureCounts_biotype_plot.png
|           |   |   |-- heatmap.png
|           |   |   |-- IKMB_logo.png
|           |   |   |-- infer_experiment.png
|           |   |   |-- inner_distance_concept.png
|           |   |   |-- junction_saturation.png
|           |   |   |-- mqc_hcplot_hocmzpdjsq.png
|           |   |   |-- mqc_hcplot_ltqchiyxfz.png
|           |   |   |-- mqc_hcplot_wtnqrdhkuc.png
|           |   |   |-- nfcore-rnaseq_logo.ai
|           |   |   |-- nfcore-rnaseq_logo.png
|           |   |   |-- NGI_logo.png
|           |   |   |-- preseq_complexity_curve.png
|           |   |   |-- preseq_plot.png
|           |   |   |-- QBiC_logo.png
|           |   |   |-- read_duplication.png
|           |   |   |-- rseqc_gene_body_coverage_plot.png
|           |   |   |-- rseqc_infer_experiment_plot.png
|           |   |   |-- rseqc_inner_distance_plot.png
|           |   |   |-- rseqc_junction_annotation_junctions_plot.png
|           |   |   |-- rseqc_junction_saturation_plot.png
|           |   |   |-- rseqc_read_distribution_plot.png
|           |   |   |-- rseqc_read_dups_plot.png
|           |   |   |-- saturation.png
|           |   |   |-- SciLifeLab_logo.png
|           |   |   `-- star_alignment_plot.png
|           |   |-- installation.md
|           |   |-- output.md
|           |   |-- README.md
|           |   |-- troubleshooting.md
|           |   `-- usage.md
|           |-- environment.yml
|           |-- LICENSE.md
|           |-- main.nf
|           |-- nextflow.config
|           |-- README.md
|           `-- tests
|               |-- install.sh
|               |-- run_test.sh
|               `-- test_data
|                   `-- ngi-rna_test_set.tar.bz2
`-- ubuntu
    |-- nextflow.config
    |-- results
    |   `-- pipeline_info
    |       `-- nfcore-rnaseq_trace.txt
    `-- work

16 directories, 91 files

So, NF was using the persistent volume claim, but didn't find the data folder. Here is the .nextflow.log:

May-02 16:25:26.503 [main] DEBUG nextflow.cli.Launcher - $> /usr/local/bin/nextflow run nf-core/rnaseq -name nice-jang --genome GRCh37
May-02 16:25:26.580 [main] INFO  nextflow.cli.CmdRun - N E X T F L O W  ~  version 0.28.0
May-02 16:25:26.673 [main] INFO  nextflow.cli.CmdRun - Pulling nf-core/rnaseq ...
May-02 16:25:26.688 [main] DEBUG nextflow.scm.RepositoryProvider - Request [credentials -:-] -> https://api.github.com/repos/nf-core/rnaseq/contents/nextflow.config
May-02 16:25:27.505 [main] DEBUG nextflow.scm.RepositoryProvider - Request [credentials -:-] -> https://api.github.com/repos/nf-core/rnaseq/contents/main.nf
May-02 16:25:27.799 [main] DEBUG nextflow.scm.RepositoryProvider - Request [credentials -:-] -> https://api.github.com/repos/nf-core/rnaseq
May-02 16:25:27.946 [main] DEBUG nextflow.scm.AssetManager - Pulling nf-core/rnaseq -- Using remote clone url: https://github.com/nf-core/rnaseq.git
May-02 16:25:36.047 [main] INFO  nextflow.cli.CmdRun -  downloaded from https://github.com/nf-core/rnaseq.git
May-02 16:25:36.252 [main] DEBUG nextflow.scm.AssetManager - Git config: /mnt/gluster/projects/nf-core/rnaseq/.git/config; branch: master; remote: origin; url: https://github.com/nf-core/rnaseq.git
May-02 16:25:36.253 [main] INFO  nextflow.cli.CmdRun - Launching `nf-core/rnaseq` [nice-jang] - revision: 98ffbce439 [master]
May-02 16:25:36.263 [main] DEBUG nextflow.config.ConfigBuilder - Found config base: /mnt/gluster/projects/nf-core/rnaseq/nextflow.config
May-02 16:25:36.264 [main] DEBUG nextflow.config.ConfigBuilder - Found config local: nextflow.config
May-02 16:25:36.265 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /mnt/gluster/projects/nf-core/rnaseq/nextflow.config
May-02 16:25:36.266 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /mnt/gluster/ubuntu/nextflow.config
May-02 16:25:36.273 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
May-02 16:25:36.529 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
May-02 16:25:36.577 [main] DEBUG nextflow.Session - Session uuid: a72f4bb2-f8ad-43b0-a93c-d2e66f391571
May-02 16:25:36.577 [main] DEBUG nextflow.Session - Run name: nice-jang
May-02 16:25:36.578 [main] DEBUG nextflow.Session - Executor pool size: 8
May-02 16:25:36.586 [main] DEBUG nextflow.cli.CmdRun -
  Version: 0.28.0 build 4779
  Modified: 10-03-2018 12:13 UTC (12:13 GMT)
  System: Linux 4.4.0-116-generic
  Runtime: Groovy 2.4.13 on OpenJDK 64-Bit Server VM 1.8.0_151-b12
  Encoding: UTF-8 (UTF-8)
  Process: 9@nice-jang [10.233.64.46]
  CPUs: 8 - Mem: 69.9 GB (62.2 GB) - Swap: 0 (0)
May-02 16:25:36.609 [main] DEBUG nextflow.Session - Work-dir: /mnt/gluster/ubuntu/work [fuseblk]
May-02 16:25:36.779 [main] DEBUG nextflow.Session - Session start invoked
May-02 16:25:36.782 [main] DEBUG nextflow.processor.TaskDispatcher - Dispatcher > start
May-02 16:25:36.783 [main] DEBUG nextflow.trace.TraceFileObserver - Flow starting -- trace file: /mnt/gluster/ubuntu/results/pipeline_info/nfcore-rnaseq_trace.txt
May-02 16:25:36.801 [main] DEBUG nextflow.script.ScriptRunner - > Script parsing
May-02 16:25:37.174 [main] DEBUG nextflow.script.ScriptRunner - > Launching execution
May-02 16:25:37.204 [main] DEBUG nextflow.file.FileHelper - Creating a file system instance for provider: S3FileSystemProvider
May-02 16:25:37.212 [main] DEBUG nextflow.file.FileHelper - AWS S3 config details: {}
May-02 16:25:37.492 [main] DEBUG nextflow.Channel - files for syntax: glob; folder: data/; pattern: *{1,2}.fastq.gz; options: [:]
May-02 16:25:37.543 [Thread-2] DEBUG nextflow.Channel - No such file: data/ -- Skipping visit
May-02 16:25:37.546 [main] INFO  nextflow.Nextflow - ===================================
May-02 16:25:37.547 [main] INFO  nextflow.Nextflow -  nfcore/rnaseq  ~  version 1.5dev
May-02 16:25:37.548 [main] INFO  nextflow.Nextflow - ===================================
May-02 16:25:37.549 [Actor Thread 2] ERROR nextflow.Nextflow - Cannot find any reads matching: data/*{1,2}.fastq.gz
NB: Path needs to be enclosed in quotes!
NB: Path requires at least one * wildcard!
If this is single-end data, please specify --singleEnd on the command line.

Then, as you recommended, I reran NF providing the full path, but I got a different error this time:

ubuntu@k8s-master:~/kubespray$ ./nextflow kuberun nf-core/rnaseq --genome GRCh37 --reads '/mnt/gluster/data/*{1,2}.fastq.gz' -v testpvc:/mnt/gluster/
Pod started: cheesy-morse
N E X T F L O W  ~  version 0.28.0
Launching `nf-core/rnaseq` [cheesy-morse] - revision: 98ffbce439 [master]
===================================
 nfcore/rnaseq  ~  version 1.5dev
===================================
ERROR ~ No such variable: USER

 -- Check script 'main.nf' at line: 239 or see '.nextflow.log' file for more details

The .nextflow.log in this case is:

May-02 16:40:26.033 [main] DEBUG nextflow.cli.Launcher - $> /usr/local/bin/nextflow run nf-core/rnaseq -name cheesy-morse --genome GRCh37 --reads /mnt/gluster/data/*{1,2}.fastq.gz
May-02 16:40:26.136 [main] INFO  nextflow.cli.CmdRun - N E X T F L O W  ~  version 0.28.0
May-02 16:40:26.498 [main] DEBUG nextflow.scm.AssetManager - Git config: /mnt/gluster/projects/nf-core/rnaseq/.git/config; branch: master; remote: origin; url: https://github.com/nf-core/rnaseq.git
May-02 16:40:26.568 [main] DEBUG nextflow.scm.AssetManager - Git config: /mnt/gluster/projects/nf-core/rnaseq/.git/config; branch: master; remote: origin; url: https://github.com/nf-core/rnaseq.git
May-02 16:40:26.789 [main] DEBUG nextflow.scm.AssetManager - Git config: /mnt/gluster/projects/nf-core/rnaseq/.git/config; branch: master; remote: origin; url: https://github.com/nf-core/rnaseq.git
May-02 16:40:26.789 [main] INFO  nextflow.cli.CmdRun - Launching `nf-core/rnaseq` [cheesy-morse] - revision: 98ffbce439 [master]
May-02 16:40:29.993 [main] DEBUG nextflow.config.ConfigBuilder - Found config base: /mnt/gluster/projects/nf-core/rnaseq/nextflow.config
May-02 16:40:29.994 [main] DEBUG nextflow.config.ConfigBuilder - Found config local: nextflow.config
May-02 16:40:29.995 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /mnt/gluster/projects/nf-core/rnaseq/nextflow.config
May-02 16:40:29.995 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /mnt/gluster/ubuntu/nextflow.config
May-02 16:40:30.003 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
May-02 16:40:30.250 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
May-02 16:40:30.298 [main] DEBUG nextflow.Session - Session uuid: f78746f8-da81-43c1-81b5-8280c504b6c5
May-02 16:40:30.298 [main] DEBUG nextflow.Session - Run name: cheesy-morse
May-02 16:40:30.299 [main] DEBUG nextflow.Session - Executor pool size: 8
May-02 16:40:30.307 [main] DEBUG nextflow.cli.CmdRun -
  Version: 0.28.0 build 4779
  Modified: 10-03-2018 12:13 UTC (12:13 GMT)
  System: Linux 4.4.0-116-generic
  Runtime: Groovy 2.4.13 on OpenJDK 64-Bit Server VM 1.8.0_151-b12
  Encoding: UTF-8 (UTF-8)
  Process: 10@cheesy-morse [10.233.64.50]
  CPUs: 8 - Mem: 69.9 GB (62.3 GB) - Swap: 0 (0)
May-02 16:40:30.328 [main] DEBUG nextflow.Session - Work-dir: /mnt/gluster/ubuntu/work [fuseblk]
May-02 16:40:30.482 [main] DEBUG nextflow.Session - Session start invoked
May-02 16:40:30.485 [main] DEBUG nextflow.processor.TaskDispatcher - Dispatcher > start
May-02 16:40:30.486 [main] DEBUG nextflow.trace.TraceFileObserver - Flow starting -- trace file: /mnt/gluster/ubuntu/results/pipeline_info/nfcore-rnaseq_trace.txt
May-02 16:40:30.514 [main] DEBUG nextflow.script.ScriptRunner - > Script parsing
May-02 16:40:30.870 [main] DEBUG nextflow.script.ScriptRunner - > Launching execution
May-02 16:40:30.900 [main] DEBUG nextflow.file.FileHelper - Creating a file system instance for provider: S3FileSystemProvider
May-02 16:40:30.907 [main] DEBUG nextflow.file.FileHelper - AWS S3 config details: {}
May-02 16:40:31.186 [main] DEBUG nextflow.Channel - files for syntax: glob; folder: /mnt/gluster/data/; pattern: *{1,2}.fastq.gz; options: [:]
May-02 16:40:31.196 [main] INFO  nextflow.Nextflow - ===================================
May-02 16:40:31.196 [main] INFO  nextflow.Nextflow -  nfcore/rnaseq  ~  version 1.5dev
May-02 16:40:31.198 [main] INFO  nextflow.Nextflow - ===================================
May-02 16:40:31.201 [main] DEBUG nextflow.Session - Session aborted -- Cause: No such property: USER for class: _nf_script_0c6d925f
May-02 16:40:31.224 [main] DEBUG nextflow.Session - The following nodes are still active:
  [operator] ifEmpty
  [operator] into

May-02 16:40:31.230 [main] ERROR nextflow.cli.Launcher - @unknown
groovy.lang.MissingPropertyException: No such property: USER for class: _nf_script_0c6d925f
        at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.unwrap(ScriptBytecodeAdapter.java:53)
        at org.codehaus.groovy.runtime.callsite.PogoGetPropertySite.getProperty(PogoGetPropertySite.java:52)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callGroovyObjectGetProperty(AbstractCallSite.java:307)
        at _nf_script_0c6d925f.run(_nf_script_0c6d925f:239)
        at nextflow.script.ScriptRunner.run(ScriptRunner.groovy:341)
        at nextflow.script.ScriptRunner.execute(ScriptRunner.groovy:165)
        at nextflow.cli.CmdRun.run(CmdRun.groovy:223)
        at nextflow.cli.Launcher.run(Launcher.groovy:428)
        at nextflow.cli.Launcher.main(Launcher.groovy:582)

Do you have any idea what the problem can be? Sorry for such a long text...

All 24 comments

Let's start from the beginning. The error says:

ERROR ~ No such variable: USER

-- Check script 'main.nf' at line: 239 or see '.nextflow.log' file for more details

And if you look at that line you will see that it tries to access the USER env var.

It looks like that variable is not defined in the k8s pod. Frankly, I don't know whether this is a K8s feature or a problem with the container. Any ideas from the K8s gurus in your team?

I have tried the Channel definition by itself, and I get the same error. But yes, $USER does not exist in k8s pods unless you put it there (by experiment).

Thanks @pditommaso, that's the power of fresh eyes! I'd battled with it for a few hours before posting and didn't realise that the $USER problem meant that the input channel didn't produce any error! So, I forked nf-core/rnaseq, removed the $USER call, reran it, and it's working!

ubuntu@k8s-master:~/kubespray$ ./nextflow kuberun wikiselev/rnaseq --genome GRCh37 --reads '/mnt/gluster/data/*{1,2}.fastq.gz' -v testpvc:/mnt/gluster/
Pod started: pedantic-legentil
N E X T F L O W  ~  version 0.28.0
Pulling wikiselev/rnaseq ...
 downloaded from https://github.com/wikiselev/rnaseq.git
Launching `wikiselev/rnaseq` [pedantic-legentil] - revision: 34a8221a87 [master]
===================================
 nfcore/rnaseq  ~  version 1.5dev
===================================
Run Name       : pedantic-legentil
Reads          : /mnt/gluster/data/*{1,2}.fastq.gz
Data Type      : Paired-End
Genome         : GRCh37
Strandedness   : None
Trim R1        : 0
Trim R2        : 0
Trim 3' R1     : 0
Trim 3' R2     : 0
Aligner        : STAR
STAR Index     : s3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/
GTF Annotation : s3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf
BED Annotation : s3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.bed
Save Reference : No
Save Trimmed   : No
Save Intermeds : No
Max Memory     : 128 GB
Max CPUs       : 16
Max Time       : 10d
Output dir     : ./results
Working dir    : /mnt/gluster/ubuntu/work
Container      : nfcore/rnaseq:latest
Pipeline Release: master
Current home   : /root
Current path   : /mnt/gluster/ubuntu
R libraries    : false
Script dir     : /mnt/gluster/projects/wikiselev/rnaseq
Config Profile : standard
=========================================
[warm up] executor > k8s
[warm up] executor > local
[c5/e0879a] Submitted process > workflow_summary_mqc
[32/c1d8ed] Submitted process > get_software_versions
[40/c138ac] Submitted process > fastqc (22028_2#118)
[c4/c561a6] Submitted process > fastqc (22028_2#11)
[38/3dfac0] Submitted process > fastqc (22028_2#111)
[32/c1d8ed] NOTE: Process `get_software_versions` terminated with an error exit status (127) -- Error is ignored
[66/9c9b11] Submitted process > trim_galore (22028_2#118)
[1d/b54ba3] Submitted process > trim_galore (22028_2#11)
[e0/395975] Submitted process > trim_galore (22028_2#111)

Now I will need the power of @theobarberbany to visualise the k8s cluster dashboard and look at the pods.

Thanks, @pditommaso !

Always read the error message :)

This is a good workaround; however, it remains to be understood whether NF should inject the USER variable or K8s should do that.

Ok, it just failed due to not enough resources, but it provided a very informative error message!

May-02 20:29:21.828 [Task monitor] DEBUG nextflow.Session - Session aborted -- Cause: Invalid pod status -- missing container statuses
 {
     "kind": "Pod",
     "apiVersion": "v1",
     "metadata": {
         "name": "nf-52b131a30381b0b88926a9c12e5b1ff1",
         "namespace": "default",
         "selfLink": "/api/v1/namespaces/default/pods/nf-52b131a30381b0b88926a9c12e5b1ff1/status",
         "uid": "7f17fea0-4e47-11e8-89b1-fa163e31bb09",
         "resourceVersion": "2674905",
         "creationTimestamp": "2018-05-02T20:29:21Z",
         "labels": {
             "app": "nextflow",
             "processName": "star",
             "runName": "pedantic-legentil",
             "sessionId": "uuid-bb3f1a1f-ad9a-4e5f-a2f1-46a4d3fbd2a1",
             "taskName": "star_22028_2_118_1"
         }
     },
     "spec": {
         "volumes": [
             {
                 "name": "vol-7",
                 "persistentVolumeClaim": {
                     "claimName": "testpvc"
                 }
             },
             {
                 "name": "default-token-29mnj",
                 "secret": {
                     "secretName": "default-token-29mnj",
                     "defaultMode": 420
                 }
             }
         ],
         "containers": [
             {
                 "name": "nf-52b131a30381b0b88926a9c12e5b1ff1",
                 "image": "nfcore/rnaseq:latest",
                 "command": [
                     "/bin/bash",
                     "-ue",
                     ".command.run"
                 ],
                 "workingDir": "/mnt/gluster/ubuntu/work/52/b131a30381b0b88926a9c12e5b1ff1",
                 "resources": {
                     "limits": {
                         "cpu": "10",
                         "memory": "80Gi"
                     },
                     "requests": {
                         "cpu": "10",
                         "memory": "80Gi"
                     }
                 },
                 "volumeMounts": [
                     {
                         "name": "vol-7",
                         "mountPath": "/mnt/gluster"
                     },
                     {
                         "name": "default-token-29mnj",
                         "readOnly": true,
                         "mountPath": "/var/run/secrets/kubernetes.io/serviceaccount"
                     }
                 ],
                 "terminationMessagePath": "/dev/termination-log",
                 "terminationMessagePolicy": "File",
                 "imagePullPolicy": "Always"
             }
         ],
         "restartPolicy": "Never",
         "terminationGracePeriodSeconds": 30,
         "dnsPolicy": "ClusterFirst",
         "serviceAccountName": "default",
         "serviceAccount": "default",
         "securityContext": {

         },
         "schedulerName": "default-scheduler"
     },
     "status": {
         "phase": "Pending",
         "conditions": [
             {
                 "type": "PodScheduled",
                 "status": "False",
                 "lastProbeTime": null,
                 "lastTransitionTime": "2018-05-02T20:29:21Z",
                 "reason": "Unschedulable",
                 "message": "0/4 nodes are available: 4 Insufficient cpu, 4 Insufficient memory."
             }
         ],
         "qosClass": "Guaranteed"
     }
 }

I am really excited, but time for a beer now! ;-)

Oh, I need to study this carefully. But it's time for my daily dose of Netflix series :)

Could you include the complete error stack trace from the .nextflow.log file?

Hi Paolo, here it is: nextflow.log

Ok, basically what's happening is that NF looks for the containerStatuses attribute in the status response object. If you look at your response you'll see that it is missing:

    "status": {
         "phase": "Pending",
         "conditions": [
             {
                 "type": "PodScheduled",
                 "status": "False",
                 "lastProbeTime": null,
                 "lastTransitionTime": "2018-05-02T20:29:21Z",
                 "reason": "Unschedulable",
                 "message": "0/4 nodes are available: 4 Insufficient cpu, 4 Insufficient memory."
             }
         ],
         "qosClass": "Guaranteed"
     }

Now, I can make NF handle this condition better, but I don't understand whether it's a permanent error, as the message "0/4 nodes are available: 4 Insufficient cpu, 4 Insufficient memory." suggests, or whether the pod is going to be executed at some point.

What does kubectl describe report for this pod (nf-52b131a30381b0b88926a9c12e5b1ff1)?
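
To illustrate the condition described above, here is a minimal Python sketch (not Nextflow's actual code) that inspects a pod status object and returns the scheduler message when the pod has no container statuses and its PodScheduled condition reports Unschedulable. The function name `unschedulable_reason` is hypothetical; the JSON is the status block from the response above.

```python
import json


def unschedulable_reason(pod):
    """Return the scheduler message if the pod is unschedulable, else None."""
    status = pod.get("status", {})
    # A pod that has been placed on a node reports containerStatuses.
    if status.get("containerStatuses"):
        return None
    for cond in status.get("conditions", []):
        if (cond.get("type") == "PodScheduled"
                and cond.get("status") == "False"
                and cond.get("reason") == "Unschedulable"):
            return cond.get("message", "no message reported")
    return None


# Status block copied from the pod response quoted in this thread.
pod = json.loads("""
{
  "status": {
    "phase": "Pending",
    "conditions": [
      {
        "type": "PodScheduled",
        "status": "False",
        "lastProbeTime": null,
        "lastTransitionTime": "2018-05-02T20:29:21Z",
        "reason": "Unschedulable",
        "message": "0/4 nodes are available: 4 Insufficient cpu, 4 Insufficient memory."
      }
    ],
    "qosClass": "Guaranteed"
  }
}
""")

reason = unschedulable_reason(pod)
if reason is not None:
    print("unschedulable: " + reason)
```

Returning only the condition message keeps the report to a single line, instead of dumping the whole pod object.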

@wikiselev I was paged? I'm not too sure I understand what you mean by 'visualise the k8s cluster dashboard and look at the pods.'? 😉

@pditommaso, that process requested 80GB of RAM which none of our k8s nodes had, so it would never be executed. I suppose NF was waiting to find any node but timed out. So, this can be actually tricky to handle, unless you can check the sizes of the nodes in advance and report to the user that it is physically not possible to run the process with so much memory.

@pditommaso The pod above, whose PodScheduled condition has the reason Unschedulable, is accepted by the k8s scheduler, which assumes that a node it can be scheduled on will eventually become available. Hence, it will just sit there and wait.

You can probe each node for available resources (Nodes.Status.Capacity)

https://godoc.org/k8s.io/api/core/v1#NodeStatus
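
As a rough sketch of that idea, the Python below compares a memory request against per-node capacity strings. It assumes the capacity values have already been fetched from the cluster; the node names and sizes are hypothetical, and the helpers `parse_memory` and `nodes_that_fit` are illustrative names rather than part of any library. Only integer quantities with the common binary and decimal suffixes are handled.

```python
# Binary and decimal suffixes used by Kubernetes memory quantities.
_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50, "Ei": 2**60,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15, "E": 10**18,
}


def parse_memory(quantity):
    """Convert a quantity such as '80Gi' or '65965268Ki' to bytes."""
    # Try two-letter suffixes first so 'Gi' is not mistaken for a bare unit.
    for suffix in sorted(_UNITS, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * _UNITS[suffix]
    return int(quantity)  # no suffix: plain number of bytes


def nodes_that_fit(request, node_memory):
    """Return the names of nodes whose memory capacity covers the request."""
    needed = parse_memory(request)
    return [name for name, mem in node_memory.items()
            if parse_memory(mem) >= needed]


# Hypothetical node names and capacities, for illustration only.
nodes = {"node-1": "65965268Ki", "node-2": "65965268Ki", "node-3": "32Gi"}

if not nodes_that_fit("80Gi", nodes):
    print("No node is large enough for an 80Gi request")
```

A check against total capacity only shows whether a request could ever fit; it says nothing about memory already claimed by other pods at that moment.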

Good shot, I will fix this by checking the PodScheduled.reason. If it stays in the Unschedulable state too long, it will raise an exception.

*good :)
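
The approach just described (raise an exception if the pod stays Unschedulable too long) could look roughly like the Python sketch below. This is an assumption about the general shape, not the actual Nextflow change: `wait_until_scheduled`, `PodUnschedulableError`, and the timeout values are all hypothetical, and `get_reason` stands in for whatever call queries the pod status.

```python
import time


class PodUnschedulableError(Exception):
    """Raised when a pod stays unschedulable longer than the allowed time."""


def wait_until_scheduled(get_reason, timeout_s, poll_s=1.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Poll until the pod is scheduled or the timeout elapses.

    get_reason() must return the scheduler message while the pod is
    unschedulable, and None once it has been scheduled.
    """
    deadline = clock() + timeout_s
    while True:
        reason = get_reason()
        if reason is None:
            return  # the pod was scheduled
        if clock() >= deadline:
            raise PodUnschedulableError("pod is unschedulable: " + reason)
        sleep(poll_s)


# Example with a stand-in status source that never becomes schedulable.
try:
    wait_until_scheduled(lambda: "4 Insufficient memory.",
                         timeout_s=0.2, poll_s=0.05)
except PodUnschedulableError as err:
    print(err)
```

Passing `clock` and `sleep` as parameters makes the timeout logic easy to exercise without real waiting.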

I've uploaded a new build that should fix this issue, please give it a try using the following command:

NXF_VER=0.30.0-SNAPSHOT nextflow kuberun .. etc 

Hi Paolo, just tested it. It stopped immediately after starting a process whose memory requirements were too high.

Here is the nextflow.log.

And here is the pod description of the process:

ubuntu@vlad-k8s-test-k8s-master-nf-1:/mnt/gluster/pvc-319c8c17-3ca6-11e8-89b1-fa163e31bb09$ kubectl describe pod nf-8ea0ac9f3d28e490449cf2541583697b
Name:         nf-8ea0ac9f3d28e490449cf2541583697b
Namespace:    default
Node:         <none>
Labels:       app=nextflow
              processName=trim_galore
              runName=zen-lichterman
              sessionId=uuid-41dd33f2-54a1-433f-a726-a9471e41844c
              taskName=trim_galore_22028_2_11
Annotations:  <none>
Status:       Pending
IP:
Containers:
  nf-8ea0ac9f3d28e490449cf2541583697b:
    Image:  nfcore/rnaseq:latest
    Port:   <none>
    Command:
      /bin/bash
      -ue
      .command.run
    Limits:
      cpu:     2
      memory:  80Gi
    Requests:
      cpu:        2
      memory:     80Gi
    Environment:  <none>
    Mounts:
      /mnt/gluster from vol-2 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-29mnj (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  vol-2:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  testpvc
    ReadOnly:   false
  default-token-29mnj:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-29mnj
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     <none>
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  56s (x15 over 4m)  default-scheduler  0/4 nodes are available: 4 Insufficient memory.

This also means that

k8s {
  cleanup = false
}

from #687 also worked well! So I can still see the unfinished pods:

ubuntu@vlad-k8s-test-k8s-master-nf-1:/mnt/gluster/pvc-319c8c17-3ca6-11e8-89b1-fa163e31bb09$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                                    READY     STATUS    RESTARTS   AGE
default       nf-8ea0ac9f3d28e490449cf2541583697b                     0/1       Pending   0          11m
default       nf-a1013023bd5da02377d702acc6dd5d4d                     0/1       Pending   0          11m
default       nf-b3d7eb417539d8914f6f237ca8f493d4                     0/1       Pending   0          11m
kube-system   hostpath-provisioner-64fd979b8c-khkdw                   1/1       Running   0          30d
kube-system   kube-apiserver-vlad-k8s-test-k8s-master-nf-1            1/1       Running   0          30d
kube-system   kube-controller-manager-vlad-k8s-test-k8s-master-nf-1   1/1       Running   0          30d
kube-system   kube-dns-79d99cdcd5-5b9zn                               3/3       Running   0          30d
kube-system   kube-dns-79d99cdcd5-bptsw                               3/3       Running   0          30d
kube-system   kube-flannel-747v6                                      2/2       Running   0          30d
kube-system   kube-flannel-vl6kt                                      2/2       Running   0          30d
kube-system   kube-flannel-w9glx                                      2/2       Running   0          30d
kube-system   kube-flannel-wq4hl                                      2/2       Running   0          30d
kube-system   kube-proxy-vlad-k8s-test-k8s-master-nf-1                1/1       Running   0          30d
kube-system   kube-proxy-vlad-k8s-test-k8s-node-nf-1                  1/1       Running   0          30d
kube-system   kube-proxy-vlad-k8s-test-k8s-node-nf-2                  1/1       Running   0          30d
kube-system   kube-proxy-vlad-k8s-test-k8s-node-nf-3                  1/1       Running   0          30d
kube-system   kube-scheduler-vlad-k8s-test-k8s-master-nf-1            1/1       Running   0          30d
kube-system   kubedns-autoscaler-5564b5585f-jjchw                     1/1       Running   0          30d
kube-system   kubernetes-dashboard-6bbb86ffc4-lgvwb                   1/1       Running   0          30d
kube-system   nginx-proxy-vlad-k8s-test-k8s-node-nf-1                 1/1       Running   0          30d
kube-system   nginx-proxy-vlad-k8s-test-k8s-node-nf-2                 1/1       Running   0          30d
kube-system   nginx-proxy-vlad-k8s-test-k8s-node-nf-3                 1/1       Running   0          30d

Well, if a process requests resources that are not available it should stop. Now it's returning a more explicit error message:

Pod is unschedulable -- cause: 0/4 nodes are available: 4 Insufficient memory.

Do you have a better proposal?

No, looks good. My only suggestion would be to move the pod description out of the log. Otherwise, there is too much unnecessary information about the pod, when the only thing needed is the sentence above saying that the pod is unschedulable.

Do you mean in the .nextflow.log file or in the output printed on the screen? If the latter, can you attach the NF output?

In both, I would say. Here is the NF output:

ubuntu@k8s-master:~/kubespray$ NXF_VER=0.30.0-SNAPSHOT ./nextflow kuberun wikiselev/rnaseq --genome GRCh37 --reads '/mnt/gluster/data/*_{1,2}.fastq.gz' -v testpvc:/mnt/gluster
Pod started: zen-lichterman
N E X T F L O W  ~  version 0.30.0-SNAPSHOT
Pulling wikiselev/rnaseq ...
 downloaded from https://github.com/wikiselev/rnaseq.git
Launching `wikiselev/rnaseq` [zen-lichterman] - revision: 6d44261a24 [master]
===================================
 nfcore/rnaseq  ~  version 1.5dev
===================================
Run Name       : zen-lichterman
Reads          : /mnt/gluster/data/*_{1,2}.fastq.gz
Data Type      : Paired-End
Genome         : GRCh37
Strandedness   : None
Trim R1        : 0
Trim R2        : 0
Trim 3' R1     : 0
Trim 3' R2     : 0
Aligner        : STAR
STAR Index     : s3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/
GTF Annotation : s3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf
BED Annotation : s3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.bed
Save Reference : No
Save Trimmed   : No
Save Intermeds : No
Max Memory     : 80 GB
Max CPUs       : 7
Max Time       : 10d
Output dir     : ./results
Working dir    : /mnt/gluster/ubuntu/work
Container      : nfcore/rnaseq:latest
Pipeline Release: master
Current home   : /root
Current path   : /mnt/gluster/ubuntu
R libraries    : false
Script dir     : /mnt/gluster/projects/wikiselev/rnaseq
Config Profile : standard
=========================================
[warm up] executor > k8s
[warm up] executor > local
[6c/5e819d] Submitted process > workflow_summary_mqc
[c2/35c537] Submitted process > fastqc (22028_2#118)
[8e/a0ac9f] Submitted process > trim_galore (22028_2#11)
[cf/db0c82] Submitted process > fastqc (22028_2#111)
[b5/816b74] Submitted process > get_software_versions
[00/4e1040] Submitted process > fastqc (22028_2#11)
[b3/d7eb41] Submitted process > trim_galore (22028_2#118)
[a1/013023] Submitted process > trim_galore (22028_2#111)
[b5/816b74] NOTE: Process `get_software_versions` terminated with an error exit status (127) -- Error is ignored
ERROR ~ Error executing process > 'trim_galore (22028_2#11)'

Caused by:
  Pod is unschedulable -- cause: 0/4 nodes are available: 4 Insufficient memory.
 {
     "kind": "Pod",
     "apiVersion": "v1",
     "metadata": {
         "name": "nf-8ea0ac9f3d28e490449cf2541583697b",
         "namespace": "default",
         "selfLink": "/api/v1/namespaces/default/pods/nf-8ea0ac9f3d28e490449cf2541583697b/status",
         "uid": "ea950372-5475-11e8-89b1-fa163e31bb09",
         "resourceVersion": "3613822",
         "creationTimestamp": "2018-05-10T17:16:45Z",
         "labels": {
             "app": "nextflow",
             "processName": "trim_galore",
             "runName": "zen-lichterman",
             "sessionId": "uuid-41dd33f2-54a1-433f-a726-a9471e41844c",
             "taskName": "trim_galore_22028_2_11"
         }
     },
     "spec": {
         "volumes": [
             {
                 "name": "vol-2",
                 "persistentVolumeClaim": {
                     "claimName": "testpvc"
                 }
             },
             {
                 "name": "default-token-29mnj",
                 "secret": {
                     "secretName": "default-token-29mnj",
                     "defaultMode": 420
                 }
             }
         ],
         "containers": [
             {
                 "name": "nf-8ea0ac9f3d28e490449cf2541583697b",
                 "image": "nfcore/rnaseq:latest",
                 "command": [
                     "/bin/bash",
                     "-ue",
                     ".command.run"
                 ],
                 "workingDir": "/mnt/gluster/ubuntu/work/8e/a0ac9f3d28e490449cf2541583697b",
                 "resources": {
                     "limits": {
                         "cpu": "2",
                         "memory": "80Gi"
                     },
                     "requests": {
                         "cpu": "2",
                         "memory": "80Gi"
                     }
                 },
                 "volumeMounts": [
                     {
                         "name": "vol-2",
                         "mountPath": "/mnt/gluster"
                     },
                     {
                         "name": "default-token-29mnj",
                         "readOnly": true,
                         "mountPath": "/var/run/secrets/kubernetes.io/serviceaccount"
                     }
                 ],
                 "terminationMessagePath": "/dev/termination-log",
                 "terminationMessagePolicy": "File",
                 "imagePullPolicy": "Always"
             }
         ],
         "restartPolicy": "Never",
         "terminationGracePeriodSeconds": 30,
         "dnsPolicy": "ClusterFirst",
         "serviceAccountName": "default",
         "serviceAccount": "default",
         "securityContext": {

         },
         "schedulerName": "default-scheduler"
     },
     "status": {
         "phase": "Pending",
         "conditions": [
             {
                 "type": "PodScheduled",
                 "status": "False",
                 "lastProbeTime": null,
                 "lastTransitionTime": "2018-05-10T17:16:45Z",
                 "reason": "Unschedulable",
                 "message": "0/4 nodes are available: 4 Insufficient memory."
             }
         ],
         "qosClass": "Guaranteed"
     }
 }



 -- Check '.nextflow.log' file for details
[nfcore/rnaseq] Pipeline Complete
WARN: Killing pending tasks (6)
WARN: To render the execution DAG in the required format it is required to install Graphviz -- See http://www.graphviz.org for more info.

or maybe just in the output... You probably want to keep it in the log...

It makes sense. I will try to simplify the error message. 👍

In the latest beta the K8s JSON response is reported only in the log files for known error conditions.

You may want to give it a try with the following command:

NXF_VER=0.30.0-BETA1 nextflow kuberun .. etc
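
For anyone who wants to pull the short cause out of a full pod response like the one pasted above, here is a minimal sketch. It is not Nextflow's implementation; the helper name `unschedulable_reason` is hypothetical, and the sample is a trimmed copy of the `status` block from the JSON in this thread. It only relies on the `phase` and `conditions` fields that appear in that response.

```python
import json


def unschedulable_reason(pod):
    """Return the scheduler message if the pod is Pending and unschedulable, else None.

    `pod` is a dict parsed from a Kubernetes Pod JSON document.
    """
    status = pod.get("status", {})
    if status.get("phase") != "Pending":
        return None
    for cond in status.get("conditions", []):
        if (
            cond.get("type") == "PodScheduled"
            and cond.get("status") == "False"
            and cond.get("reason") == "Unschedulable"
        ):
            return cond.get("message")
    return None


# Trimmed copy of the "status" block from the response pasted in this thread.
sample = json.loads("""
{
    "kind": "Pod",
    "status": {
        "phase": "Pending",
        "conditions": [
            {
                "type": "PodScheduled",
                "status": "False",
                "reason": "Unschedulable",
                "message": "0/4 nodes are available: 4 Insufficient memory."
            }
        ]
    }
}
""")

print(unschedulable_reason(sample))
```

With the sample above, this prints only the one-line scheduler message, which is the part that matters on the console; the full JSON can stay in the log file.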

I assume this is solved as well. For any testing, please use 0.30.0-BETA2 as shown below:

NXF_VER=0.30.0-BETA2 nextflow kuberun .. etc