Nextflow: K8s executor when the cluster has no available cpus

Created on 3 Aug 2018 · 8 Comments · Source: nextflow-io/nextflow

Bug report

Expected behavior and actual behavior

Expected: If no resources are currently available, but the job's resource requirements fit within the capacity of at least one node, NF should wait until the resources become available.

Actual: NF appears to have a timeout, after which it gives up on scheduling the job and reports an error.

Program output

We started NF on our cluster, which had 4 nodes with 54 CPUs and 453.5 GB of memory each. We ran our pipeline with 57 samples, in a process that required 4 CPUs per sample. When the tasks started running in parallel, there were not enough CPUs for all of them. Here is what NF reported:

fabulous-keller:/mnt/gluster/tic-31# ./RESUME
4943: final merge table will be in results/4943/featureCounts/NF-4943-fc-genecounts.txt
Running pipeline
N E X T F L O W  ~  version 0.31.0
Launching `cellgeni/rnaseq-noqc` [disturbed_pauling] - revision: 7b3d85f0fe [master]
WARN: The config file defines settings for an unknown process: makeSTARindex
=========================================
         RNASeq pipeline v1.5
=========================================
Run Name       : disturbed_pauling
Sample file    : tables/samples-4943.txt
Data Type      : Paired-End
Genome         : GRCh38
Biotype tag    : gene_biotype
Strandedness   : None
Aligner        : STAR
STAR Index     : /mnt/gluster/genomes/indexes/GRCh38/release-91/75/star/
GTF Annotation : /mnt/gluster/genomes/ensembl/GRCh38/Homo_sapiens.GRCh38.91.gtf
BED Annotation : /mnt/gluster/genomes/ensembl/GRCh38/Homo_sapiens.GRCh38.91.bed
Save Intermeds : No
Max Memory     : 128 GB
Max CPUs       : 16
Max Time       : 12h
Output dir     : results/4943
Working dir    : /mnt/gluster/tic-31/work
Container      : [$irods:quay.io/cellgeni/irods, $crams_to_fastq:quay.io/biocontainers/samtools:1.8--4, $star:quay.io/biocontainers/star:2.5.4a--0, $featureCounts:quay.io/biocontainers/subread:1.6.2--ha92aebf_0, $merge_featureCounts:python:3.7, $salmon:quay.io/biocontainers/salmon:0.9.1--1]
Pipeline Release: master
Current home   : /root
Current path   : /mnt/gluster/tic-31
Script dir     : /mnt/gluster/projects/cellgeni/rnaseq-noqc
Config Profile : docker
=========================================
WARN: Process configuration syntax $processName has been deprecated -- Replace `process.$irods = <value>` with a process selector
[warm up] executor > k8s
WARN: Process configuration syntax $processName has been deprecated -- Replace `process.$crams_to_fastq = <value>` with a process selector
WARN: Process configuration syntax $processName has been deprecated -- Replace `process.$star = <value>` with a process selector
WARN: Process configuration syntax $processName has been deprecated -- Replace `process.$featureCounts = <value>` with a process selector
WARN: Process configuration syntax $processName has been deprecated -- Replace `process.$merge_featureCounts = <value>` with a process selector
[2e/63ff90] Submitted process > irods (iPSCard7082644)
[1c/abb742] Submitted process > irods (iPSCard7082643)
[58/58b4e8] Submitted process > irods (iPSCard7082629)
[ff/c20ac6] Submitted process > irods (iPSCard7082645)
[af/09bd3e] Submitted process > irods (iPSCard7082646)
[a3/26e43f] Submitted process > irods (iPSCard7082637)
[1e/346c20] Submitted process > irods (iPSCard7082636)
[6e/60d13d] Submitted process > irods (iPSCard7082632)
[96/2019f5] Submitted process > irods (iPSCard7082621)
[ef/b884cc] Submitted process > irods (iPSCard7082623)
[3d/6c8692] Submitted process > irods (iPSCard7082633)
[0e/ae2d13] Submitted process > irods (iPSCard7082631)
[56/098923] Submitted process > irods (iPSCard7082627)
[4e/e04338] Submitted process > irods (iPSCard7082640)
[f5/8cbf57] Submitted process > irods (iPSCard7082628)
[0a/3e3794] Submitted process > irods (iPSCard7082642)
[e8/ff0bbd] Submitted process > irods (iPSCard7082620)
[29/061605] Submitted process > irods (iPSCard7082622)
[9d/0bfd4b] Submitted process > irods (iPSCard7082634)
[a2/08067e] Submitted process > irods (iPSCard7082641)
[b4/b9026a] Submitted process > irods (iPSCard7082630)
[6e/3cc1db] Submitted process > irods (iPSCard7082626)
[36/6dac11] Submitted process > irods (iPSCard7082638)
[95/013f5e] Submitted process > irods (iPSCard7082635)
[d4/af12ee] Submitted process > irods (iPSCard7082624)
[b7/7bfd5c] Submitted process > irods (iPSCard7082647)
[7b/d0cf7f] Submitted process > irods (iPSCard7082639)
[57/9f5050] Submitted process > irods (iPSCard7082625)
[67/41a0f8] Submitted process > irods (iPSCard7082648)
[56/7079da] Submitted process > crams_to_fastq (iPSCard7082640)
[8b/75ba99] Submitted process > irods (iPSCard7082649)
[a9/ed6c05] Submitted process > irods (iPSCard7082650)
[a9/7e19b1] Submitted process > crams_to_fastq (iPSCard7082647)
[4b/a6415c] Submitted process > irods (iPSCard7082651)
[14/408363] Submitted process > crams_to_fastq (iPSCard7082638)
[9b/3f70ad] Submitted process > irods (iPSCard7082652)
[01/c3a514] Submitted process > crams_to_fastq (iPSCard7082644)
[df/738d27] Submitted process > irods (iPSCard7082653)
[56/d2203b] Submitted process > crams_to_fastq (iPSCard7082631)
[83/929c9f] Submitted process > crams_to_fastq (iPSCard7082646)
[8c/07a2b1] Submitted process > irods (iPSCard7082654)
[ca/614d72] Submitted process > irods (iPSCard7082655)
[ac/f76c4d] Submitted process > crams_to_fastq (iPSCard7082628)
[bd/137851] Submitted process > irods (iPSCard7082656)
[fe/404f7a] Submitted process > crams_to_fastq (iPSCard7082641)
[90/3431bd] Submitted process > irods (iPSCard7082657)
[f9/c43e85] Submitted process > crams_to_fastq (iPSCard7082639)
[96/b1d95a] Submitted process > irods (iPSCard7082658)
[5d/06a4e9] Submitted process > crams_to_fastq (iPSCard7082636)
[35/5dcd67] Submitted process > irods (iPSCard7082659)
[cb/39f282] Submitted process > crams_to_fastq (iPSCard7082623)
[a2/33ea78] Submitted process > crams_to_fastq (iPSCard7082632)
[0a/97c505] Submitted process > irods (iPSCard7082661)
[45/d6da9b] Submitted process > irods (iPSCard7082660)
[b9/809f35] Submitted process > crams_to_fastq (iPSCard7082643)
[c5/47c492] Submitted process > irods (iPSCard7082662)
[fb/887353] Submitted process > crams_to_fastq (iPSCard7082633)
[e1/833d8f] Submitted process > irods (iPSCard7082663)
[10/1d0975] Submitted process > crams_to_fastq (iPSCard7082648)
[11/be7ae4] Submitted process > irods (iPSCard7082664)
[21/f6871d] Submitted process > crams_to_fastq (iPSCard7082645)
[4a/5413a2] Submitted process > irods (iPSCard7082665)
[85/dbf707] Submitted process > crams_to_fastq (iPSCard7082637)
[49/441310] Submitted process > irods (iPSCard7082666)
[21/e9beda] Submitted process > irods (iPSCard7082667)
[50/adaed6] Submitted process > crams_to_fastq (iPSCard7082621)
[e3/cfb9a0] Submitted process > crams_to_fastq (iPSCard7082627)
[d8/5bedad] Submitted process > irods (iPSCard7082668)
[ba/a97b04] Submitted process > crams_to_fastq (iPSCard7082620)
[43/2ab02e] Submitted process > irods (iPSCard7082669)
[c9/3a9374] Submitted process > crams_to_fastq (iPSCard7082626)
[38/f1be74] Submitted process > irods (iPSCard7082670)
[68/fd7e3c] Submitted process > crams_to_fastq (iPSCard7082634)
[56/fa4377] Submitted process > crams_to_fastq (iPSCard7082642)
[6e/1d75a2] Submitted process > irods (iPSCard7082671)
[58/27dd6a] Submitted process > crams_to_fastq (iPSCard7082635)
[23/807e3d] Submitted process > irods (iPSCard7082672)
[e0/c4dfa1] Submitted process > irods (iPSCard7082673)
[dc/2b5fdd] Submitted process > crams_to_fastq (iPSCard7082624)
[69/0516f8] Submitted process > crams_to_fastq (iPSCard7082629)
[58/3d044f] Submitted process > irods (iPSCard7082675)
[be/51109d] Submitted process > irods (iPSCard7082674)
[1a/83a999] Submitted process > crams_to_fastq (iPSCard7082625)
[46/ebf722] Submitted process > irods (iPSCard7082676)
[d2/4a78fc] Submitted process > crams_to_fastq (iPSCard7082622)
[08/7fa30e] Submitted process > crams_to_fastq (iPSCard7082630)
[84/b0c04e] Submitted process > crams_to_fastq (iPSCard7082649)
[f1/85e52b] Submitted process > crams_to_fastq (iPSCard7082651)
[70/8277d9] Submitted process > crams_to_fastq (iPSCard7082654)
[50/5aba22] Submitted process > crams_to_fastq (iPSCard7082650)
[2d/75c8e4] Submitted process > crams_to_fastq (iPSCard7082653)
[3a/f94fd7] Submitted process > crams_to_fastq (iPSCard7082655)
[1c/77bd1a] Submitted process > crams_to_fastq (iPSCard7082660)
[f4/5fad57] Submitted process > crams_to_fastq (iPSCard7082652)
[a0/adbe4f] Submitted process > crams_to_fastq (iPSCard7082662)
[d6/e5ddae] Submitted process > crams_to_fastq (iPSCard7082671)
[df/f1daab] Submitted process > crams_to_fastq (iPSCard7082656)
[8d/ac2bf8] Submitted process > crams_to_fastq (iPSCard7082657)
[4b/00db4f] Submitted process > crams_to_fastq (iPSCard7082658)
[28/16147b] Submitted process > crams_to_fastq (iPSCard7082663)
[ae/bb0457] Submitted process > crams_to_fastq (iPSCard7082664)
[92/373338] Submitted process > crams_to_fastq (iPSCard7082668)
[bc/7d5d18] Submitted process > crams_to_fastq (iPSCard7082669)
[8b/c0ce73] Submitted process > crams_to_fastq (iPSCard7082659)
[47/81b227] Submitted process > crams_to_fastq (iPSCard7082661)
[07/2ddfb8] Submitted process > crams_to_fastq (iPSCard7082665)
[a2/a4e5be] Submitted process > crams_to_fastq (iPSCard7082674)
[59/eb5e4c] Submitted process > crams_to_fastq (iPSCard7082676)
ERROR ~ Error executing process > 'crams_to_fastq (iPSCard7082676)'

Caused by:
  K8s pod cannot be scheduled -- 0/5 nodes are available: 1 Insufficient memory, 5 Insufficient cpu.


 -- Check '.nextflow.log' file for details
WARN: Killing pending tasks (57)

The log file is here: .nextflow.log

Here is the pod description (note that the pod produced a Warning, not an Error):

ubuntu@cellgen-kub-k8s-master-nf-1:/mnt/gluster/tic-31$ kubectl describe pods nf-59eb5e4c1098160fd0bc8717f773ee9b
Name:         nf-59eb5e4c1098160fd0bc8717f773ee9b
Namespace:    default
Node:         cellgen-kub-k8s-node-nf-1/10.0.0.15
Start Time:   Fri, 03 Aug 2018 13:07:39 +0000
Labels:       app=nextflow
              processName=crams_to_fastq
              runName=disturbed_pauling
              sessionId=uuid-dfbc9a32-64bc-47b4-b3ff-151b9ed14d06
              taskName=crams_to_fastq_iPSCard7082676
Annotations:  <none>
Status:       Running
IP:           10.233.67.206
Containers:
  nf-59eb5e4c1098160fd0bc8717f773ee9b:
    Container ID:  docker://490deaca7b2e0d442ccf80ff156e3307b8c130ec4023ed0812100109151fe434
    Image:         quay.io/biocontainers/samtools:1.8--4
    Image ID:      docker-pullable://quay.io/biocontainers/samtools@sha256:26c59a82e9270d00ea11faba4ef2b07b15d9120a7daba63ce682b5a90a2a8357
    Port:          <none>
    Command:
      /bin/bash
      -ue
      .command.run
    State:          Running
      Started:      Fri, 03 Aug 2018 13:07:50 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     4
      memory:  4Gi
    Requests:
      cpu:        4
      memory:     4Gi
    Environment:  <none>
    Mounts:
      /mnt/gluster from vol-165 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-s9rlq (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          True
  PodScheduled   True
Volumes:
  vol-165:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  testpvc
    ReadOnly:   false
  default-token-s9rlq:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-s9rlq
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     <none>
Events:
  Type     Reason                 Age                From                                Message
  ----     ------                 ----               ----                                -------
  Warning  FailedScheduling       17m (x4 over 17m)  default-scheduler                   0/5 nodes are available: 1 Insufficient memory, 5 Insufficient cpu.
  Normal   Scheduled              17m                default-scheduler                   Successfully assigned nf-59eb5e4c1098160fd0bc8717f773ee9b to cellgen-kub-k8s-node-nf-1
  Normal   SuccessfulMountVolume  17m                kubelet, cellgen-kub-k8s-node-nf-1  MountVolume.SetUp succeeded for volume "default-token-s9rlq"
  Normal   SuccessfulMountVolume  17m                kubelet, cellgen-kub-k8s-node-nf-1  MountVolume.SetUp succeeded for volume "glusterfs"
  Normal   Pulled                 17m                kubelet, cellgen-kub-k8s-node-nf-1  Container image "quay.io/biocontainers/samtools:1.8--4" already present on machine
  Normal   Created                17m                kubelet, cellgen-kub-k8s-node-nf-1  Created container
  Normal   Started                17m                kubelet, cellgen-kub-k8s-node-nf-1  Started container

Steps to reproduce the problem

Run a pipeline that requires more resources than are available on the K8s cluster.
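As a rough check of the numbers in the report above, the short Python sketch below compares the CPUs requested by the 4-CPU process against the cluster's nominal capacity. It is only back-of-the-envelope arithmetic: it assumes all 54 CPUs on each node are allocatable, and it ignores system pods and the `irods` tasks that were running at the same time, so the real headroom would have been smaller.

```python
# Figures taken from the bug report; everything else is an assumption.
NODES = 4            # worker nodes described in the report
CPUS_PER_NODE = 54   # assumed fully allocatable (ignores system pods)
SAMPLES = 57         # one task per sample
CPUS_PER_TASK = 4    # CPU request of each task

total_cpus = NODES * CPUS_PER_NODE               # nominal cluster capacity
requested_cpus = SAMPLES * CPUS_PER_TASK         # demand if all tasks run at once
tasks_per_node = CPUS_PER_NODE // CPUS_PER_TASK  # whole tasks that fit on one node
max_concurrent = NODES * tasks_per_node          # tasks that can run in parallel
pending = max(0, SAMPLES - max_concurrent)       # tasks left waiting for CPUs

print(f"capacity={total_cpus} requested={requested_cpus}")
print(f"max concurrent tasks={max_concurrent}, left pending={pending}")
```

Even under these optimistic assumptions the demand exceeds the capacity, so at least a few pods have to stay in the Pending state until earlier tasks finish.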

Environment

  • Nextflow version: [0.30.2.4867]
  • Java version: [1.8.0_181]
  • Operating system: [Linux Buildroot 2014.02]

All 8 comments

Related to #773

A possible solution is to show a warning instead of throwing an exception when a pod is in the Unschedulable status, e.g.

     "status": {
          "phase": "Pending",
          "conditions": [
              {
                  "type": "PodScheduled",
                  "status": "False",
                  "lastProbeTime": null,
                  "lastTransitionTime": "2018-11-22T17:01:54Z",
                  "reason": "Unschedulable",
                  "message": "0/1 nodes are available: 1 Insufficient cpu."
              }
          ],
          "qosClass": "Burstable"
      }

and to add a configurable, optional timeout after which the pod is killed.
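The idea can be sketched in a few lines of Python. This is not Nextflow's actual implementation (Nextflow is written in Groovy). The function names `unschedulable_message` and `check_pod` and the `timeout_secs` parameter are hypothetical, chosen only for illustration. The sketch inspects a pod status shaped like the JSON above and returns "warn" while the pod is waiting, or "fail" once an optional timeout has been exceeded.

```python
from typing import Optional, Tuple

# Pod status copied from the example above (JSON null becomes None).
EXAMPLE_STATUS = {
    "phase": "Pending",
    "conditions": [
        {
            "type": "PodScheduled",
            "status": "False",
            "lastProbeTime": None,
            "lastTransitionTime": "2018-11-22T17:01:54Z",
            "reason": "Unschedulable",
            "message": "0/1 nodes are available: 1 Insufficient cpu.",
        }
    ],
    "qosClass": "Burstable",
}


def unschedulable_message(status: dict) -> Optional[str]:
    """Return the scheduler message if the pod is Pending and Unschedulable, else None."""
    if status.get("phase") != "Pending":
        return None
    for cond in status.get("conditions", []):
        if (cond.get("type") == "PodScheduled"
                and cond.get("status") == "False"
                and cond.get("reason") == "Unschedulable"):
            return cond.get("message", "")
    return None


def check_pod(status: dict, pending_since: float, now: float,
              timeout_secs: Optional[float] = None) -> Tuple[str, str]:
    """Decide how to treat a pod: 'ok', 'warn' (keep waiting) or 'fail' (timed out).

    timeout_secs is a hypothetical, optional setting; None means wait forever.
    """
    msg = unschedulable_message(status)
    if msg is None:
        return ("ok", "")
    if timeout_secs is not None and now - pending_since > timeout_secs:
        return ("fail", f"K8s pod cannot be scheduled after {timeout_secs}s: {msg}")
    return ("warn", f"K8s pod is waiting to be scheduled: {msg}")


if __name__ == "__main__":
    # No timeout configured: the caller would log a warning and keep polling.
    print(check_pod(EXAMPLE_STATUS, pending_since=0.0, now=600.0))
    # With a 5-minute timeout, the same pod is reported as failed.
    print(check_pod(EXAMPLE_STATUS, pending_since=0.0, now=600.0, timeout_secs=300.0))
```

Keeping the timeout optional means a pod with no timeout configured simply keeps waiting, which matches the expected behavior described in the report.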

This will be really helpful, thanks Paolo!

Hi Vlad, this is already pushed in the master. You should be able to make your own build and test it.

Great, many thanks Paolo. I would actually like to try to build it (to start coming back to NF and maybe learn some Groovy!). Do you have instructions on what's needed to build it, and on how one could contribute? (I know we discussed this at the hackathon, but maybe there are some general principles described somewhere... I've never worked with Java/Groovy before.) So, if there is something that would help me get started, that would be perfect!

Easy, the project README and CONTRIBUTING files. For everything else ping me.

Many thanks! Hahaha, ok, sorry for my ignorance, I really never scrolled down the main NF github page ;-)

Included in version 18.12.0-edge.
