Nextflow: LevelDB Related Crash

Created on 25 May 2017  路  12Comments  路  Source: nextflow-io/nextflow

Hi,

We have a user running nextflow workflows on our cluster who is having issues with the JVM crashing with a fatal SIGBUS error. It looks related to LevelDB doing something wrong with memory access. See trace below and gist https://gist.github.com/dctrud/78282ae087b605844ce7790f1b146e37

I guess this is a LevelDB problem rather than a nextflow one, but wondering if anyone has seen this before? Any hints how to get past this? The traces given are on a Java7 JRE, but crashes on a Java8 JRE too.

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  sun.misc.Unsafe.copyMemory(Ljava/lang/Object;JLjava/lang/Object;JJ)V+0
j  java.nio.Bits.copyFromArray(Ljava/lang/Object;JJJJ)V+42
j  java.nio.DirectByteBuffer.put([BII)Ljava/nio/ByteBuffer;+98
j  org.iq80.leveldb.util.Slice.getBytes(ILjava/nio/ByteBuffer;)V+33
j  org.iq80.leveldb.impl.MMapLogWriter.writeChunk(Lorg/iq80/leveldb/impl/LogChunkType;Lorg/iq80/leveldb/util/Slice;)V+79
j  org.iq80.leveldb.impl.MMapLogWriter.addRecord(Lorg/iq80/leveldb/util/Slice;Z)V+200
j  org.iq80.leveldb.impl.DbImpl.writeInternal(Lorg/iq80/leveldb/impl/WriteBatchImpl;Lorg/iq80/leveldb/WriteOptions;)Lorg/iq80/leveldb/Snapshot;+72
j  org.iq80.leveldb.impl.DbImpl.put([B[BLorg/iq80/leveldb/WriteOptions;)Lorg/iq80/leveldb/Snapshot;+14
j  org.iq80.leveldb.impl.DbImpl.put([B[B)V+10
j  nextflow.CacheDB.writeTaskEntry0(Lnextflow/processor/TaskHandler;)V+179
j  sun.reflect.GeneratedMethodAccessor717.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+40
J  org.codehaus.groovy.reflection.CachedMethod.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;
j  groovy.lang.MetaMethod.doMethodInvoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+9
j  groovy.lang.MetaClassImpl.invokeMethod(Ljava/lang/Class;Ljava/lang/Object;Ljava/lang/String;[Ljava/lang/Object;ZZ)Ljava/lang/Object;+1071
j  groovy.lang.MetaClassImpl.invokeMethod(Ljava/lang/Object;Ljava/lang/String;[Ljava/lang/Object;)Ljava/lang/Object;+10
J  org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(Ljava/lang/Object;Ljava/lang/String;Ljava/lang/Object;)Ljava/lang/Object;
j  org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(Ljava/lang/Object;Ljava/lang/String;Ljava/lang/Object;)Ljava/lang/Object;+7
j  nextflow.CacheDB$_putTaskAsync_closure1.doCall(Ljava/lang/Object;)Ljava/lang/Object;+23
j  nextflow.CacheDB$_putTaskAsync_closure1.call(Ljava/lang/Object;)Ljava/lang/Object;+2
j  groovyx.gpars.agent.AgentBase.onMessage(Lgroovy/lang/Closure;)V+45
j  groovyx.gpars.agent.Agent.handleMessage(Ljava/lang/Object;)V+12
j  groovyx.gpars.agent.AgentCore$1.handleMessage(Ljava/lang/Object;)V+5
j  groovyx.gpars.util.AsyncMessagingCore.run()V+51
j  java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+95
j  java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5
j  java.lang.Thread.run()V+11
v  ~StubRoutines::call_stub

Most helpful comment

For reference from link below, it seems the underlying issue may be that leveldb uses ftruncate, not fallocate before mmap. fallocate would give an error r.e. no space - but ftruncate does not (it's lazy), and when mmap happens the file may not be the expected size, so there is then a SIGBUS as the mmap doesn't match the file on disk.

https://groups.google.com/forum/#!topic/leveldb/9VaANZvbYlk

All 12 comments

This is very unusual. LevelDB is very stable and I've never seen such stack trace before. This makes me think that the pipeline it trying to store, for some reason, an object to big to be contained in memory.

A possible reason could be the use of unneeded variables in the script context. For example:

process foo {
  input:
  file genome from ch 

  script: 
  bar = genome.name 
  """  
  echo $bar
  """
}

Variable declared like bar enter in the script binding context and they are saved in the cache DB. My speculation is that, if for some reason such variable is assigned a big value, that could cause that problem.

To avoid this you can try to declare all variables like the above as local eg. def bar = something.

I'm the user that's having trouble with our nextflow scripts. I don't think there are any unneeded variables in the script context.

The script i'm running is up on GitHub here (https://github.com/c-guzman/rnaseq-nf), i've combed it quickly, but couldn't find anything specific. I know that deeptools has been known to take up quite a bit of RAM and cache space, but I don't know if this is the problem.

There are multiple entries reporting that's a bug causing Java to crash when local /tmp is full. See here and here. Though the bug is reported as fixed I would check if it could be the problem.

Also it happens in a systematic manner? is there any way to reproduce it?

From the system-side of things... Nextflow is being launched at the login node of our cluster, and submitting jobs into the SLURM queue for process to run on cluster nodes.

The trace is showing the head node (where Nextflow runs) has plenty of free RAM at time of crash:

physical 132055256k(91508136k free)

The /tmp isn't full at all:

/dev/sda3 61G 224M 57G 1% /tmp

Nothing unusual in the dmesg or syslog.

Real head scratcher.

  1. Is this a new pipeline?
  2. If no since when it started to raise this issue?
  3. may it be related to some specific dataset ?
  4. are you able to replicate the issue ?

Maybe this is a disk space issue - but not /tmp. Where does Nextflow keep the LevelDB file?

@c-guzman - I'm thinking about this as you are close to (group) quota on the project filesystem. Maybe rule this out first, by running on the work filesystem where you have space? Can you try this and if still an issue then it's worth following up with @pditommaso's questions above.

@pditommaso those questions I'll have to leave to Carlos. I'll jump in if there is any other system specific stuff I might be able to ask/assist checking.

Level DB files are stored in the pipeline launching directory under the .nextflow directory

Also it could be useful to have the Java core dump file for a post-mortem analysis. Stating this line is not produced by the current ulimit setting. See also here.

@dctrud @pditommaso I ran the workflow in our /work directory as suggested by David and the pipeline completed successfully. Looks like this is a disk space issue.

@pditommaso thanks for pointing toward a disk space issue - I would not have guessed that from a SIGBUS error. Closing this.

For reference from link below, it seems the underlying issue may be that leveldb uses ftruncate, not fallocate before mmap. fallocate would give an error r.e. no space - but ftruncate does not (it's lazy), and when mmap happens the file may not be the expected size, so there is then a SIGBUS as the mmap doesn't match the file on disk.

https://groups.google.com/forum/#!topic/leveldb/9VaANZvbYlk

Was this page helpful?
0 / 5 - 0 ratings

Related issues

MaxUlysse picture MaxUlysse  路  3Comments

lindenb picture lindenb  路  3Comments

stevekm picture stevekm  路  5Comments

apeltzer picture apeltzer  路  7Comments

ewels picture ewels  路  6Comments