Presto: Spilling fails - failed to read spilled pages

Created on 31 Jul 2017 · 4 Comments · Source: prestodb/presto

When running a query with spilling enabled we get an error:

"Query 20170731_153056_00334_yexgm failed: Failed to read spilled pages"

We're running this on EMR 5.7.0, which ships Presto 0.170. What could be the possible reasons? What information could I provide to help investigate this? All parameters are at the defaults set by EMR 5.7.0, but I can provide specifics if needed.

Thank you

All 4 comments

It would be useful to see the stack trace from the worker that failed to read the spilled pages. You can find the failed task in the web UI, then go to the worker that ran that task and get its logs.

@nezihyigitbasi here's the stack trace:

com.facebook.presto.spi.PrestoException: Failed to read spilled pages
    at com.facebook.presto.spiller.BinaryFileSpiller.readPages(BinaryFileSpiller.java:120)
    at com.facebook.presto.spiller.BinaryFileSpiller.lambda$getSpills$1(BinaryFileSpiller.java:108)
    at java.util.stream.IntPipeline$4$1.accept(IntPipeline.java:250)
    at java.util.stream.Streams$RangeIntSpliterator.forEachRemaining(Streams.java:110)
    at java.util.Spliterator$OfInt.forEachRemaining(Spliterator.java:693)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
    at com.facebook.presto.spiller.BinaryFileSpiller.getSpills(BinaryFileSpiller.java:109)
    at com.facebook.presto.operator.aggregation.builder.SpillableHashAggregationBuilder.mergeFromDisk(SpillableHashAggregationBuilder.java:256)
    at com.facebook.presto.operator.aggregation.builder.SpillableHashAggregationBuilder.buildResult(SpillableHashAggregationBuilder.java:187)
    at com.facebook.presto.operator.HashAggregationOperator.getOutput(HashAggregationOperator.java:438)
    at com.facebook.presto.operator.Driver.processInternal(Driver.java:303)
    at com.facebook.presto.operator.Driver.lambda$processFor$6(Driver.java:234)
    at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:537)
    at com.facebook.presto.operator.Driver.processFor(Driver.java:229)
    at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:622)
    at com.facebook.presto.execution.TaskExecutor$PrioritizedSplitRunner.process(TaskExecutor.java:624)
    at com.facebook.presto.execution.TaskExecutor$Runner.run(TaskExecutor.java:776)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: /tmp/presto/spills/presto-spill1467464677085147065/140.bin (Too many open files)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at com.facebook.presto.spiller.BinaryFileSpiller.readPages(BinaryFileSpiller.java:115)
    ... 23 more

It seems the cause is too many open files. This thread (https://groups.google.com/forum/#!topic/presto-users/-N_-5-J0cSE) is most likely related. I will try to figure out how to increase the limit on EMR to see if the issue goes away. Thank you for your help! Will close if it works fine.

Yeah that's the reason: Too many open files. Bump up the max file descriptors limit as mentioned in that e-mail and please re-open if that doesn't solve your problem.
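
Before and after raising the limit, it can help to confirm what limit the worker process actually has and how many descriptors it holds. The following is a minimal sketch, assuming a Linux node where /proc/<pid>/limits and /proc/<pid>/fd are readable. The function names are hypothetical and not part of Presto, and the example inspects its own process rather than a real worker.

```python
import os
from pathlib import Path


def parse_open_files_limit(limits_text):
    """Return (soft, hard) from the 'Max open files' row; None means unlimited."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Row layout: Max open files <soft> <hard> files
            soft, hard = line.split()[3:5]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max open files' row found")


def open_files_status(pid):
    """Read the descriptor limit and current descriptor count for a process (Linux only)."""
    proc = Path("/proc") / str(pid)
    soft, hard = parse_open_files_limit((proc / "limits").read_text())
    in_use = len(os.listdir(proc / "fd"))
    return {"soft": soft, "hard": hard, "in_use": in_use}


if __name__ == "__main__":
    # Inspects this Python process; substitute the Presto worker's PID.
    # Reading another user's /proc/<pid>/fd usually requires root.
    print(open_files_status(os.getpid()))
```

If in_use is close to soft while a spilling query runs, the limit is the likely bottleneck. The spill file name in the trace (140.bin) suggests one query can hold many spill files at once.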

Adding

presto soft nofile 32768
presto hard nofile 65536
presto soft nproc 32768
presto hard nproc 65536

to /etc/security/limits.conf on every node solved our issue. On EMR this can be done with a bootstrap action script.
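
For anyone scripting this, here is a rough sketch of the kind of step a bootstrap action could run on each node. It appends the four lines above to a limits file only if they are not already present, so re-running it does not duplicate entries. The function name and the idea of passing the path as an argument are assumptions for illustration; the values and the presto user name are taken from the comment above. Writing to /etc/security/limits.conf needs root, and the new limits apply only to sessions started after the change.

```python
from pathlib import Path

# Lines quoted from the thread; "presto" is the user named there.
LIMIT_LINES = [
    "presto soft nofile 32768",
    "presto hard nofile 65536",
    "presto soft nproc 32768",
    "presto hard nproc 65536",
]


def ensure_limits(conf_path, lines=LIMIT_LINES):
    """Append any missing limit lines to conf_path and return the lines added."""
    path = Path(conf_path)
    text = path.read_text() if path.exists() else ""
    # Compare on whitespace-split fields so spacing differences do not matter.
    present = {tuple(line.split()) for line in text.splitlines()}
    missing = [line for line in lines if tuple(line.split()) not in present]
    if missing:
        # Start on a fresh line if the file does not end with a newline.
        prefix = "" if text == "" or text.endswith("\n") else "\n"
        with path.open("a") as f:
            f.write(prefix + "\n".join(missing) + "\n")
    return missing


if __name__ == "__main__":
    # Hypothetical invocation; run as root on each node.
    print(ensure_limits("/etc/security/limits.conf"))
```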
