Presto: How to avoid Full GC in Presto?

Created on 20 Feb 2020  路  8Comments  路  Source: prestosql/presto

Once Full GC, Presto will drop many queries at that time.
JVM Properties are as follow:

-Xmx350G
-Xms350G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UnlockExperimentalVMOptions
-XX:InitiatingHeapOccupancyPercent=40
-XX:MaxGCPauseMillis=200
-XX:MetaspaceSize=512M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/presto
-XX:+ExitOnOutOfMemoryError
-XX:ReservedCodeCacheSize=1024M

-Xloggc:/var/log/presto/presto-gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintTenuringDistribution
-XX:+PrintReferenceGC
-XX:+PrintAdaptiveSizePolicy
-XX:+PrintHeapAtGC
-XX:+PrintClassHistogramAfterFullGC
-XX:+PrintClassHistogramBeforeFullGC

-XX:-UseBiasedLocking
-XX:AutoBoxCacheMax=20000
-XX:+AlwaysPreTouch
-XX:+UseStringDeduplication

-Dcom.sun.management.jmxremote=true
-Dcom.sun.management.jmxremote.port=9015
-Dcom.sun.management.jmxremote.local.only=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false

-Dcom.sun.jndi.ldap.object.disableEndpointIdentification=true
-Djdk.nio.maxCachedBufferSize=2000000
-Djdk.attach.allowAttachSelf=true
-javaagent:/opt/jmx_prometheus_javaagent-0.12.0.jar=9998:/etc/presto/jmxexporter.yaml

-Xbootclasspath/p:/etc/alluxio/

The GC log:

2020-02-20T16:38:40.400+0800: 1165.577: [Full GC (Allocation Failure) 2020-02-20T16:38:40.400+0800: 1165.577: [Class Histogram (before full gc):
 num     #instances         #bytes  class name
----------------------------------------------
   1:       6041945   119479563192  [B 
   2:       5645159    92449902600  [J
   3:       5689449    81532678704  [I
   4:       4229508     5855555328  [Z
   5:     108443385     4337735400  org.apache.parquet.io.api.Binary$ByteArraySliceBackedBinary
   6:      13353501     2016751824  [C
   7:       2945072      890589856  [Ljava.lang.Object;
   8:         15987      867026392  [Lorg.apache.parquet.io.api.Binary;
   9:       8017310      705523280  java.util.LinkedHashMap
  10:         68130      678153176  [D
  11:       9580117      613127488  io.prestosql.spi.block.VariableWidthBlock
  12:      14536937      465181984  java.lang.String
  13:       7234162      462986368  java.util.LinkedHashMap$Entry
  14:       8025619      433916440  [Ljava.util.HashMap$Node;
  15:        348788      384492232  [Ljava.lang.String;
  16:       7825642      375630816  java.util.Collections$UnmodifiableMap
  17:       3883204      248525056  org.apache.parquet.hadoop.metadata.IntColumnChunkMetaData
  18:       4954717      237826416  java.util.HashMap$Node
  19:       4359921      209276208  java.util.LinkedList
  20:        359960      194986624  [Lio.prestosql.spi.block.Block;
  21:       2434564      194765120  org.apache.parquet.schema.PrimitiveType
  22:       4456340      178253600  java.util.LinkedList$Node
  23:       2532452      162076928  org.glassfish.jersey.internal.util.collection.KeyComparatorLinkedHashMap$Entry
  24:       3881814      155272560  org.apache.parquet.column.EncodingStats
  25:       1437724      149523296  java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask
  26:       2836597      136156656  org.eclipse.jetty.http.HttpField
  27:        427226      116205472  org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage
  28:        800405      115258320  java.net.URI
  29:       2011908      112666848  io.prestosql.spi.block.LongArrayBlock
  30:       2649124      105964960  java.util.ArrayList


2020-02-20T16:39:02.405+0800: 1187.582: [SoftReference, 0 refs, 0.0000421 secs]2020-02-20T16:39:02.405+0800: 1187.582: [WeakReference, 23138 refs, 0.0040229 secs]2020-02-20T16:39:02.409+0800: 1187.586: [FinalReference, 398 refs, 0.0002524 secs]2020-02-20T16:39:02.409+0800: 1187.586: [PhantomReference, 0 refs, 10806 refs, 0.0009909 secs]2020-02-20T16:39:02.410+0800: 1187.587: [JNI Weak Reference, 0.0037439 secs] 332G->135G(350G), 72.6502685 secs]
   [Eden: 0.0B(17.5G)->0.0B(17.5G) Survivors: 0.0B->0.0B Heap: 332.0G(350.0G)->135.8G(350.0G)], [Metaspace: 186703K->176459K(256000K)]
Heap after GC invocations=499 (full 1):
 garbage-first heap   total 367001600K, used 142347592K [0x00007ef890000000, 0x00007ef892015e00, 0x00007f5010000000)
  region size 32768K, 0 young (0K), 0 survivors (0K)
 Metaspace       used 176459K, capacity 225216K, committed 254720K, reserved 256000K
}
2020-02-20T16:39:53.051+0800: 1238.228: [Class Histogram (after full gc):
 num     #instances         #bytes  class name
----------------------------------------------
   1:       3689146    69481974544  [J
   2:       1261780    47544827216  [B
   3:       3009085    22499458120  [I
   4:       1819453     3133862920  [Z
   5:        606548      422337944  [C
   6:        345202      386081352  [Ljava.lang.Object;
   7:         83674      339502976  [Ljava.lang.String;
   8:         49606      327601208  [D
   9:       7648893      305955720  org.apache.parquet.io.api.Binary$ByteArraySliceBackedBinary
  10:        315022      147801784  [Lio.prestosql.spi.block.Block;
  11:       1798665      100725240  io.prestosql.spi.block.LongArrayBlock
  12:       1109348       70998272  io.prestosql.spi.block.VariableWidthBlock
  13:          1400       61224744  [Lorg.apache.parquet.io.api.Binary;
  14:       1060502       59388112  io.prestosql.spi.block.IntArrayBlock
  15:       1005316       56297696  io.airlift.slice.Slice
  16:        763044       36626112  io.airlift.stats.DecayCounter
  17:       1523039       36552936  java.util.concurrent.atomic.AtomicLong
  18:        950995       30431840  java.lang.String
  19:        833893       26684576  io.prestosql.spi.block.BlockBuilderStatus
  20:          1756       20542040  [[J
  21:        313047       20035008  io.prestosql.spi.block.LongArrayBlockBuilder
  22:        299678       19179392  io.prestosql.spi.block.Int128ArrayBlockBuilder
  23:        171139       18011256  [Ljava.util.HashMap$Node;
  24:        337481       16199088  java.util.HashMap$Node
  25:        262498       12599904  org.antlr.v4.runtime.atn.ATNConfig
  26:        250316       12015168  io.airlift.stats.CounterStat
  27:        233252       11196096  java.util.concurrent.ConcurrentHashMap$Node
  28:        123541       10871608  java.util.LinkedHashMap
  29:          5670       10508656  [Lorg.antlr.v4.runtime.dfa.DFAState;
  30:        128778       10302240  io.prestosql.spi.block.VariableWidthBlockBuilder

It will take more than 60s!

Most helpful comment

We'd love to hear more about your experiences with running Presto with ZGC, since this is new and we haven't heard any reports yet. Please do follow up and let us know how things go (good or bad).

All 8 comments

What Java version are you using?

Also, are you running the most recent release?

The Java version:
openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)

The Presto version is Release 310.

Couple of things I would try:

  • See if you get better performance with Java 11

  • Ideally combined with an upgrade to latest Presto 330 release, there were a few parquet and other performance improvements since 310 that might benefit you

  • remove alluxio to reduce complexity of the problem and see if that changes behavior

If the problem still persists then, you might have to tweak JVM settings for GC. E.g. there might be an issue with your metaspacesize config (e.g. see https://stackoverflow.com/questions/31075761/java-8-reserves-minimum-1g-for-metaspace-despite-maxmetaspacesize#31104372)

Thank you so much for your advice @mosabua
I have tried the things that you suggest, but the problem still exists.

Fortunately, I upgrade jdk to version 11 and use ZGC the problem has been sovled.
This(ZGC) is really an amazing gc!

Btw in case others are finding this issue.

https://wiki.openjdk.java.net/display/zgc/Main

You will want to use JDK 12+ as ZGC in JDK 11 does not support class unloading. Presto generates classes for every query, so without class unloading, you'll leak memory due these generated classes. Your heap is large enough that you might not notice this problem for a while, depending on your query patterns (generated classes are cached for a period of time).

We'd love to hear more about your experiences with running Presto with ZGC, since this is new and we haven't heard any reports yet. Please do follow up and let us know how things go (good or bad).

Was this page helpful?
0 / 5 - 0 ratings