Once Full GC, Presto will drop many queries at that time.
JVM Properties are as follow:
-Xmx350G
-Xms350G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UnlockExperimentalVMOptions
-XX:InitiatingHeapOccupancyPercent=40
-XX:MaxGCPauseMillis=200
-XX:MetaspaceSize=512M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/presto
-XX:+ExitOnOutOfMemoryError
-XX:ReservedCodeCacheSize=1024M
-Xloggc:/var/log/presto/presto-gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintTenuringDistribution
-XX:+PrintReferenceGC
-XX:+PrintAdaptiveSizePolicy
-XX:+PrintHeapAtGC
-XX:+PrintClassHistogramAfterFullGC
-XX:+PrintClassHistogramBeforeFullGC
-XX:-UseBiasedLocking
-XX:AutoBoxCacheMax=20000
-XX:+AlwaysPreTouch
-XX:+UseStringDeduplication
-Dcom.sun.management.jmxremote=true
-Dcom.sun.management.jmxremote.port=9015
-Dcom.sun.management.jmxremote.local.only=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.jndi.ldap.object.disableEndpointIdentification=true
-Djdk.nio.maxCachedBufferSize=2000000
-Djdk.attach.allowAttachSelf=true
-javaagent:/opt/jmx_prometheus_javaagent-0.12.0.jar=9998:/etc/presto/jmxexporter.yaml
-Xbootclasspath/p:/etc/alluxio/
The GC log:
2020-02-20T16:38:40.400+0800: 1165.577: [Full GC (Allocation Failure) 2020-02-20T16:38:40.400+0800: 1165.577: [Class Histogram (before full gc):
num #instances #bytes class name
----------------------------------------------
1: 6041945 119479563192 [B
2: 5645159 92449902600 [J
3: 5689449 81532678704 [I
4: 4229508 5855555328 [Z
5: 108443385 4337735400 org.apache.parquet.io.api.Binary$ByteArraySliceBackedBinary
6: 13353501 2016751824 [C
7: 2945072 890589856 [Ljava.lang.Object;
8: 15987 867026392 [Lorg.apache.parquet.io.api.Binary;
9: 8017310 705523280 java.util.LinkedHashMap
10: 68130 678153176 [D
11: 9580117 613127488 io.prestosql.spi.block.VariableWidthBlock
12: 14536937 465181984 java.lang.String
13: 7234162 462986368 java.util.LinkedHashMap$Entry
14: 8025619 433916440 [Ljava.util.HashMap$Node;
15: 348788 384492232 [Ljava.lang.String;
16: 7825642 375630816 java.util.Collections$UnmodifiableMap
17: 3883204 248525056 org.apache.parquet.hadoop.metadata.IntColumnChunkMetaData
18: 4954717 237826416 java.util.HashMap$Node
19: 4359921 209276208 java.util.LinkedList
20: 359960 194986624 [Lio.prestosql.spi.block.Block;
21: 2434564 194765120 org.apache.parquet.schema.PrimitiveType
22: 4456340 178253600 java.util.LinkedList$Node
23: 2532452 162076928 org.glassfish.jersey.internal.util.collection.KeyComparatorLinkedHashMap$Entry
24: 3881814 155272560 org.apache.parquet.column.EncodingStats
25: 1437724 149523296 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask
26: 2836597 136156656 org.eclipse.jetty.http.HttpField
27: 427226 116205472 org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage
28: 800405 115258320 java.net.URI
29: 2011908 112666848 io.prestosql.spi.block.LongArrayBlock
30: 2649124 105964960 java.util.ArrayList
2020-02-20T16:39:02.405+0800: 1187.582: [SoftReference, 0 refs, 0.0000421 secs]2020-02-20T16:39:02.405+0800: 1187.582: [WeakReference, 23138 refs, 0.0040229 secs]2020-02-20T16:39:02.409+0800: 1187.586: [FinalReference, 398 refs, 0.0002524 secs]2020-02-20T16:39:02.409+0800: 1187.586: [PhantomReference, 0 refs, 10806 refs, 0.0009909 secs]2020-02-20T16:39:02.410+0800: 1187.587: [JNI Weak Reference, 0.0037439 secs] 332G->135G(350G), 72.6502685 secs]
[Eden: 0.0B(17.5G)->0.0B(17.5G) Survivors: 0.0B->0.0B Heap: 332.0G(350.0G)->135.8G(350.0G)], [Metaspace: 186703K->176459K(256000K)]
Heap after GC invocations=499 (full 1):
garbage-first heap total 367001600K, used 142347592K [0x00007ef890000000, 0x00007ef892015e00, 0x00007f5010000000)
region size 32768K, 0 young (0K), 0 survivors (0K)
Metaspace used 176459K, capacity 225216K, committed 254720K, reserved 256000K
}
2020-02-20T16:39:53.051+0800: 1238.228: [Class Histogram (after full gc):
num #instances #bytes class name
----------------------------------------------
1: 3689146 69481974544 [J
2: 1261780 47544827216 [B
3: 3009085 22499458120 [I
4: 1819453 3133862920 [Z
5: 606548 422337944 [C
6: 345202 386081352 [Ljava.lang.Object;
7: 83674 339502976 [Ljava.lang.String;
8: 49606 327601208 [D
9: 7648893 305955720 org.apache.parquet.io.api.Binary$ByteArraySliceBackedBinary
10: 315022 147801784 [Lio.prestosql.spi.block.Block;
11: 1798665 100725240 io.prestosql.spi.block.LongArrayBlock
12: 1109348 70998272 io.prestosql.spi.block.VariableWidthBlock
13: 1400 61224744 [Lorg.apache.parquet.io.api.Binary;
14: 1060502 59388112 io.prestosql.spi.block.IntArrayBlock
15: 1005316 56297696 io.airlift.slice.Slice
16: 763044 36626112 io.airlift.stats.DecayCounter
17: 1523039 36552936 java.util.concurrent.atomic.AtomicLong
18: 950995 30431840 java.lang.String
19: 833893 26684576 io.prestosql.spi.block.BlockBuilderStatus
20: 1756 20542040 [[J
21: 313047 20035008 io.prestosql.spi.block.LongArrayBlockBuilder
22: 299678 19179392 io.prestosql.spi.block.Int128ArrayBlockBuilder
23: 171139 18011256 [Ljava.util.HashMap$Node;
24: 337481 16199088 java.util.HashMap$Node
25: 262498 12599904 org.antlr.v4.runtime.atn.ATNConfig
26: 250316 12015168 io.airlift.stats.CounterStat
27: 233252 11196096 java.util.concurrent.ConcurrentHashMap$Node
28: 123541 10871608 java.util.LinkedHashMap
29: 5670 10508656 [Lorg.antlr.v4.runtime.dfa.DFAState;
30: 128778 10302240 io.prestosql.spi.block.VariableWidthBlockBuilder
It will take more than 60s!
What Java version are you using?
Also, are you running the most recent release?
The Java version:
openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
The Presto version is Release 310.
Couple of things I would try:
See if you get better performance with Java 11
Ideally combined with an upgrade to latest Presto 330 release, there were a few parquet and other performance improvements since 310 that might benefit you
remove alluxio to reduce complexity of the problem and see if that changes behavior
If the problem still persists then, you might have to tweak JVM settings for GC. E.g. there might be an issue with your metaspacesize config (e.g. see https://stackoverflow.com/questions/31075761/java-8-reserves-minimum-1g-for-metaspace-despite-maxmetaspacesize#31104372)
Thank you so much for your advice @mosabua
I have tried the things that you suggest, but the problem still exists.
Fortunately, I upgrade jdk to version 11 and use ZGC the problem has been sovled.
This(ZGC) is really an amazing gc!
Btw in case others are finding this issue.
You will want to use JDK 12+ as ZGC in JDK 11 does not support class unloading. Presto generates classes for every query, so without class unloading, you'll leak memory due these generated classes. Your heap is large enough that you might not notice this problem for a while, depending on your query patterns (generated classes are cached for a period of time).
We'd love to hear more about your experiences with running Presto with ZGC, since this is new and we haven't heard any reports yet. Please do follow up and let us know how things go (good or bad).
Most helpful comment
We'd love to hear more about your experiences with running Presto with ZGC, since this is new and we haven't heard any reports yet. Please do follow up and let us know how things go (good or bad).