Graal: GraalVM: JVMCI-native CompilerThreads are RUNNABLE but not get processed

Created on 24 Sep 2020 · 6Comments · Source: oracle/graal

We were running some Presto jobs on GraalVM and we found on 1 of the worker node, Graal stopped compiling new methods for more than 1 day. From thread dump, we saw all JVMCI-native CompilerThreads were at RUNNABLE but not consume any CPU ('cpu' metric from thread dump stayed the same and only 'elapsed' increased). The issue seemed only apply to JVMCI-native CompilerThreads. C1 CompilerThread was normal and tiered 3 C1 compilations were triggered for those new methods. The worker process had a few thousands of threads, and JVMCI-native CompilerThreads and C1 CompilerThread seem have the same thread priority.

We don't have a way to reproduce the issue locally.

GraalVM version: GraalVM CE 20.2.0
JDK Major Version: 11
OS: CentOS
Architecture: AMD64

2020-09-23 11:04:29
Full thread dump OpenJDK 64-Bit Server VM (11.0.8+10-jvmci-20.2-b03 mixed mode):
...
"JVMCI-native CompilerThread0" #5 daemon prio=9 os_prio=0 cpu=24637125.92ms elapsed=494818.98s tid=0x00007fc2cbb1c390 nid=0x260 runnable  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE
   Compiling: 3454239       4       com.facebook.presto.$gen.PageProjectionWork_query2_2_0_42_542319$$Lambda$124339/0x00007f92ca2c84c0::apply (20 bytes)

   Locked ownable synchronizers:
    - None

"C1 CompilerThread0" #6 daemon prio=9 os_prio=0 cpu=4648829.46ms elapsed=494818.98s tid=0x00007fc2cbb1ddc0 nid=0x261 runnable  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE
   No compile task

   Locked ownable synchronizers:
    - None

"JVMCI-native CompilerThread1" #186307 daemon prio=9 os_prio=0 cpu=1534.03ms elapsed=67290.49s tid=0x00007f93250e46e0 nid=0x112610 runnable  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE
   Compiling: 3453174       4       com.facebook.presto.$gen.PageProjectionWork_query1_3_0_2_541981::evaluate (211 bytes)

   Locked ownable synchronizers:
    - None

"JVMCI-native CompilerThread2" #186329 daemon prio=9 os_prio=0 cpu=385.10ms elapsed=67236.36s tid=0x00007f9324b184c0 nid=0x112690 runnable  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE
   Compiling: 3454277       4       com.facebook.presto.$gen.PageProjectionWork_query2_2_0_42_542318::evaluate (105 bytes)

   Locked ownable synchronizers:
    - None

"JVMCI-native CompilerThread3" #186330 daemon prio=9 os_prio=0 cpu=286.95ms elapsed=67235.94s tid=0x00007f8b68159d20 nid=0x112696 runnable  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE
   Compiling: 3454279       4       com.facebook.presto.$gen.PageProjectionWork_query2_2_0_42_542319::evaluate (1721 bytes)

   Locked ownable synchronizers:
    - None

"JVMCI-native CompilerThread4" #186331 daemon prio=9 os_prio=0 cpu=1.54ms elapsed=67235.57s tid=0x00007f8b6404a950 nid=0x1126ae runnable  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE
   Compiling: 3454288       4       com.facebook.presto.$gen.ArrayTransform_542329::transform (167 bytes)

   Locked ownable synchronizers:
    - None

"JVMCI-native CompilerThread5" #186332 daemon prio=9 os_prio=0 cpu=0.72ms elapsed=67235.56s tid=0x00007f8b640da4f0 nid=0x1126af runnable  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE
   Compiling: 3454281       4       com.facebook.presto.$gen.PageProjectionWork_query2_2_0_42_542325::getexpr$cse (132 bytes)

   Locked ownable synchronizers:
    - None

bug

Source

lic9

👍1

All 6 comments

Is this related to https://github.com/openjdk/jdk/pull/35? we dont use -Xcomp or -Xbatch though.

lic9 on 24 Sep 2020

That looks like a deadlock within libgraal. Is it possible to attache gdb to this process to get the native stack trace of all threads?
@christianwimmer, @tkrodriguez , got any other suggestions on how to diagnose "stuck" libgraal threads?

dougxc on 24 Sep 2020

👍1

Thanks @dougxc ! I did try get some gdb info, but 'bt' didn't work for some reason. I'll see if I can find a way to reproduce it.

 3497 Thread 0x7f16a13db700 (LWP 549) "JVMCI-native Co" 0x00007f45ae789a35 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
  3496 Thread 0x7f16a11da700 (LWP 550) "C1 CompilerThre" 0x00007f45ae789de2 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0

(gdb) thread 3497
[Switching to thread 3497 (Thread 0x7f16a13db700 (LWP 549))]
#0  0x00007f45ae789a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7f16a13d9428:

lic9 on 24 Sep 2020

There isn't really a way to know what's happening without a stack trace. gcore'ing the process so we have a permanent copy of the state at the crash would be good before investigating the live process. Are you running with -Xbatch? If you aren't I don't think it can be related to openjdk/jdk#35.

tkrodriguez on 24 Sep 2020

There isn't really a way to know what's happening without a stack trace. gcore'ing the process so we have a permanent copy of the state at the crash would be good before investigating the live process. Are you running with -Xbatch? If you aren't I don't think it can be related to openjdk/jdk#35.

No, we use background compilation. So for https://github.com/openjdk/jdk/pull/35, it only unblocks the Java thread from a non-progress JVMCI thread, but the JVMCI thread will stay stuck?

The job we have did not crash. It's no Graal compilation progress, and all code running on C1. I'll try find some way to get the stack trace.

lic9 on 24 Sep 2020

No, everyone gets properly unblocked with that fix but it will only occur if -Xbatch or -Xcomp was used. Otherwise the regular Java threads never wait for compiler threads to complete and the JVMCI native threads shouldn't be able to deadlock each other in the same way. But maybe we have lock ordering problem in libgraal that we haven't thought of. Anyway, we'd need a native stack trace to understand what happened.

tkrodriguez on 25 Sep 2020

👍1

Was this page helpful?

0 / 5 - 0 ratings