Kotlin-dsl-samples: KT-28037 - Memory leak in Kotlin Gradle Plugin when using `in-process` strategy

Created on 1 Nov 2018 · 16 Comments · Source: gradle/kotlin-dsl-samples

Today I noticed an Out of memory: Metaspace error on our CI: https://builds.gradle.org/viewLog.html?buildId=17204292&buildTypeId=Gradle_Check_Platform_Java10_Oracle_Windows_buildInit

After analyzing the dump https://drive.google.com/open?id=1ejlar1v47BOo7iVXE0bHrxEBANmuCilw it looks like a memory leak in Kotlin. I'm not 100% sure - @eskatos and @bamboo may have more insight on this.

There are several suspicious classloaders, each holding 40000+ classes:

image

Almost all of them are reachable from a GC root via Kotlin classes. In addition, these failures happen in the Kotlin init tests, so I think it might be related to the Kotlin DSL.

image

Labels: bug, jetbrains, kt-compiler, performance

All 16 comments

image

My guess is:

Somehow, a Daemon worker thread loads the ConcurrentWeakKeySoftValueHashMap class, and a strong reference to ConcurrentWeakKeySoftValueHashMap$HardKey is then retained in the worker thread's threadLocals field via this line, because the first get() call sets the initialValue. As long as the worker thread exists, this path to a GC root will exist. I still don't know where ConcurrentWeakKeySoftValueHashMap comes from.
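To make the retention mechanism above concrete, here is a minimal Java sketch. It is not the Kotlin compiler's code: `HardKeyStandIn` and `KEY` are hypothetical names standing in for the real key class and its ThreadLocal. It only shows that the first get() creates the initial value and stores it in the calling thread's map, where it stays until remove() is called or the thread dies.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class ThreadLocalRetentionDemo {
    // Counts how often the initial value is created (illustration only).
    static final AtomicInteger CREATED = new AtomicInteger();

    // Hypothetical stand-in for a per-thread key object such as HardKey.
    static final class HardKeyStandIn { }

    // The supplier runs lazily, on the first get() of each thread.
    static final ThreadLocal<HardKeyStandIn> KEY = ThreadLocal.withInitial(() -> {
        CREATED.incrementAndGet();
        return new HardKeyStandIn();
    });

    public static void main(String[] args) {
        HardKeyStandIn first = KEY.get();   // creates the value and stores it in this thread's map
        HardKeyStandIn second = KEY.get();  // returns the instance already stored for this thread
        System.out.println("same instance retained: " + (first == second));

        KEY.remove();                       // drops the entry from this thread's map
        HardKeyStandIn third = KEY.get();   // a fresh value is created again
        System.out.println("new instance after remove: " + (first != third));
    }
}
```

In a long-lived daemon worker thread nothing calls remove(), so the stored value, its class, and that class's classloader can all stay reachable from the thread.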

Just FYI, the heap dump isn't publicly visible.

@JLLeitschuh Thanks, I'm aware of this - it seems I can't share it publicly, as I couldn't find the option. I'm happy to share it with anyone who requests access.

Using ThreadLocal without calling remove when you are done is a bug. You either need a lifecycle where you call remove at some point (e.g. when compilation is finished) or you can't use a ThreadLocal. There are several more instances of this problem in the Kotlin plugin/compiler. This will become a bigger issue in Gradle 5.0, since we now limit metaspace by default.

We may want to consider this a blocker for 5.0 if this affects the Kotlin DSL
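A minimal sketch of the lifecycle described above, in Java. The names `CONTEXT` and `withContext` are hypothetical and not taken from the Kotlin plugin or compiler; the point is only that set() and remove() are paired in a try/finally around the unit of work (for example, one compilation).

```java
import java.util.function.Supplier;

final class ScopedThreadLocalDemo {
    // Hypothetical per-compilation state; deliberately has no initial value.
    static final ThreadLocal<StringBuilder> CONTEXT = new ThreadLocal<>();

    // Runs the work with a fresh context and always removes it afterwards,
    // even when the work throws.
    static <T> T withContext(Supplier<T> work) {
        CONTEXT.set(new StringBuilder());
        try {
            return work.get();
        } finally {
            CONTEXT.remove();
        }
    }

    public static void main(String[] args) {
        String result = withContext(() -> {
            CONTEXT.get().append("compiled");
            return CONTEXT.get().toString();
        });
        System.out.println(result);
        // Nothing is left behind on this thread once the work is done.
        System.out.println("context after work: " + CONTEXT.get());
    }
}
```

With this shape the thread never holds the value beyond the scope of the work, so a pooled or daemon thread cannot pin the value's classloader.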

A quick option that doesn't require a new Kotlin release: Reflectively remove all the threadlocals when the build session ends.
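One way the reflective cleanup could look, as a rough Java sketch only. It is an assumption about how such a workaround might be written, not what Gradle actually does. It relies on the private JDK field Thread.threadLocals, which is an implementation detail; on Java 16 and later the access is refused unless the JVM is started with --add-opens java.base/java.lang=ALL-UNNAMED, so the hypothetical helper `clearThreadLocals` reports failure instead of throwing.

```java
import java.lang.reflect.Field;

final class ThreadLocalReflectiveCleanup {
    // Tries to drop every ThreadLocal entry of the given thread by nulling
    // the private Thread.threadLocals field. Returns false when the JVM
    // refuses reflective access or the field does not exist.
    // Assumption: the thread is the current one or is idle; clearing the map
    // of a thread that is actively using ThreadLocals is racy.
    static boolean clearThreadLocals(Thread thread) {
        try {
            Field field = Thread.class.getDeclaredField("threadLocals");
            field.setAccessible(true);
            field.set(thread, null);
            return true;
        } catch (ReflectiveOperationException | RuntimeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ThreadLocal<String> local = ThreadLocal.withInitial(() -> "initial");
        local.set("leaked");
        boolean cleared = clearThreadLocals(Thread.currentThread());
        // When clearing worked, the next get() recreates the initial value.
        System.out.println("cleared=" + cleared + ", value=" + local.get());
    }
}
```

This removes every ThreadLocal on the thread, not just the leaking ones, which is why it only makes sense at a boundary such as the end of a build session.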

Two questions:

  1. Isn't the Kotlin Compiler run in its own daemon and, as such, inaccessible with reflection?
  2. Is there a YouTrack issue where this problem has been reported?

@JLLeitschuh, a daemon is reused across several build sessions, and the end of each build session can be handled inside the daemon. There's no YouTrack issue yet AFAIK, but the Kotlin folks are aware of this one.

@oehme, it affects the Gradle Kotlin DSL when using the kotlin-dsl plugin, e.g. in buildSrc. It also affects all builds using the kotlin-gradle-plugin.

The leak is in the main Gradle daemon though, not in the compiler worker. So there must be some Kotlin code called in the main daemon.

Issue in the Kotlin issue tracker: https://youtrack.jetbrains.com/issue/KT-28037

@eskatos A few of these classloaders are suspicious: "ClassLoaderScopeIdentifier.Id{root:C:\tcagent1\work\668602365d1521fc\subprojects\build-init\build\tmp\test files\unknown-test-class\q4upj\some-thing\buildSrc:root-project(export)}".

Why is the main daemon creating a classloader scope for something that is clearly the output of a test?

Edit: Never mind, the heap dump is not from the main daemon - it's from a test VM running the embedded Gradle executer. That also explains the many classloaders with the same Kotlin version - many different projects are being built by the same process.

FWIW and AFAIK the failure happened on Windows only on our CI:

  • on master

    • once on October 29th

    • twice on November 1st (1, 2)

  • on release

    • once on October 31st

Note that the above are only the failures in the buildInit project. I've seen others, including our main daemon failing.

Another instance
https://builds.gradle.org/viewLog.html?buildId=17396138&buildTypeId=Gradle_Check_Platform_Java10_Oracle_Windows_buildInit

That's the :buildInit integration tests again. I just noticed that some tests in there set -Dkotlin.compiler.execution.strategy=in-process, which creates a leak in the Gradle daemon through the kotlin-gradle-plugin's use of the Kotlin compiler. Because those daemons run several different builds, the Kotlin compiler classes are loaded from many different classloaders.

@oehme there are also tests in :modelCore that set that problematic property. Is that where you also observed the leak?

No more failures due to metaspace exhaustion have been observed since I removed the in-process strategy.
