Kotlinx.coroutines: Possible bug with `ThreadLocal.asContextElement`

Created on 15 Feb 2019 · 33Comments · Source: Kotlin/kotlinx.coroutines

Hello,
I'm observing a strange behavior with the following code:

    val t: ThreadLocal<String> = ThreadLocal.withInitial { "a" }

    @Test
    fun testContext1() = runBlocking {
        println(t.get())
        withContext(coroutineContext.plus(t.asContextElement("b"))) {
            println(t.get())
        }
        println(t.get())
    }

    @Test
    fun testContext2() = runBlocking {
        println(t.get())
        withContext(coroutineContext.plus(t.asContextElement("b"))) {
            println(t.get())
            delay(Random.nextLong(100))
            println(t.get())
        }
        println(t.get())
    }

    @Test
    fun testContext3() = runBlocking {
        println(t.get())
        async(coroutineContext.plus(t.asContextElement("b"))) {
            println(t.get())
            delay(Random.nextLong(100))
            println(t.get())
        }.await()
        println(t.get())
    }

The result is:

testContext1 prints aba - expected
testContext2 prints abbb - It looks like the value is not restored for some reason. Is this a bug? What's the explanation?
testContext3 prints abba - expected

Please help!

I'm using org.jetbrains.kotlinx:kotlinx-coroutines-core:1.1.1

bug

Source

alamothe

👍5

Most helpful comment

We assume that kotlinx-coroutines-slf4j is used in the following way. Consider the application that is using SLF4J for logging, uses coroutines, and uses MDC. This application adds kotlinx-coroutines-slf4j to its dependencies and adds MDCContext(emptyMap()) to its top-level scope to give a "signal" to coroutine machinery that it is going to use MDCContext. After that, its MDC will be always correctly set/restored and reset and switching between different coroutines.

Until there's a new release, it would be great to have a clear example of what this "expected usage" looks like. Specifically, the documentation for kotlinx-coroutines-slf4j is extremely short and could benefit from a warning about the caveat caused by this bug and an example of a workaround. Might be best tracked as a separate issue, not sure.

luispollo on 28 Feb 2021

👍3

All 33 comments

It is not exactly a bug, but a side-effect of the way asContextElement is implemented. See, your main coroutine (in runBlocking) does not have t's context element, so this is interpreted as "main coroutine do not care about this thread local". So then, depending on specific ways you get back to the main tread, the value of t can be either restored or not restored for the main coroutine.

If you write runBlocking(t.asContextElement("a")) { ... } in all your tests, then they all are going to run as expected.

Also note, then when you use withContext function you don't have to write withContext(coroutineContext.plus(other)) { ... }, because that merging of current context with the other context is exactly what withContext does (see its docs), so you can simply write withContext(other) { ... }.

Does it help?

elizarov on 15 Feb 2019

Hey @elizarov thank you for your reply & the tip for withContext.

I changed the test like this:

    @Test
    fun testContext2() {
        println(t.get())
        runBlocking {
            println(t.get())
            withContext(t.asContextElement("b")) {
                println(t.get())
                delay(Random.nextLong(100))
                println(t.get())
            }
            println(t.get())
        }
        println(t.get())
    }

It prints a a b b b a.

Are you basically saying that, between the outermost asContextElement and the end of the coroutine, the value of ThreadLocal is undefined? I find that very troubling. In my opinion, such undefined behavior could lead to disastrous bugs.

How about after the coroutine? Is it guaranteed to be restored?

alamothe on 16 Feb 2019

Consider also this case where the value depends on previous value (not uncommon):

    val tInt: ThreadLocal<Int> = ThreadLocal.withInitial { 0 }

    @Test
    fun testContext4() {
        println(tInt.get())
        runBlocking {
            println(tInt.get())
            withContext(tInt.asContextElement(tInt.get() + 1)) {
                println(tInt.get())
                // Comment for other behavior
                delay(Random.nextLong(100))
            }
            println(tInt.get())
            withContext(tInt.asContextElement(tInt.get() + 1)) {
                println(tInt.get())
                // Comment for other behavior
                delay(Random.nextLong(100))
            }
            println(tInt.get())
        }
        println(tInt.get())
    }

So now in the second withContext, the value could be both 1 or 2. In practice, withContext can be deep in the stack and have no idea if it is outermost asContextElement or not.

alamothe on 16 Feb 2019

If I understand correctly, the only workaround is to remember to set as a context element every time a coroutine is started. But what if that's happening at 100 places (not uncommon, consider the codebase which is migrating to coroutines)

alamothe on 16 Feb 2019

The design principle of t.asContextElement(value) is this -- a coroutine with this element _in its context_ will have value set in the corresponding thread local. Indeed, there are no guarantees for coroutines that do have this element in their context, as they are not supposed to ever care about it. If they do care about this thread-local, they should have it in their context.

You don't need to remember to always set it if you use the structured concurrency approach and always launch your coroutines as children of other coroutines. Once you set the corresponding element in the context of your top-level scope, it is going to be inherited by all your coroutines.

elizarov on 22 Feb 2019

You don't need to remember to always set it if you use the structured concurrency approach and always launch your coroutines as children of other coroutines. Once you set the corresponding element in the context of your top-level scope, it is going to be inherited by all your coroutines.

What I was referring to is usage in a servlet container. In this case, each request will start outside of a coroutine (as a separate thread), and then a coroutine will be started at some point via runBlocking. Code outside of a coroutine will care about a thread-local value. To be able to access it both inside and outside of a coroutine - isn't that precisely the idea behind ThreadLocal.asContextElement?

So in this scenario, does the implementation guarantee that the value will be restored? In my example, outside of runBlocking? (my tests do show the value being restored after runBlocking)

I hope the answer is yes. But even so, you should realize the inconsistency here:

Value of a thread-local is well-defined inside a coroutine which has it in its context.
Value of a thread-local is well-defined outside of coroutine.
Value of a thread-local is undefined inside a coroutine which doesn't have it in its context.

Isn't it well reasoned to expect 2 and 3 to act the same? Outside of a coroutine there is no context at all. It's troublesome that the behavior is different. I am sure that someone will get bitten by this.

alamothe on 25 Feb 2019

@alamothe The value of thread-local is guaranteed to be restored when you go outside of the scope of any of the coroutines. If you write the following code:

// not in a coroutine here
runBlocking(t.asContextElement()) {

}
// the value of t is guaranteed to be restored here

elizarov on 25 Feb 2019

Got it. The way I solved it is to explicitly add to context every time I start a coroutine like that, and even implement as a helper function.

I still consider this behavior unintuitive, and suggest adding a warning to the docs, stating that "the value of a thread-local is undefined inside a coroutine which doesn't have it in its context", with some examples.

alamothe on 25 Feb 2019

It would also be nice to add the warning to the documentation of ThreadContextElement too.
At the moment it does not mention that restoreThreadContext() may not be called when it would be expected by users implementing ThreadContextElement themselves.

And it would be much better to have thread state restored outside withContext { }, because a library may not have control over initial context of the coroutine, so this feature becomes unusable there.

frost13it on 24 Sep 2019

👍1

The kotlinx-coroutines-slf4j module seems to be a bit broken because of this problem.
Running the following snippet:

import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.slf4j.MDCContext
import kotlinx.coroutines.withContext
import org.slf4j.LoggerFactory

private val logger = LoggerFactory.getLogger("MainKt")

suspend fun main() {
    logger.info("Before operation")
    withContext(MDCContext(mapOf("operationId" to "123"))) {
        withContext(Dispatchers.IO) {
            logger.info("Inside operation")
        }
    }
    logger.info("After operation")
}

Procuces the following output:

[] INFO  MainKt - Before operation
[operationId=123] INFO  MainKt - Inside operation
[operationId=123] INFO  MainKt - After operation

Here the expression in square brackets is Logback's %mdc.

As you can see, MDC is not cleaned up after exiting from withContext {}. And there is no top-level scope where MDCContext can be injected.

Even when this is possible, it is not convenient at all to put "empty" value of thread-local state in all root coroutines to ensure that it will not left undefined after running some code. Could you please fix it?

Should I create a separate issue and is there a chance that this design of ThreadContextElement will be changed in future and such usages will work?

frost13it on 25 Sep 2019

@frost13it Thanks for a use-case. I've reopened this way. Let's see if it would be somehow possible to fix it (ideally, in an efficient way).

elizarov on 25 Sep 2019

@frost13it I've looked at this problem again. Can you, please, elaborate on your actual use-case? In particular, you are giving an example where thread local value leaks into suspend fun main. Is it your real use-case?

Note, that you can wrap your top-level code into withContext(MDCContext(emptyMap())) { ... }. If you do this, you should see that your operationId is cleared. Would it be a solution for you?

elizarov on 25 Sep 2019

Unfortunately it is not a good solution for me.
I'm writing a logging library. What am I trying to achieve is to propagate contextual data (called LoggingContext) in a call graph consisting of both suspend and non-suspend functions without passing it explicitly.
Coroutine context allows to achieve that in suspend code, and ThreadLocal allows to achieve that in non-suspend single-threaded code. Provided that suspend functions can only be called from other suspend functions, I see the following solutions for this:

Take LoggingContext from the context of the coroutine currently running on the current thread. It would be possible if there was a ThreadLocal containing CoroutineContext of that coroutine. But as far as I know, no such ThreadLocal exists at the moment.
Make thread local value of LoggingContext match that in the coroutine context.
It seemed that ThreadContextElement fits perfectly here. But it is hard to guarantee that context of top-level coroutines will contain an initial value of LoggingContext. Forcing users of the library to add it manually is inconvenient and error-prone. Especially in the case with suspend fun main().
That is why I would love to see this issue fixed.

Don't you think that kotlinx-coroutines-slf4j (which works the same way) is pretty difficult to use correctly because of this issue? I'm afraid that for now the same is true for any similar library.

frost13it on 25 Sep 2019

We assume that kotlinx-coroutines-slf4j is used in the following way. Consider the application that is using SLF4J for logging, uses coroutines, and uses MDC. This application adds kotlinx-coroutines-slf4j to its dependencies and adds MDCContext(emptyMap()) to its top-level scope to give a "signal" to coroutine machinery that it is going to use MDCContext. After that, its MDC will be always correctly set/restored and reset and switching between different coroutines.

Unfortunately, this does not currently affect GlobalScope and requires some boiler-plate code from the application writer. E.g., if my application has code like val myScope = CoroutineScope(someContext) then I, as an application author, have to remember to add MDCContext(emptyMap()) to this scope's context.

What we can do to improve this situation is to give integration modules like kotlinx-coroutines-slf4j and ability to globally inject context elements to all scopes (including GlobalScope). This should considerably improve usability. All you'll need to do in this case is to add kotlinx-coroutines-slf4j to dependencies and that's it. What do you think?

P.S. The downside of this approach to consider is that anyone who adds kotlinx-coroutines-slf4j to their dependencies will get a performance hit even if they don't use MDC (as an afterthought -- there's no other reason to add kotlinx-coroutines-slf4j, but for MDC support).

elizarov on 26 Sep 2019

I think it would be sufficient in my case. It would be great to have such an ability in a library.

frost13it on 26 Sep 2019

Are there any updates on this issue?

frost13it on 8 Nov 2019

👍2

Hello, can you please share the latest plan?

alamothe on 22 Feb 2020

There is a PR #1577 that address this problem, but we have not yet made a decision to merge it, since it has quite non-trivial performance and future code maintenability implications. How bad the problem is affecting you? Maybe we could merge it under the flag to let you experiment with solution or something like that?

elizarov on 13 Mar 2020

This problem prevents me from implementing tracing and a coroutine-friendly MDC alternative.
The issue causes ThreadContextElement to be unreliable, because absence of an element in any of application's coroutines may lead to unpredictable results.
It seems hard to achive running all the application code (including non-Kotlin-aware libraries and frameworks) inside some coroutine and strictly adhere to structured concurrency.

As for "feature flag", it would be suitable to enable the feature via some classpath resource, to make adding a dependency on a library sufficient to use it.

frost13it on 13 Mar 2020

@elizarov Are there any news?

frost13it on 26 May 2020

I've rebased PR #1577 onto the latest develop branch. We have not had a chance to evaluate its performance impact and make a decision yet. Maybe we can make it optional.

elizarov on 27 May 2020

I also have the same problem, specifically with MDC. would be happy to see a neat solution.

p.s: @frost13it I wonder your logging solution, if it is open-source, could you share it?

kemalizing on 2 Jun 2020

@kemalizing It is planned to be open-source, but I can't release it yet

frost13it on 6 Jun 2020

Thanks @elizarov for your work and everyone else in the discussion.

We have solved this by making sure every time we start a coroutine, we put all thread locals that we care about in its context.

But I totally understand how this doesn't work for libraries.

alamothe on 8 Jun 2020

Hello,

I am experiencing the same issue even with the code base that is not huge and is mostly under our control. Our services use MDC logging, opencensus tracing, thread locals, and GRPC contexts. It has been a huge challenge to wire everything up correctly to work with coroutine context propagation.

For the code under our control, we need to remember to seed all the different context elements every time somebody starts a coroutine. This leads to tight coupling, every piece of code that wants to start a coroutine needs to be aware of all the thread context elements in the entire codebase. And of course, it is very error-prone.

The problem is exacerbated by GRPC launching coroutines without giving everyone the opportunity to override the behavior. I imagine other libraries wanting to do the same.

The described behavior is surprising, to say the least. I don't totally understand why global injection is needed to fix this. In the examples above, there are clear scope/context entry and exit points. Could you explain why the restoreThreadContext is not invoked at the block's end? It's especially surprising behavior because if you remove delay invocation from the examples above, withContext works as expected!

akalini on 28 Jul 2020

👍1

One more request: https://youtrack.jetbrains.com/issue/KT-42582

qwwdfsad on 20 Oct 2020

Is there a decision on the #1577?

frost13it on 1 Feb 2021

Yes, we are going to release the fix with the next release

qwwdfsad on 1 Feb 2021

🎉2

Magnificently!

frost13it on 1 Feb 2021

Thanks for fixing this! I've been pulling my hair trying to understand why my MDC values were screwed up for months now without coming across this thread. Is there an ETA on the next release?

luispollo on 28 Feb 2021

We assume that kotlinx-coroutines-slf4j is used in the following way. Consider the application that is using SLF4J for logging, uses coroutines, and uses MDC. This application adds kotlinx-coroutines-slf4j to its dependencies and adds MDCContext(emptyMap()) to its top-level scope to give a "signal" to coroutine machinery that it is going to use MDCContext. After that, its MDC will be always correctly set/restored and reset and switching between different coroutines.

luispollo on 28 Feb 2021

👍3

I'm seeing a strange side-effect which I believe is the result of this change. Using MDCContext will remove keys from the MDC if you call a suspendable function, which is a bit surprising, when in this simple example case you just call delay:

Without MDCContext:

runBlocking {
    MDC.put("key", "value")
    println(MDC.copyOfContextMap)
    delay(50)
    println(MDC.copyOfContextMap)
}

# {key=value}
# {key=value}

With MDCContext:

runBlocking(MDCContext()) {
    MDC.put("key", "value")
    println(MDC.copyOfContextMap)
    delay(50)
    println(MDC.copyOfContextMap)
}

# {key=value}
# null

cyberdelia on 27 Mar 2021

Yes, this is the existing limitation of MDCContext exposed more strictly.

Quoting our KDoc:

Note that you cannot update MDC context from inside of the coroutine simply using MDC.put. These updates are going to be lost on the next suspension and reinstalled to the MDC context that was captured or explicitly specified in contextMap when this object was created on the next resumption.

This is a by-design limitation, that become much more clearer when more than one coroutine is launched.

E.g.:

runBlocking((MDCContext()) {
    launch(someDispatcher) {
        // mutate MDC
    }.join()
    // use MDC here as well
}

do you expect mutated MDC to be available in runBlocking coroutine. What if someDispatcher is Dispatchers.Unconfined and what if it's concurrent? And what if it's concurrent, but it happened to be dispatched on the same thread as runBlocking coroutine?

qwwdfsad on 27 Mar 2021

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Kotlin 1.3.60 support for native part

IgorKey · 3Comments

Correct way to implement MainDispatcher and BackgroundDispatcher in iOS

petersamokhin · 3Comments

Best practices, should use JavaNIO and suspendCoroutine or JavaIO and Dispatchers.IO

ddeya · 3Comments

Room integration

elizarov · 3Comments

LiveDataScop emit not calling

jaozinfs · 3Comments