Skip to content

That time I broke KotlinX.Coroutines

A story about contributing to open source, how the community finds bugs before they reach customers, and a weird edge case I didn't know about.

A bit of context

This story starts in 2023: Formulaide 1.0 has long been deployed, and I'm working on the 2.0 version. One of the major differences is the migration from Kotlin React to Compose HTML—but the one most relevant to this story is that many core components of Formulaide 1.0 have been extracted to standalone libraries.

Among them, Pedestal Cache is the flagship: a collection of pure Kotlin, multiplatform cache algorithms based on KotlinX.Coroutines. They are designed to be lightweight so they can be embedded at the core of applications: for example, I use them with very short expiration times (~1 minute) to make authentication virtually free.

And, they work very well. They're very fast, are very simple to use, and work the same server-side and client-side. Except one day, one of my tests stops working, specifically on Kotlin/JS:

IllegalStateException: This mutex is not locked
    at protoOf.unlock_uksyr8(/tmp/_karma_webpack_931909/commons.js:67235)
    at protoOf.doResume_5yljmg(/tmp/_karma_webpack_931909/commons.js:3862)
    at protoOf.invoke_qflhgo(/tmp/_karma_webpack_931909/commons.js:3817)

This stacktrace is not particularly convenient to work with. First, because my training is around the JVM, and I'm not used to debugging things in the JS ecosystem. Second, this is happening inside the JS code generated by the Kotlin compiler. And, third, because I never actually call Mutex.unlock.

What I do call is Mutex.withLock, which is the function that automatically acquires the lock, performs an operation, and releases it, even if an exception is thrown. My code looks like:

val counter = get(id)
lock.withLock {
    data[id] = counter.copy(canRead = counter.canRead + user)
}

But that error would only be thrown if unlock was called twice—so, what's happening?

Understanding the problem

Being completely lost, I report the problem to the KotlinX.Coroutines team. Dmitry Khalanskiy immediately manages to create a reproducer and declares it a bug of the Kotlin compiler itself: KT-58685 is born. Indeed, a finally block inside an inline function that is itself called within a try…catch, is called twice. Of course, this happens to be the signature of the Mutex.withLock function:

// Simplified reproducer 

private suspend inline fun <T> withLock(block: () -> T): T {
    lock()
    try {
        return block()
    } finally {
        unlock()  // a finally both within an inline function…
    }
}

fun reproduce() {
    try {
        withLock {  // …that it itself called within a try…catch!
            null
        } ?: throw Exception("seems fine")
    } catch (e: Exception) {
        throw e
    }
}

In this particular example, in Kotlin 1.8.21 to 1.9.20, the unlock function is called twice!

After a few months of back-and-forth, Robert Jaros (author of KVision) creates a workaround: if the finally is the problem, why not simply remove it?

private suspend inline fun <T> withLock(block: () -> T): T {
    lock()
    val result = try {
        block()
    } catch (e: Throwable) {
        unlock()
        throw e
    }
    unlock()
    return result
}

This code isn't as beautiful as the original version, but it does fix the problem, right? After all, finally is just a shortcut to execute a piece of code in both success and failed cases.

After applying the workaround on my projects, I start to move to contributing it to KotlinX.Coroutines.

Contributing to KotlinX.Coroutines

At this point, the bug isn't fixed in the compiler. To avoid breaking more code until then, and because some people may continue using older compiler versions for a while, I create #3881, my first contribution to KotlinX.Coroutines. It describes the story of the problem this far, adding unit tests to trigger it, and applies Robert's workaround to Mutex.withLock and Semaphore.withPermit.

After a few rounds of review, it is merged on November 2nd, around six months after the initial discovery. The workaround is published in KotlinX.Coroutines 1.8.0-RC.

I must admit, rewriting code written by Roman Elizarov (the author of KotlinX.Coroutines, and, at the time, the Kotlin lead) feels strange. His code is particularly easy to read. At times, it may be extremely complex, but it is always well-organized and easy to navigate. In this PR, I was (knowingly) making it worse: less readable, less obvious. But I had a good reason, I was fixing a bug!

Everyone is happy and everything is well

And it worked, there were no issues, and everyone clapped!

Who am I kidding—you read the title. Let's play a game: do you think you would've caught the bug during review? I certainly wouldn't have, and indeed I didn't. This is definitely one of those that will stay in a corner of my head forever.

Here's a tip: it's somewhere in this change.

Before: using a finally block
private suspend inline fun <T> withLock(block: () -> T): T {
    lock()
    try {
        return block()
    } finally {
        unlock()
    }
}

After: inlining the finally block
private suspend inline fun <T> withLock(block: () -> T): T {
    lock()
    val result = try {
        block()
    } catch (e: Throwable) {
        unlock()
        throw e
    }
    unlock()
    return result
}

Genuinely, try to find it. Don't give up too soon ♪


If you've written Kotlin semi-proficiently before, you already know what breaks this, but you probably haven't realized the consequences of that feature. If you've never written Kotlin before… I'm sorry for teasing you, and I'm sorry that is probably going to sound horrifying.

Well, here I go. This is an inline function that takes a lambda parameter. That parameter isn't marked noinline nor crossinline. This means that it supports non-local returns.

A non-local return is when a return statement from within an inline function can return from the caller function. This may seem frightening at first, but we really do use them everywhere:

fun hasZeros(ints: List<Int>): Boolean {
    ints.forEach {
        if (it == 0) return true // returns from hasZeros
    }
    return false
}

And, of course, users can do the same with withLock:

mutex.withLock {
    if (protectedValue == null)
        return
}

This means that there are actually three ways withLock can finish executing:

private suspend inline fun <T> withLock(block: () -> T): T {
    lock()
    val result = try {
        block() //(3)!
    } catch (e: Throwable) {
        unlock()
        throw e //(2)!
    }
    unlock()
    return result //(1)!
}

  1. The flow of execution can stop here in the 'normal' case: no exceptions were thrown and no non-local returns were used.
  2. The flow of execution can stop here if an exception is thrown in the body of block.
  3. The flow of execution can stop here if a non-local return is used. The rest of the function will not be executed!

In idiomatic Kotlin code, the finally block is actually extended to also execute when non-local returns are used, so it actually handles all three cases presented here. Most Kotlin developers are thus completely unaware of this edge case. But well, I removed the finally :)

Post-mortem

The bug was caught by Amejonah1200 when they started using KotlinX.Coroutines 1.8.0-RC in their projects. Thanks to them, this never reached production. After that, the KotlinX.Coroutines team handled the rest: they rolled back my changes, introduced a bunch of tests to ensure no one would forget about non-local returns in the future, and published 1.8.0. Since Kotlin 1.9.30 had been released in the meantime, and contained the fix for the original compiler bug thanks to Artem Kobzar, there was no need for the workaround anymore.

On my part… well, I learned one more trap to catch during code review. I haven't found any other bugs this inconvenient or weird since, and I'm glad that this is an extremely rare occurrence.

In my opinion, the mistake here is assuming that two constructs are equivalent, when one of them is purpose-built for a use-case, and the second isn't. Time and time again, we see that library authors and language designers are one step ahead (because they encounter the problems first), and all of this had already been figured out in the Kotlin language itself.