Tokio: Possible regression in 0.2.14 and later (hangs, stack overflow)

Created on 20 Apr 2020 · 7 Comments · Source: tokio-rs/tokio

Version

0.2.14 and higher (up to 0.2.18 so far)

Platform

  • Linux ... 5.6.3-arch1-1 #1 SMP PREEMPT Wed, 08 Apr 2020 07:47:16 +0000 x86_64 GNU/Linux
  • Linux ... 5.6.5-1.el7.elrepo.x86_64 #1 SMP Thu Apr 16 14:02:22 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

Description

Disclaimer

Unfortunately, I am currently unable to provide an MRE or even a code sample that reproduces the issue (NDA), but I'm filing this anyway in case somebody else has run into the same thing.

Problem

I'm developing a reverse-proxy-like application, and after updating tokio from 0.2.13 to 0.2.18 I found that my app hangs while consuming 100% of one CPU core (out of many). As mentioned above, I cannot disclose all the details, but in general the app does the following:

  • Establishing lots of "internal" connections (tcp, uds, about 500) simultaneously
  • Establishing more "internal" connections on demand (about 1.5k) _almost_ simultaneously
  • Re-establishing broken "internal" connections (e.g. when the endpoint terminates the connection unexpectedly)
  • Accepting incoming connections (tcp, uds) from "external" clients
  • Proxying data (ratio ~ 8:1, i.e. 1 "internal" connection mentioned above is shared among ~8 "external" clients)

Under the hood, I use FuturesUnordered and Selects from futures-util and do a lot of manual polling in the order I found most suitable. I don't spawn anything and use the default tokio runtime (via the #[tokio::main] macro).

After upgrading to tokio-0.2.18 I found that my app establishes about 100 connections to the internal servers and then hangs completely, consuming 100% of a CPU core. All attempts to connect to the port it listens on fail with a timeout.

I then thought, "okay, there's been a major upgrade to tokio's scheduler in 0.2.14, so probably I shouldn't manage all the futures myself and should just spawn the tasks!", so I replaced FuturesUnordered and Selects with spawns and yay! It seemed to have solved the whole issue.

... until a lot of "internal" servers went offline, the connections were scheduled to be re-established, and I got a stack overflow error.

So I had to downgrade to tokio-0.2.13, where everything just works (tm).

My question is: how do I investigate the root cause of my issue? Where should I look first?

Eventually, I would like to provide an MRE, but so far it's just a cry for help :)

Thanks!

A-tokio C-question I-hang M-runtime S-waiting-on-author

All 7 comments

Are you by any chance using the futures v0.1 FuturesUnordered or one of the early v0.3 versions from before this PR?

Nope, only 0.3

Well, there have been a few issues with hangs after this got introduced, which exposed a collection of buggy sub-schedulers such as FuturesUnordered (#2047) or Shared (#2130). The former has been fixed in futures version 0.3.2, and the latter has not yet been fixed.
Using futures::executor::block_on from within an async function falls in this category.

If your application hangs, it's likely due to such a buggy sub-scheduler somewhere. As for the stack overflow, that sometimes happens when people try to make big stack arrays, e.g.:

let mut buf = [0; 4096];
stream.read(&mut buf).await?;

and stuff like this. This should be avoided in futures, because they make the future object massive, which can cause the call to tokio::spawn itself to stack overflow due to moving a very big object a few stack frames down. You should use a vector instead.
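To illustrate the point about future size, here is a minimal std-only sketch. The function names are hypothetical, and std::future::ready stands in for an actual stream read, so this only approximates the situation described above. It compares the size of a future that keeps a 4096-byte array alive across an .await with one that keeps a Vec instead:

```rust
use std::future::ready;
use std::mem::size_of_val;

// Hypothetical stand-in for a task that reads into a stack array.
// `buf` is live across the `.await`, so it is stored inside the future.
async fn with_stack_buffer() -> usize {
    let mut buf = [0u8; 4096];
    ready(()).await; // placeholder for something like `stream.read(&mut buf).await`
    buf[0] = 1;
    buf.len()
}

// Same shape, but the buffer lives on the heap; the future only holds
// the Vec's pointer, length, and capacity.
async fn with_heap_buffer() -> usize {
    let mut buf = vec![0u8; 4096];
    ready(()).await;
    buf[0] = 1;
    buf.len()
}

// Returns (size of the array-backed future, size of the Vec-backed future).
// The futures are only measured here, never polled.
fn future_sizes() -> (usize, usize) {
    (
        size_of_val(&with_stack_buffer()),
        size_of_val(&with_heap_buffer()),
    )
}

fn main() {
    let (stack, heap) = future_sizes();
    println!("array-backed future: {} bytes", stack);
    println!("Vec-backed future:   {} bytes", heap);
}
```

The exact sizes depend on the compiler version, but the array-backed future has to be at least as large as the array it carries, and nested futures add up in the same way.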

Of course, it could also just be an infinite recursive loop. Your backtrace would probably tell you in that case.
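As a purely hypothetical illustration (the reporter's code was never shared), reconnect logic is one place where accidental recursion can creep in: a handler that calls itself on every failure grows the stack with each retry. A minimal sketch of the loop-based alternative follows. try_connect, reconnect_with_backoff, and the delay values are all made up for this example, and the blocking sleep is a placeholder for whatever timer the async runtime provides:

```rust
use std::thread::sleep;
use std::time::Duration;

// Hypothetical connect attempt: fails until attempt number `succeed_on`.
fn try_connect(attempt: u32, succeed_on: u32) -> Result<(), String> {
    if attempt >= succeed_on {
        Ok(())
    } else {
        Err(format!("attempt {} failed", attempt))
    }
}

// Retries in a loop rather than by calling itself, so the stack depth
// stays constant no matter how many attempts fail.
// Returns the number of the attempt that succeeded.
fn reconnect_with_backoff(max_attempts: u32, succeed_on: u32) -> Result<u32, String> {
    let mut delay = Duration::from_millis(1);
    for attempt in 1..=max_attempts {
        match try_connect(attempt, succeed_on) {
            Ok(()) => return Ok(attempt),
            Err(_) if attempt < max_attempts => {
                // Blocking sleep only to keep the sketch std-only.
                sleep(delay);
                // Double the delay, capped at 8 ms for this example.
                delay = (delay * 2).min(Duration::from_millis(8));
            }
            Err(e) => return Err(e),
        }
    }
    Err("max_attempts was zero".to_string())
}

fn main() {
    println!("{:?}", reconnect_with_backoff(5, 3));
    println!("{:?}", reconnect_with_backoff(2, 3));
}
```

Whether this applies to any given application depends on how its retries are actually structured; a backtrace from the overflowing thread is still the most direct evidence.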

Do a snapshot of the process (thread stacks) when stuck at 100% CPU. That should show which fn it is stuck in.

@Darksonn @carllerche Thanks for the advice! I'll test both my stack-allocated stuff and the thread snapshot and come back with more facts in a couple of days.

I was wondering if you had any further details on this issue? If not, I will have to close the issue due to lack of details.

Yep, I guess let's close it. Since I got rid of all my "custom" futures with twisted logic and shifted all the hard work onto tokio (via spawn and its friends), I don't see any issues.

So my best guess for now is that I've made some mistakes in the app's logic initially, and it just "happened" to work as expected.

Thanks, everyone, for your attention, and sorry I was not able to provide further details.
