Given a TCP echo client that:
Running this client on the threaded runtime results in significant tail latencies.
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::TcpStream;
use std::net::SocketAddr;
use std::sync::Arc;
use std::time::Instant;
#[tokio::main(core_threads=2)]
async fn main() {
const N: usize = 200;
const STEP: usize = N/10;
const BUFFER: usize = 8*1024;
let addr: SocketAddr = "127.0.0.1:8080".parse().unwrap();
let data = Arc::new(vec![10_u8; BUFFER]);
let mut handles = Vec::with_capacity(N);
let mut elapsed = Vec::with_capacity(N);
for i in 0..N {
if i % 10 == 9 {
tokio::task::yield_now().await;
}
let data = data.clone();
let now = Instant::now();
handles.push(tokio::spawn(async move {
let tts = now.elapsed();
let mut buf = vec![10_u8; BUFFER];
let mut socket = TcpStream::connect(addr).await.unwrap();
socket.write_all(&data).await.unwrap();
socket.read_exact(&mut buf).await.unwrap();
assert_eq!(buf[..], data[..]);
// now.elapsed()
tts
}));
}
for handle in handles.drain(..) {
elapsed.push(handle.await.unwrap());
}
elapsed.sort();
let mut i = STEP;
while i <= N {
println!("{}th = {:?}; tts", i, elapsed[i-1]);
i += STEP;
}
}
Running this against the echo example results in the following output:
20th = 241.19µs; tts
40th = 476.356µs; tts
60th = 697.187µs; tts
80th = 23.421715ms; tts
100th = 23.651342ms; tts
120th = 23.894434ms; tts
140th = 48.230702ms; tts
160th = 49.305577ms; tts
180th = 50.433882ms; tts
200th = 51.743111ms; tts
Compare this with basic_scheduler
20th = 112.65µs; tts
40th = 129.878µs; tts
60th = 145.256µs; tts
80th = 159.83µs; tts
100th = 176.598µs; tts
120th = 192.344µs; tts
140th = 209.232µs; tts
160th = 223.36µs; tts
180th = 243.565µs; tts
200th = 278.131µs; tts
This behavior is most likely a conflation of a number of factors. In the above example, the primary issue is spawning many tasks from the main function. When using the threaded scheduler, the main function runs outside of the scheduler. Spawned tasks are sent to the scheduler using the injection queue. This injection queue (MPMC) is slower than the scheduler's primary queue (SPMC).
When the scheduler is under load, it heavily prioritizes its local queue. In the above example, the scheduler is under load, so it prioritizes already spawned tasks instead of checking for new tasks. This results in the time to first poll for tasks to be very high.
This behavior can be verified by wrapping the contents of the main function with a spawn:
#[tokio::main(core_threads=2)]
async fn main() {
tokio::spawn(async {
}).await.unwrap();
}
Doing this changes the output to:
20th = 2.735µs; tts
40th = 41.12µs; tts
60th = 59.978µs; tts
80th = 80.126µs; tts
100th = 100.416µs; tts
120th = 136.696µs; tts
140th = 191.012µs; tts
160th = 244.484µs; tts
180th = 527.24µs; tts
200th = 29.068916ms; tts
And if we increase the number of threads to 8, we get:
20th = 2.007µs; tts
40th = 5.823µs; tts
60th = 18.551µs; tts
80th = 32.356µs; tts
100th = 39.56µs; tts
120th = 51.522µs; tts
140th = 72.05µs; tts
160th = 110.676µs; tts
180th = 160.459µs; tts
200th = 26.270611ms; tts
This is better. Notice how adding threads reduces the latencies compared to basic_scheduler. However, the maximum latency is surprisingly high (26ms) vs. basic_scheduler. I have not yet investigated why that is the case.
I believe the fix for the injection queue will require:
The current injection queue is fairly naive. It is a linked list guarded with a mutex. One option to consider is switching to an MPSC intrusive channel with a mutex guarding the head. This probably won't do too much. Workers probably want to acquire tasks from the injection queue in batches. Instead of popping one task at a time, when the local queue is not full, grab a bunch of tasks from the injection queue.
Heuristic wise, when the worker is under load, we may want to consider checking the injection queue more often. This may be less necessary if tasks are acquired in batches. If the injection queue does need to be checked more often, one option would be to check the queue every ~5 ticks if the last time the queue was checked there was a task.
The high tail latency after adding the spawn in the main fn is most likely not due to the injection queue. I believe adding the spawn should prevent the injection queue from being used as the total number of tasks is less than the local queue capacity (256). I do not know what is the cause for that behavior yet.
I spent more time investigating this.
I discovered the high latency spike is due to a call to libc::socket blocking for ~30ms. When creating sockets in a new, multi-threaded, process, the kernel will need to grow its internal FD table.
When running a multi-threaded process, the kernel needs to perform synchronization on the FD table. Synchronization is performed using RCU. In practice, the read lock on the FD table is released after "some period of time" after usage and not immediately after the read is complete. When socket is called and the FD table needs to grow in order to track the new FD, the thread must wait for all readers to complete in order to grow the table. This is the source of the hang in socket.
To diagnose, I used the following:
perf trace -e probe:alloc_fdtable,probe:free_fdtable_rcu,socket target/release/tmp-simple-client
[...]
0.965 ( ): tokio-runtime-/69619 probe:alloc_fdtable:(ffffffffa90ff090)
0.957 (28.139 ms): tokio-runtime-/69619 ... [continued]: socket()) = 64
0.978 (28.173 ms): tokio-runtime-/69618 ... [continued]: socket()) = 65
[...]
Additionally, running the test after "warming up" the FD table results in smooth execution:
20th - 2.92µs
40th - 19.048µs
60th - 36.003µs
80th - 49.98µs
100th - 59.227µs
120th - 66.215µs
140th - 81.66µs
160th - 96.405µs
180th - 110.253µs
200th - 291.598µs
"warming up" the FD table is done by duping sockets to create FDs.
FWIW, I think I've been seeing this exact same behaviour when using recvmsg on a Unix domain socket to receive lots of file descriptors (in an unrelated non-Rust project). I was also able to sort it out by using dup2() to force the FD table to be big enough to fit all sockets I'll receive.
Most helpful comment
I spent more time investigating this.
I discovered the high latency spike is due to a call to
libc::socketblocking for ~30ms. When creating sockets in a new, multi-threaded, process, the kernel will need to grow its internal FD table.When running a multi-threaded process, the kernel needs to perform synchronization on the FD table. Synchronization is performed using RCU. In practice, the read lock on the FD table is released after "some period of time" after usage and not immediately after the read is complete. When
socketis called and the FD table needs to grow in order to track the new FD, the thread must wait for all readers to complete in order to grow the table. This is the source of the hang insocket.To diagnose, I used the following:
full output
Additionally, running the test after "warming up" the FD table results in smooth execution:
"warming up" the FD table is done by
duping sockets to create FDs.