The following workload appears very slow - 10x slower than the equivalent sync and async-std code. Is this a bug, or am I doing something wrong?
Read a file in a thread, send the file data chunks through a channel to another thread, where an md5 hash is computed.
I put a fully running set of benchmarks here:
https://github.com/TheSven73/tokio_demo_slow
On a dual Intel i5-3340M, Linux 4.15 (Mint):
$ cargo bench
md5/Synchronous time: [148.33 ms 149.13 ms 150.20 ms]
md5/async-std time: [146.35 ms 146.81 ms 147.30 ms]
md5/tokio time: [1.2348 s 1.2609 s 1.2890 s]
$ cargo tree | grep tokio
└── tokio v0.3.4
└── tokio-macros v0.3.1
async fn tokio_benchmark_run(f: File) {
use tokio::{
fs::File,
io::AsyncReadExt,
sync::mpsc::channel,
};
let mut md5 = md5::Context::new();
let mut f: File = f.into();
let (tx, mut rx) = channel(1);
tokio::spawn(async move {
loop {
let mut buf = vec![0u8; BUF_SZ];
let bytes_read = f.read(&mut buf).await.unwrap();
tx.send((buf, bytes_read)).await.unwrap();
if bytes_read == 0 {
break;
}
}
});
loop {
let (buf, len) = rx.recv().await.unwrap();
if len == 0 {
break;
}
tokio::task::block_in_place(|| md5.consume(&buf[..len]));
}
}
Try:
async fn tokio_benchmark_run(mut f: File) {
use tokio::sync::mpsc::channel;
let mut md5 = md5::Context::new();
let (tx, mut rx) = channel(1);
tokio::spawn(async move {
loop {
let mut buf = vec![0u8; BUF_SZ];
let bytes_read = tokio::task::block_in_place(|| f.read(&mut buf).unwrap());
tx.send((buf, bytes_read)).await.unwrap();
if bytes_read == 0 {
break;
}
}
});
loop {
let (buf, len) = rx.recv().await.unwrap();
if len == 0 {
break;
}
tokio::task::block_in_place(|| md5.consume(&buf[..len]));
}
}
I get similar results across the three approaches.
Though this does not answer the question why the tokio::fs is slower - it uses task::spawn_blocking to send the task to separate thread pool but async-std::fs uses the same approach.
// async-std no longer needs explicit call to spawn_blocking():
// it detects blocking and automatically spawns a new executor
// thread:
// https://async.rs/blog/stop-worrying-about-blocking-the-new-async-std-runtime/
md5.consume(&buf[..len]);
I understand they posted a blog post about this, but as far as I understand, this was never actually implemented because it caused various problems. In any case, this doesn't seem to be the cause of the performance difference. As @luben notes, the difference appears to lie in the file IO.
I think the File IO performance issue is already known and tracked in https://github.com/tokio-rs/tokio/issues/2994
@Darksonn I was already wondering about that. If async-std moves from explicit marking to a heuristic (to detect blocking), then that heuristic better work in all cases, or users will not be happy...
@luben I suspect #2994 may be involved, but there is probably more going on. The original benchmark converted sync File to async File (async-std or tokio). When I modify the benchmark to directly open the async Files from a path, the slowdown goes down from 10x to 6x-5x. And the latter is the slowdown reported for #2994.
Is it possible that the File conversion adds another order of magnitude slowdown?
The optics for a simple user like me are not good: tokio is an async I/O library - its native File I/O should not be an order of magnitude slower than std::io...
Converted benchmark:
https://github.com/TheSven73/tokio_demo_slow/tree/async-open
async fn tokio_benchmark_run(path: &Path) {
use tokio::{
fs::File,
io::AsyncReadExt,
sync::mpsc::channel,
};
let mut md5 = md5::Context::new();
let mut f = File::open(path).await.unwrap();
let (tx, mut rx) = channel(1);
tokio::spawn(async move {
loop {
let mut buf = vec![0u8; BUF_SZ];
let bytes_read = f.read(&mut buf).await.unwrap();
tx.send((buf, bytes_read)).await.unwrap();
if bytes_read == 0 {
break;
}
}
});
loop {
let (buf, len) = rx.recv().await.unwrap();
if len == 0 {
break;
}
tokio::task::block_in_place(|| md5.consume(&buf[..len]));
}
}
The optics for a simple user like me are not good: tokio is an async I/O library - its native File I/O should not be an order of magnitude slower than std::io...
Although it is of course important to make sure that the File api is not unnecessarily slow, it is a fact that File IO is not async/await's strong point due to lack of OS apis. If all you're doing is opening files, generally you should not be using async/await.
Although it is of course important to make sure that the
Fileapi is not unnecessarily slow, it is a fact thatFileIO is not async/await's strong point due to lack of OS apis. If all you're doing is opening files, generally you should not be using async/await.
It may be true that my benchmark is not typical of most situations where you'd use async/await. But should tokio really be performing so poorly in File I/O compared to its competition, such as async-std and smol ?
Updated benchmark to include smol:
On a dual Intel i5-3340M, Linux 4.15 (Mint):
$ cargo bench
md5/Synchronous time: [148.33 ms 149.13 ms 150.20 ms]
md5/async-std time: [146.35 ms 146.81 ms 147.30 ms]
md5/tokio time: [1.2348 s 1.2609 s 1.2890 s]
md5/smol time: [192.35 ms 195.45 ms 198.22 ms]
No, it shouldn't be performing that badly of course.
Ok so I looked at the flamegraph for the tokio benchmark. Turns out it spends most of its time... zeroing buffers !
It seems that tokio's fs::File.read() behaves very differently compared to the other runtimes: on my system, it nearly always reads in chunks of 16K. The others try to fill the whole buffer before returning. Both are acceptable behaviour for read.
I used some unsafe code to eliminate zeroing, and now tokio behaves as I'd expect:
$ cargo bench
md5/Synchronous time: [141.49 ms 141.77 ms 142.05 ms]
md5/async-std time: [140.05 ms 140.20 ms 140.36 ms]
md5/tokio time: [155.29 ms 156.67 ms 158.10 ms]
md5/smol time: [200.92 ms 201.73 ms 202.53 ms]
Looks like there is no bug. Or rather, the bug was in my benchmark...
Closing this. @Darksonn and @luben thank you for your help !
Oh, so it's because your buffer was much larger than the amount of data that Tokio read in each call. It would probably be better to write the the benchmark like this:
tokio::spawn(async move {
let mut buf = Vec::with_capacity(5 * 1024 * 1024);
loop {
buf.clear();
let bytes_read = f.read_buf(&mut buf).await.unwrap();
tx.send((buf[..bytes_read].to_vec(), bytes_read)).await.unwrap();
if bytes_read == 0 {
break;
}
}
});
This uses read_buf to correctly handle uninitialized memory.
This uses
read_bufto correctly handle uninitialized memory.
Thanks! That's actually the first thing I tried, but I backed it out because none of the other executors have read_buf. I now use one line of unsafe code to create an uninitialized Vec. This ensures a fair comparison.
I'm also more referring to the shape of the loop where the vector is reused.
I see what you mean... the to_vec() will do an (unnecessary) memory copy, but of exactly the right size, so the tokio benchmark will be much better. Except that the other benchmarks might suffer from the extra copy?
This actually sounds interesting, I'll try it out...
Turns out I was able to eliminate the clear() from the inner loop:
let mut buf = vec![0u8; 5 * 1024 * 1024];
tokio::spawn(async move {
loop {
let bytes_read = f.read(&mut buf).await.unwrap();
if bytes_read == 0 {
break;
}
tx.send(buf[..bytes_read].to_vec()).await.unwrap();
}
});
while let Some(buf) = rx.recv().await {
tokio::task::block_in_place(|| md5.consume(&buf));
}
After doing this, the other benchmarks regress by about 3% (which is within Criterion's accuracy range, not noise) probably because of the extra copy. Still an excellent result, because this eliminates the unsafe code I had to use before.
I found a way to do this without extra copies, and using only safe code, but that will only work on tokio, so it wouldn't be a fair comparison:
tokio::spawn(async move {
loop {
let mut buf = bytes::BytesMut::with_capacity(BUF_SZ);
let bytes_read = f.read_buf(&mut buf).await.unwrap();
if bytes_read == 0 {
break;
}
tx.send(buf).await.unwrap();
}
});
while let Some(buf) = rx.recv().await {
tokio::task::block_in_place(|| md5.consume(&buf));
}