Tokio: File IO slow in Tokio compared to other executors

Created on 23 Nov 2020 · 14Comments · Source: tokio-rs/tokio

The following workload appears very slow - 10x slower than the equivalent sync and async-std code. Is this a bug, or am I doing something wrong?

Read a file in a thread, send the file data chunks through a channel to another thread, where an md5 hash is computed.
I put a fully running set of benchmarks here:
https://github.com/TheSven73/tokio_demo_slow

On a dual Intel i5-3340M, Linux 4.15 (Mint):

$ cargo bench
md5/Synchronous         time:   [148.33 ms 149.13 ms 150.20 ms]
md5/async-std           time:   [146.35 ms 146.81 ms 147.30 ms]
md5/tokio               time:   [1.2348  s 1.2609  s 1.2890  s]

$ cargo tree | grep tokio
└── tokio v0.3.4
    └── tokio-macros v0.3.1

async fn tokio_benchmark_run(f: File) {
    use tokio::{
        fs::File,
        io::AsyncReadExt,
        sync::mpsc::channel,
    };

    let mut md5 = md5::Context::new();
    let mut f: File = f.into();
    let (tx, mut rx) = channel(1);

    tokio::spawn(async move {
        loop {
            let mut buf = vec![0u8; BUF_SZ];
            let bytes_read = f.read(&mut buf).await.unwrap();
            tx.send((buf, bytes_read)).await.unwrap();
            if bytes_read == 0 {
                break;
            }
        }
    });

    loop {
        let (buf, len) = rx.recv().await.unwrap();
        if len == 0 {
            break;
        }
        tokio::task::block_in_place(|| md5.consume(&buf[..len]));
    }
}

A-tokio C-bug

Source

TheSven73

All 14 comments

Try:

async fn tokio_benchmark_run(mut f: File) {
    use tokio::sync::mpsc::channel;

    let mut md5 = md5::Context::new();
    let (tx, mut rx) = channel(1);

    tokio::spawn(async move {
        loop {
            let mut buf = vec![0u8; BUF_SZ];
            let bytes_read = tokio::task::block_in_place(|| f.read(&mut buf).unwrap());
            tx.send((buf, bytes_read)).await.unwrap();
            if bytes_read == 0 {
                break;
            }
        }
    });

    loop {
        let (buf, len) = rx.recv().await.unwrap();
        if len == 0 {
            break;
        }
        tokio::task::block_in_place(|| md5.consume(&buf[..len]));
    }
}

I get similar results across the three approaches.

Though this does not answer the question why the tokio::fs is slower - it uses task::spawn_blocking to send the task to separate thread pool but async-std::fs uses the same approach.

luben on 23 Nov 2020

// async-std no longer needs explicit call to spawn_blocking():
// it detects blocking and automatically spawns a new executor
// thread:
// https://async.rs/blog/stop-worrying-about-blocking-the-new-async-std-runtime/
md5.consume(&buf[..len]);

I understand they posted a blog post about this, but as far as I understand, this was never actually implemented because it caused various problems. In any case, this doesn't seem to be the cause of the performance difference. As @luben notes, the difference appears to lie in the file IO.

Darksonn on 23 Nov 2020

👍1

I think the File IO performance issue is already known and tracked in https://github.com/tokio-rs/tokio/issues/2994

luben on 23 Nov 2020

👍1

@Darksonn I was already wondering about that. If async-std moves from explicit marking to a heuristic (to detect blocking), then that heuristic better work in all cases, or users will not be happy...

TheSven73 on 23 Nov 2020

@luben I suspect #2994 may be involved, but there is probably more going on. The original benchmark converted sync File to async File (async-std or tokio). When I modify the benchmark to directly open the async Files from a path, the slowdown goes down from 10x to 6x-5x. And the latter is the slowdown reported for #2994.

Is it possible that the File conversion adds another order of magnitude slowdown?

The optics for a simple user like me are not good: tokio is an async I/O library - its native File I/O should not be an order of magnitude slower than std::io...

Converted benchmark:
https://github.com/TheSven73/tokio_demo_slow/tree/async-open

async fn tokio_benchmark_run(path: &Path) {
    use tokio::{
        fs::File,
        io::AsyncReadExt,
        sync::mpsc::channel,
    };

    let mut md5 = md5::Context::new();
    let mut f = File::open(path).await.unwrap();
    let (tx, mut rx) = channel(1);

    tokio::spawn(async move {
        loop {
            let mut buf = vec![0u8; BUF_SZ];
            let bytes_read = f.read(&mut buf).await.unwrap();
            tx.send((buf, bytes_read)).await.unwrap();
            if bytes_read == 0 {
                break;
            }
        }
    });

    loop {
        let (buf, len) = rx.recv().await.unwrap();
        if len == 0 {
            break;
        }
        tokio::task::block_in_place(|| md5.consume(&buf[..len]));
    }
}

TheSven73 on 23 Nov 2020

The optics for a simple user like me are not good: tokio is an async I/O library - its native File I/O should not be an order of magnitude slower than std::io...

Although it is of course important to make sure that the File api is not unnecessarily slow, it is a fact that File IO is not async/await's strong point due to lack of OS apis. If all you're doing is opening files, generally you should not be using async/await.

Darksonn on 24 Nov 2020

Although it is of course important to make sure that the File api is not unnecessarily slow, it is a fact that File IO is not async/await's strong point due to lack of OS apis. If all you're doing is opening files, generally you should not be using async/await.

It may be true that my benchmark is not typical of most situations where you'd use async/await. But should tokio really be performing so poorly in File I/O compared to its competition, such as async-std and smol ?

Updated benchmark to include smol:
On a dual Intel i5-3340M, Linux 4.15 (Mint):

$ cargo bench
md5/Synchronous         time:   [148.33 ms 149.13 ms 150.20 ms]
md5/async-std           time:   [146.35 ms 146.81 ms 147.30 ms]
md5/tokio               time:   [1.2348  s 1.2609  s 1.2890  s]
md5/smol                time:   [192.35 ms 195.45 ms 198.22 ms]

TheSven73 on 24 Nov 2020

No, it shouldn't be performing that badly of course.

Darksonn on 24 Nov 2020

Ok so I looked at the flamegraph for the tokio benchmark. Turns out it spends most of its time... zeroing buffers !

It seems that tokio's fs::File.read() behaves very differently compared to the other runtimes: on my system, it nearly always reads in chunks of 16K. The others try to fill the whole buffer before returning. Both are acceptable behaviour for read.

I used some unsafe code to eliminate zeroing, and now tokio behaves as I'd expect:

$ cargo bench
md5/Synchronous         time:   [141.49 ms 141.77 ms 142.05 ms]
md5/async-std           time:   [140.05 ms 140.20 ms 140.36 ms]
md5/tokio               time:   [155.29 ms 156.67 ms 158.10 ms]
md5/smol                time:   [200.92 ms 201.73 ms 202.53 ms]

Looks like there is no bug. Or rather, the bug was in my benchmark...
Closing this. @Darksonn and @luben thank you for your help !

TheSven73 on 25 Nov 2020

Oh, so it's because your buffer was much larger than the amount of data that Tokio read in each call. It would probably be better to write the the benchmark like this:

tokio::spawn(async move {
    let mut buf = Vec::with_capacity(5 * 1024 * 1024);
    loop {
        buf.clear();
        let bytes_read = f.read_buf(&mut buf).await.unwrap();
        tx.send((buf[..bytes_read].to_vec(), bytes_read)).await.unwrap();
        if bytes_read == 0 {
            break;
        }
    }
});

This uses read_buf to correctly handle uninitialized memory.

Darksonn on 25 Nov 2020

This uses read_buf to correctly handle uninitialized memory.

Thanks! That's actually the first thing I tried, but I backed it out because none of the other executors have read_buf. I now use one line of unsafe code to create an uninitialized Vec. This ensures a fair comparison.

TheSven73 on 25 Nov 2020

I'm also more referring to the shape of the loop where the vector is reused.

Darksonn on 25 Nov 2020

I see what you mean... the to_vec() will do an (unnecessary) memory copy, but of exactly the right size, so the tokio benchmark will be much better. Except that the other benchmarks might suffer from the extra copy?

This actually sounds interesting, I'll try it out...

TheSven73 on 25 Nov 2020

Turns out I was able to eliminate the clear() from the inner loop:

    let mut buf = vec![0u8; 5 * 1024 * 1024];

    tokio::spawn(async move {
        loop {
            let bytes_read = f.read(&mut buf).await.unwrap();
            if bytes_read == 0 {
                break;
            }
            tx.send(buf[..bytes_read].to_vec()).await.unwrap();
        }
    });

    while let Some(buf) = rx.recv().await {
        tokio::task::block_in_place(|| md5.consume(&buf));
    }

After doing this, the other benchmarks regress by about 3% (which is within Criterion's accuracy range, not noise) probably because of the extra copy. Still an excellent result, because this eliminates the unsafe code I had to use before.

I found a way to do this without extra copies, and using only safe code, but that will only work on tokio, so it wouldn't be a fair comparison:

    tokio::spawn(async move {
        loop {
            let mut buf = bytes::BytesMut::with_capacity(BUF_SZ);
            let bytes_read = f.read_buf(&mut buf).await.unwrap();
            if bytes_read == 0 {
                break;
            }
            tx.send(buf).await.unwrap();
        }
    });

    while let Some(buf) = rx.recv().await {
        tokio::task::block_in_place(|| md5.consume(&buf));
    }

TheSven73 on 25 Nov 2020

Was this page helpful?

0 / 5 - 0 ratings