It seems that Linux finally has a good story when it comes to async I/O: io_uring
While this was only released as part of kernel 5.1, it definitely looks like a game changer when it comes to async I/O perf...
There's no point in going into what io_uring brings to the table in this issue, as it should be pretty clear from the linked PDF document. It is worthwhile to mention that it allows for some super high-perf scenarios by using advanced features such as:

* Pre-registered buffers (with `O_DIRECT`) to avoid expensive page-table manipulation on the kernel side (!)

Some initial tests for file I/O from Node.js point to very substantial latency reductions (see the latency numbers in the linked comment).
I think that supporting this in CoreCLR can lead to substantial improvement of async I/O on Linux...
At the same time, it's not clear to me how/if/when CoreCLR should adopt this, and at what abstraction level...
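For anyone who wants to see the shape of the API: below is a minimal liburing sketch (plain C, not CoreCLR code) of a single read into a pre-registered buffer, the feature mentioned above. The file path, buffer size and queue depth are arbitrary placeholders and error handling is mostly trimmed.

```c
/* Minimal liburing sketch: one read into a pre-registered buffer.
   Build with: gcc demo.c -luring   (needs liburing and kernel >= 5.1) */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>

int main(void)
{
    struct io_uring ring;
    int ret = io_uring_queue_init(8, &ring, 0);            /* 8-entry SQ */
    if (ret < 0) {
        fprintf(stderr, "queue_init: %s\n", strerror(-ret));
        return 1;
    }

    /* Register the buffer once so the kernel pins/maps it up front,
       instead of doing the page-table work on every I/O. */
    static char buf[4096];
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
    io_uring_register_buffers(&ring, &iov, 1);

    int fd = open("/etc/hostname", O_RDONLY);               /* placeholder file */

    /* Queue a read that uses registered buffer index 0, then submit. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read_fixed(sqe, fd, buf, sizeof(buf), 0 /*offset*/, 0 /*buf index*/);
    io_uring_submit(&ring);

    /* Reap the completion; cqe->res is the read() result. */
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("read returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return 0;
}
```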
/cc @tmds
@benaadams Looks similar to Windows RIO?
@omariom In spirit it definitely is, but there are a few key differences:
* `io_uring` is more ad-hoc in this respect and somewhat more dynamic in its nature...
* Can be much less chatty, all the way to completely syscall-less I/O
* Can also read from arbitrary offsets inside a file
That is also one of the options of RIO. It does have a flavour of completion ports, e.g. async with a fast-path sync completion if the data is already ready, rather than having to do a callback.
Extending the registration to all I/O is good, especially with advancements in throughput, e.g. NVMe.
Definitely interesting!
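To make the "fast-path sync completion" comparison a bit more concrete, here is a rough sketch of how that reads with io_uring (my reading of the liburing API, not anything RIO/IOCP-specific): after submitting you can peek the CQ without blocking, and if the result is already there you consume it inline instead of going through a callback. `handle_result` is a hypothetical application handler.

```c
#include <liburing.h>

/* Sketch: submit, then check whether the completion is already available
   (the fast-path sync case), otherwise fall back to waiting.
   handle_result() is a hypothetical application callback. */
void submit_and_reap(struct io_uring *ring, void (*handle_result)(int res))
{
    struct io_uring_cqe *cqe;

    io_uring_submit(ring);

    if (io_uring_peek_cqe(ring, &cqe) != 0) {
        /* Nothing ready yet; a real event loop would go do other work
           and reap later instead of blocking here. */
        io_uring_wait_cqe(ring, &cqe);
    }
    handle_result(cqe->res);
    io_uring_cqe_seen(ring, cqe);
}
```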
In apps consisting of microservices, large connection buffers are less of an issue, and low latency with low jitter is valued more than raw throughput.
So it would be great to have both io_uring and RIO in Kestrel. Kestrel is used by gRPC, which is the most popular transport for communication between microservices.
(I'm the one who was playing with io_uring in Node.js/libuv, mentioned in the OP; sharing some notes here.)
io_uring is more like Windows' "overlapped IO" with IOCP than RIO in terms of usability with files.
> all the way to completely syscall-less I/O
(Referring to kernel-side polling of the submission queue) Note that this requires root, https://github.com/torvalds/linux/commit/3ec482d15cb986bf08b923f9193eeddb3b9ca69f#diff-a196e54ec8b5398427f9df3d2b074478.
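For reference, this is roughly how the kernel-side SQ polling gets requested (a sketch assuming liburing; on the 5.1-era kernels discussed here it fails with EPERM unless you're root, per the commit linked above):

```c
#include <liburing.h>
#include <stdio.h>
#include <string.h>

/* Sketch: set up a ring with kernel-side submission-queue polling
   (IORING_SETUP_SQPOLL), which is what enables the syscall-less path. */
int init_sqpoll_ring(struct io_uring *ring)
{
    struct io_uring_params params;
    memset(&params, 0, sizeof(params));
    params.flags = IORING_SETUP_SQPOLL;
    params.sq_thread_idle = 2000;   /* ms before the kernel thread goes idle */

    int ret = io_uring_queue_init_params(64, ring, &params);
    if (ret < 0)
        fprintf(stderr, "SQPOLL setup failed: %s\n", strerror(-ret));  /* EPERM without root */
    return ret;
}
```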
> RIO's API is somewhat limiting ... in the sense that (IIRC) the completion queues have a finite size...
io_uring still has fixed-size SQs and CQs. It's immediately safe to reuse an SQ slot once io_uring_enter/submit returns (before the kernel is done processing it). There's a tiny bit of info on what happens with full CQs in http://git.kernel.dk/cgit/liburing/commit/?id=76b61ebf1bd17d3a31c3bf2d8236b9bd50d0f9a8 but I'm still uncertain what happens if you submit more events and e.g. never drain the CQ.
> since the sqe lifetime is only that of the actual submission of it, it's possible for the application to drive a higher pending request count than the SQ ring size would indicate. The application must take care not to do so, or it could risk overflowing the CQ ring. By default, the CQ ring is twice the size of the SQ ring. This allows the application some amount of flexibility in managing this aspect, but it doesn't completely remove the need to do so. If the application does violate this restriction, it will be tracked as an overflow condition in the CQ ring. More on that later.
but I can't find the "later" part :). I assume the CQE just gets overwritten.
And here's the answer on CQ overflow: https://twitter.com/axboe/status/1126203058071826432
> CQEs do not get overwritten, cqring.overflow just increments. The app has to be grossly negligent to trigger that, as the CQE ring is twice the SQE ring. If cqring.overflow is ever != 0, the app has failed.
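In practice that seems to mean the application should simply keep the CQ drained. A small sketch of that pattern with liburing's batch helpers (`on_completion` is a hypothetical per-operation handler):

```c
#include <liburing.h>

#define CQ_BATCH 64

/* Sketch: reap every available CQE in batches before submitting more work,
   so the CQ ring (2x the SQ ring by default) never hits the overflow case. */
void drain_completions(struct io_uring *ring,
                       void (*on_completion)(struct io_uring_cqe *cqe))
{
    struct io_uring_cqe *cqes[CQ_BATCH];
    unsigned n;

    while ((n = io_uring_peek_batch_cqe(ring, cqes, CQ_BATCH)) > 0) {
        for (unsigned i = 0; i < n; i++)
            on_completion(cqes[i]);
        io_uring_cq_advance(ring, n);   /* mark the whole batch as seen */
    }
}
```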
Is there some info on how you use this with sockets? In particular how to deal with blocking calls.
Should you add a blocking read/write to io_uring and then check for its completion?
Or do you need to use io_uring for polling? And then when readable/writable add non-blocking reads/writes to it?
Or something else?
To answer my own question: you can use io_uring like epoll in one-shot mode. There is a command to add a poll for an fd, and a command to cancel an ongoing poll.
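A rough sketch of that epoll-one-shot style with liburing (POLL_ADD is one-shot on these kernels, so it has to be re-armed after each completion; handing the data off to the application is left out):

```c
#include <liburing.h>
#include <poll.h>
#include <sys/socket.h>

/* Sketch: arm a one-shot POLL_ADD for a non-blocking socket, and when the
   readiness completion arrives, do the usual non-blocking read and re-arm. */

static void arm_poll(struct io_uring *ring, int sockfd)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_poll_add(sqe, sockfd, POLLIN);
    io_uring_sqe_set_data(sqe, (void *)(long)sockfd);  /* remember the fd */
    io_uring_submit(ring);
}

static void on_readable(struct io_uring *ring, struct io_uring_cqe *cqe,
                        char *buf, size_t len)
{
    int sockfd = (int)(long)io_uring_cqe_get_data(cqe);

    /* The socket reported readable; EAGAIN here just means a spurious wakeup. */
    ssize_t n = recv(sockfd, buf, len, MSG_DONTWAIT);
    if (n > 0) {
        /* ... hand the data to the application ... */
    }

    io_uring_cqe_seen(ring, cqe);
    arm_poll(ring, sockfd);   /* one-shot, so re-arm for the next readiness */
}
```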
When looking into io_uring we also need to consider what operations are privileged, and what kernel resources are needed.
To have wide applicability, it should work in a Kubernetes container deployment.
> CQEs do not get overwritten, cqring.overflow just increments. The app has to be grossly negligent to trigger that, as the CQE ring is twice the SQE ring.

If you're writing to disk, you can control this.
I wonder whether this becomes an issue if you use io_uring for sockets. If you have a lot of polls outstanding for idle connections, activity on those sockets could get you into a CQE overflow.
It does this smarter now and won't drop anything. Since 5.5, I believe.
see commit 1d7bb1d50fb4dc14 ("io_uring: add support for backlogged CQ ring")
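If it matters for CoreCLR, that behaviour can be detected at runtime via the feature flags the kernel reports at ring setup; a sketch (my understanding is that IORING_FEAT_NODROP was added alongside that 5.5 change):

```c
#include <liburing.h>
#include <string.h>

/* Sketch: probe whether the kernel backlogs CQEs instead of dropping them
   (IORING_FEAT_NODROP), so the app knows if it must still guard against
   cqring overflow itself. */
int kernel_backlogs_cqes(void)
{
    struct io_uring ring;
    struct io_uring_params params;
    memset(&params, 0, sizeof(params));

    if (io_uring_queue_init_params(8, &ring, &params) < 0)
        return 0;                               /* io_uring not available */

    int nodrop = (params.features & IORING_FEAT_NODROP) != 0;
    io_uring_queue_exit(&ring);
    return nodrop;
}
```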
Cc: @axboe