It seems that Linux finally has a good story when it comes to async I/O: io_uring
While this was only released as part of kernel 5.1, it definitely looks like a game changer when it comes to async I/O perf...
There's no point in going into what io_uring brings to the table in this issue, as it should be pretty clear from the linked PDF document. It is worthwhile to mention that it allows for some super high-perf scenarios by using advanced features such as:

* Pre-registered buffers (with `O_DIRECT`) to avoid expensive page-table manipulation on the kernel side (!)

Some initial tests for file I/O from Node.js point to very substantial latency reductions (see the latency numbers in the linked comment).
I think that supporting this in CoreCLR can lead to substantial improvement of async I/O on Linux...
At the same time, it's not clear to me how/if/when CoreCLR should adopt this, and at what abstraction level...
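For anyone who wants to see the shape of the API: below is a minimal liburing sketch (plain C, not CoreCLR code) of a single read into a pre-registered buffer, the feature mentioned above. The file path, buffer size and queue depth are arbitrary placeholders and error handling is mostly trimmed.

```c
/* Minimal liburing sketch: one read into a pre-registered buffer.
   Build with: gcc demo.c -luring   (needs liburing and kernel >= 5.1) */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>

int main(void)
{
    struct io_uring ring;
    int ret = io_uring_queue_init(8, &ring, 0);            /* 8-entry SQ */
    if (ret < 0) {
        fprintf(stderr, "queue_init: %s\n", strerror(-ret));
        return 1;
    }

    /* Register the buffer once so the kernel pins/maps it up front,
       instead of doing the page-table work on every I/O. */
    static char buf[4096];
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
    io_uring_register_buffers(&ring, &iov, 1);

    int fd = open("/etc/hostname", O_RDONLY);               /* placeholder file */

    /* Queue a read that uses registered buffer index 0, then submit. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read_fixed(sqe, fd, buf, sizeof(buf), 0 /*offset*/, 0 /*buf index*/);
    io_uring_submit(&ring);

    /* Reap the completion; cqe->res is the read() result. */
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("read returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return 0;
}
```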
/cc @tmds
@benaadams Looks similar to Windows RIO?
@omariom In spirit it definitely is, but there are a few key differences:
* `io_uring` is more ad-hoc in this respect and somewhat more dynamic in its nature...
* Can be much less chatty, all the way to completely syscall-less I/O
* Can also read from arbitrary offsets inside a file
That is also one of the options of RIO. It does have a flavour of completion ports, e.g. async with a fast-path sync completion if the data is already ready, rather than having to do a callback.
Extending the registration to all I/O is good, especially with advancements in throughput, e.g. NVMe.
Definitely interesting!
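To make the "fast-path sync completion" comparison a bit more concrete, here is a rough sketch of how that reads with io_uring (my reading of the liburing API, not anything RIO/IOCP-specific): after submitting you can peek the CQ without blocking, and if the result is already there you consume it inline instead of going through a callback. `handle_result` is a hypothetical application handler.

```c
#include <liburing.h>

/* Sketch: submit, then check whether the completion is already available
   (the fast-path sync case), otherwise fall back to waiting.
   handle_result() is a hypothetical application callback. */
void submit_and_reap(struct io_uring *ring, void (*handle_result)(int res))
{
    struct io_uring_cqe *cqe;

    io_uring_submit(ring);

    if (io_uring_peek_cqe(ring, &cqe) != 0) {
        /* Nothing ready yet; a real event loop would go do other work
           and reap later instead of blocking here. */
        io_uring_wait_cqe(ring, &cqe);
    }
    handle_result(cqe->res);
    io_uring_cqe_seen(ring, cqe);
}
```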
In apps consisting of microservices, large connection buffers are less of an issue, and low latency with low jitter is valued more than raw throughput.
So it would be great to have both io_uring and RIO in Kestrel. Kestrel is used by gRPC, which is the most popular transport for communication between microservices.
(I'm the one who was playing with io_uring in Node.js/libuv, mentioned in the OP; sharing some notes here.)
io_uring is more like Windows' "overlapped IO" with IOCP than RIO in terms of usability with files.
> all the way to completely syscall-less I/O
(Referring to kernel-side polling of the submission queue) Note that this requires root, https://github.com/torvalds/linux/commit/3ec482d15cb986bf08b923f9193eeddb3b9ca69f#diff-a196e54ec8b5398427f9df3d2b074478.
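For reference, this is roughly how the kernel-side SQ polling gets requested (a sketch assuming liburing; on the 5.1-era kernels discussed here it fails with EPERM unless you're root, per the commit linked above):

```c
#include <liburing.h>
#include <stdio.h>
#include <string.h>

/* Sketch: set up a ring with kernel-side submission-queue polling
   (IORING_SETUP_SQPOLL), which is what enables the syscall-less path. */
int init_sqpoll_ring(struct io_uring *ring)
{
    struct io_uring_params params;
    memset(&params, 0, sizeof(params));
    params.flags = IORING_SETUP_SQPOLL;
    params.sq_thread_idle = 2000;   /* ms before the kernel thread goes idle */

    int ret = io_uring_queue_init_params(64, ring, &params);
    if (ret < 0)
        fprintf(stderr, "SQPOLL setup failed: %s\n", strerror(-ret));  /* EPERM without root */
    return ret;
}
```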
> RIO's API is somewhat limiting ... in the sense that (IIRC) the completion queues have a finite size...
io_uring still has fixed-size SQs and CQs. It's immediately safe to reuse an SQ slot once io_uring_enter/submit returns (before the kernel is done processing it). There's a tiny bit of info on what happens with full CQs in http://git.kernel.dk/cgit/liburing/commit/?id=76b61ebf1bd17d3a31c3bf2d8236b9bd50d0f9a8 but I'm still uncertain what happens if you submit more events and e.g. never drain the CQ.
> since the sqe lifetime is only that of the actual submission of it, it's possible for the application to drive a higher pending request count than the SQ ring size would indicate. The application must take care not to do so, or it could risk overflowing the CQ ring. By default, the CQ ring is twice the size of the SQ ring. This allows the application some amount of flexibility in managing this aspect, but it doesn't completely remove the need to do so. If the application does violate this restriction, it will be tracked as an overflow condition in the CQ ring. More on that later.
but I can't find the "later" part :). I assume the CQE just gets overwritten.
And here's the answer on CQ overflow: https://twitter.com/axboe/status/1126203058071826432
> CQEs do not get overwritten, cqring.overflow just increments. The app has to be grossly negligent to trigger that, as the CQE ring is twice the SQE ring. If cqring.overflow is ever != 0, the app has failed.
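In practice that seems to mean the application should simply keep the CQ drained. A small sketch of that pattern with liburing's batch helpers (`on_completion` is a hypothetical per-operation handler):

```c
#include <liburing.h>

#define CQ_BATCH 64

/* Sketch: reap every available CQE in batches before submitting more work,
   so the CQ ring (2x the SQ ring by default) never hits the overflow case. */
void drain_completions(struct io_uring *ring,
                       void (*on_completion)(struct io_uring_cqe *cqe))
{
    struct io_uring_cqe *cqes[CQ_BATCH];
    unsigned n;

    while ((n = io_uring_peek_batch_cqe(ring, cqes, CQ_BATCH)) > 0) {
        for (unsigned i = 0; i < n; i++)
            on_completion(cqes[i]);
        io_uring_cq_advance(ring, n);   /* mark the whole batch as seen */
    }
}
```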
Is there some info on how you use this with sockets? In particular how to deal with blocking calls.
Should you add a blocking read/write to io_uring and then check for its completion?
Or do you need to use io_uring for polling? And then when readable/writable add non-blocking reads/writes to it?
Or something else?
To answer my own question: you can use io_uring like epoll in one-shot mode. There is a command to add a poll for an fd, and a command to cancel an ongoing poll.
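A rough sketch of that epoll-one-shot style with liburing (POLL_ADD is one-shot on these kernels, so it has to be re-armed after each completion; handing the data off to the application is left out):

```c
#include <liburing.h>
#include <poll.h>
#include <sys/socket.h>

/* Sketch: arm a one-shot POLL_ADD for a non-blocking socket, and when the
   readiness completion arrives, do the usual non-blocking read and re-arm. */

static void arm_poll(struct io_uring *ring, int sockfd)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_poll_add(sqe, sockfd, POLLIN);
    io_uring_sqe_set_data(sqe, (void *)(long)sockfd);  /* remember the fd */
    io_uring_submit(ring);
}

static void on_readable(struct io_uring *ring, struct io_uring_cqe *cqe,
                        char *buf, size_t len)
{
    int sockfd = (int)(long)io_uring_cqe_get_data(cqe);

    /* The socket reported readable; EAGAIN here just means a spurious wakeup. */
    ssize_t n = recv(sockfd, buf, len, MSG_DONTWAIT);
    if (n > 0) {
        /* ... hand the data to the application ... */
    }

    io_uring_cqe_seen(ring, cqe);
    arm_poll(ring, sockfd);   /* one-shot, so re-arm for the next readiness */
}
```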
When looking into io_uring we also need to consider what operations are privileged, and what kernel resources are needed.
To have wide applicability, it should work in a Kubernetes container deployment.
> CQEs do not get overwritten, cqring.overflow just increments. The app has to be grossly negligent to trigger that, as the CQE ring is twice the SQE ring.

If you're writing to disk, you can control this.
I wonder whether this becomes an issue if you use io_uring for sockets. If you have a lot of polls outstanding for idle connections, activity on those sockets could get you into a CQE overflow.
It does this smarter now and won't drop anything. Since 5.5, I believe.
see commit 1d7bb1d50fb4dc14 ("io_uring: add support for backlogged CQ ring")
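If it matters for CoreCLR, that behaviour can be detected at runtime via the feature flags the kernel reports at ring setup; a sketch (my understanding is that IORING_FEAT_NODROP was added alongside that 5.5 change):

```c
#include <liburing.h>
#include <string.h>

/* Sketch: probe whether the kernel backlogs CQEs instead of dropping them
   (IORING_FEAT_NODROP), so the app knows if it must still guard against
   cqring overflow itself. */
int kernel_backlogs_cqes(void)
{
    struct io_uring ring;
    struct io_uring_params params;
    memset(&params, 0, sizeof(params));

    if (io_uring_queue_init_params(8, &ring, &params) < 0)
        return 0;                               /* io_uring not available */

    int nodrop = (params.features & IORING_FEAT_NODROP) != 0;
    io_uring_queue_exit(&ring);
    return nodrop;
}
```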
Cc: @axboe