I've recently been helping @quinnj to debug his new HTTP.jl package, in particular a new `FIFOBuffer` type that is intended to behave like the IO buffers in Base (#87, #86, #76, #75, #74). The process of trying to be consistent with Base has highlighted a number of IO behaviour inconsistencies. I've hacked up a script that runs through a sequence of IO operations for various types and produces a Markdown table of the results (see below).
The two main issues are with the blocking behaviour of `read()` and `eof()`.

The spec for `eof()` says: _"If the stream is not yet exhausted, this function will block to wait for more data if necessary, and then return false."_

- `eof()` works per spec for `BufferStream` and `TCPSocket`.
- For `Filesystem.File`, `IOStream`, and `PipeBuffer`, `eof()` does not block to wait for more data. Instead it returns `true` without blocking if there is no data currently available to be read.
- For `IOBuffer` it seems that `eof()` just always returns `true`.

The spec for `read(::IO, ::Int)` says: _"[By default] this function will block repeatedly trying to read all requested bytes, until an error or end-of-file occurs."_

- For `BufferStream` and `TCPSocket`, `read()` behaves as specified.
- For the other types, `read()` seems to just return however many bytes are available at the time and does not block.
- For `IOBuffer`, `read()` always returns an empty array.

Other issues:
- For `BufferStream` and `IOStream`, `isreadable()` returns true after `close()` is called.
- For `BufferStream`, `iswritable()` returns true after `close()`.
- For `BufferStream` and `IOStream`, `read()` after `close()` returns empty data, whereas the other types throw an error.
- For `BufferStream` and `TCPSocket`, `read(io, String)` blocks until the stream is closed (this seems consistent with the blocking behaviour of `read(io, nb)`), however the other types return immediately with a string containing however many bytes are available at the time.
- `mark`/`reset` don't work for `BufferStream`: https://github.com/JuliaLang/julia/issues/24465
- There is no interface for `shutdown(fd, SHUT_WR)` or `uv_shutdown()`. If a TCP server waits for a request to be sent before responding, a Julia client would have to call `close()` to signal end of request, but would then be unable to read the response.
- The lack of `shutdown` is related to the inconsistencies with `isreadable()` and `iswritable()`. It seems like maybe there should be a `closeread()` and `closewrite()` that respectively cause `isreadable()` and `iswritable()` to return false.
- Calling `close` on `BufferStream` causes `eof()` = true and `isopen()` = false, but `iswritable()` and `isreadable()` are both still true, and in fact reads and writes continue to work on the closed stream. It seems that `BufferStream` would benefit from a separate `closewrite()` for signalling `eof()` to the reader.

| type | IOBuffer | PipeBuffer | BufferStream | File (IOStream) | Filesystem |
| --- | --- | --- | --- | --- | --- |
| init | io = IOBuffer() | io = PipeBuffer() | io = BufferStream() | echo Hello > file | echo Hello > file |
| | write(io, "Hello") | write(io, "Hello") | write(io, "Hello") | open("file") | Filesystem. open("file") |
| isreadable() | true | true | true | true | MethodError |
| isopen() | true | true | true | true | true |
| eof() | true❗️ | false | false | false | false |
| position | 5 | 0 | MethodError | 0 | 0 |
| read(io, 5) | ""❗️ | "Hello" | "Hello" | "Hello" | "Hello" |
| eof() | true | true❗️ | _blocked_ ✅ | true | true |
| "Again" | write(io, "Again") | write(io, "Again") | write(io, "Again") | echo Again >> file | echo Again >> file |
| position(io) | 10 | 0 | MethodError | 5 | 5 |
| eof(io) | true❗️ | false | false | false | false |
| read(io, 5) | ""❗️ | "Again" | "Again" | "Again" | "Again" |
| eof(io) | true | true❗️ | _blocked_ ✅ | true | true |
| "Again" | write(io, "Again") | write(io, "Again") | write(io, "Again") | echo Again >> file | echo Again >> file |
| eof(io) | true❗️ | false | false | false | false |
| read(io, String) | ""❗️ | "Again" | _blocked_ ✅ | "Again" | "Again" |
| iswritable(io) | true | true | true | false | MethodError |
| isreadable(io) | true | true | true | true | MethodError |
| close(io); isopen(io) | false | false | false | false | false |
| iswritable(io) | false | false | true❗️ | false | MethodError |
| isreadable(io) | false | false | true❗️ | true❗️ | MethodError |
| eof(io) | true | true | true | true | Base.UVError |
| read(io, 5) | Argument Error | Argument Error | ""❗️ | ""❗️ | Base.UVError |
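The ❗️ rows for `IOBuffer` are easy to reproduce in isolation. A minimal sketch, runnable as-is:

```julia
# Minimal reproduction of the eof() difference flagged in the table above:
# both buffers hold unread data, but only PipeBuffer reports it.
a = IOBuffer()
write(a, "Hello")
b = PipeBuffer()
write(b, "Hello")
@show eof(a)  # true (data was just written, yet eof is already true)
@show eof(b)  # false
```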
Note that `TCPSocket` behaves almost the same as `BufferStream` but without the `isopen()` and `isreadable()` issues:
| type | BufferStream | TCPSocket |
| --- | --- | --- |
| init | io = BufferStream() | srv = accept(); io = connect() |
| | write(io, "Hello") | write(srv, "Hello") |
| isreadable() | true | true |
| isopen() | true | true |
| eof() | false | false |
| position | MethodError | MethodError |
| read(io, 5) | "Hello" | "Hello" |
| eof(io) | blocked | blocked |
| close(io); isopen(io) | true ❗️ | false |
| eof(io) | true | true |
| isreadable(io) | true❗️ | false |
| read(io, 5) | "" | "" |
I'm not an IO guru so have nothing concrete to add, but just have to say wow, really nice analysis.
What do you want `eof()` to do for `File` and `IOBuffer`? Block forever, since there's no telling when additional bytes could be written to the file/buffer?
Hi @stevengj,
I'm not sure what the right interface is, but I am sure that it needs to be well specified and unambiguous so that compatible implementations can be dropped in and "just work".
If `read(io, nb)` was made to block for simple files, callers who wanted to read just what is in the file now would have to do `read(myfile, filesize(myfile))`. To be consistent with the current doc, `eof()` for a simple local file should probably be false/block unless the file is unlinked and there are no other open write file handles on the file.
I understand that a very common use case for simple files is to want to read to the end of what is in the file now. However, the less common use case of reading from a file that is occasionally appended to by a third party needs to be handled too.
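For illustration, the `read(myfile, filesize(myfile))` pattern would look something like this (a sketch using a temporary file; it reads only the bytes present at the time of the call, so it would stay safe even under always-blocking `read(io, nb)` semantics):

```julia
# Sketch: read only what is in the file *now*.
path = tempname()               # throwaway demo file
write(path, "Hello")
open(path) do io
    nb = filesize(io)           # bytes currently in the file
    data = read(io, nb)         # nb bytes are already available, so no waiting
    println(String(data))       # prints "Hello"
end
rm(path)
```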
Consider also that the `IOStream` returned by the simple `open()` function is not always a simple file. See the example below of opening a named pipe (`mkfifo`). In this case `eof() = true` seems wrong because a pipe can be expected to have more data at some future time. Likewise, the `read(io, 6)` call should have blocked. If I'm reading a fixed number of bytes from a named pipe it is probably because there is an agreed protocol where I'm expecting the other end to write that many bytes.
| type | File | NamedPipe |
| --- | --- | --- |
| init | echo Hello > file | mkfifo fifo; echo Hello > fifo |
| | open("file") | open("fifo") |
| typeof(io) | IOStream | IOStream |
| isopen() | true | true |
| eof() | false | false |
| read(io, 5) | "Hello" | "Hello" |
| eof(io) | true | true |
| "Again" | echo Again >> file | echo Again > fifo |
| eof(io) | false | false |
| read(io, 6) | "Again" | "Again" |
| eof(io) | true | true |
Maybe the answer is to have a user-selectable blocking/non-blocking mode (`O_NONBLOCK`).
Maybe it would be better to have clearly distinct types for simple-file-like-things and stream-like-things.
Or maybe it's best to enforce blocking everywhere and encourage the use of tasks and/or patterns like `read(myfile, filesize(myfile))`.
Another alternative would be to have a `WouldBlockError` exception that would be thrown by anything that would block when the "I don't want things to block" option has been set (this would allow writing code that acts as if nothing ever blocks, but not have it deadlock in odd ways when there is a problem).
> For IOBuffer it seems that eof() just always returns true.
What does this mean? `eof` of `IOBuffer` may return `false`:

```julia
julia> buf = IOBuffer(b"foobar");

julia> eof(buf)
false
```
Hi @bicycle1885,

I think the difference is that my test script uses the `IOBuffer()` constructor. So, I guess my finding is that for the default `IOBuffer()` constructor, it seems that `eof()` always returns true for the series of function calls in my test scenario, e.g.:
```julia
julia> io = IOBuffer()
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=0, maxsize=Inf, ptr=1, mark=-1)

julia> eof(io)
true

julia> write(io, "Hello")
5

julia> eof(io)
true

julia> write(io, " World!")
7

julia> eof(io)
true

julia> read(io)
0-element Array{UInt8,1}

julia> String(take!(io))
"Hello World!"
```
`write` changes the offset of a buffer, so it is quite natural that a sequence of `write` operations does not change the return value. However, seek operations (e.g. `seekstart`) will change the return value:
```julia
julia> buf = IOBuffer()
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=0, maxsize=Inf, ptr=1, mark=-1)

julia> write(buf, b"foo")
3

julia> eof(buf)
true

julia> seekstart(buf)
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=3, maxsize=Inf, ptr=1, mark=-1)

julia> eof(buf)
false
```
@bicycle1885 I accept that the behaviour of `IOBuffer()` in isolation could seem "quite natural" in a certain context, i.e. if you expect it to behave like a local disk file. However, as things stand the documentation about how each IO type is supposed to behave is not clear.

The point of posting this issue is to highlight behaviour inconsistencies between the various IO types. I realised that I'd kind of figured out how to use the IO types I was familiar with, but that this was largely a matter of trial and error and copying usage patterns from other code. When it came to trying to debug the HTTP.jl package, and coming up against questions of "how should this function behave", it wasn't a simple matter of "read and follow the spec". The task seemed to come down to figuring out which of the many IO variants in Base was most like what we wanted and emulating it as best we could.
Maybe the solution is just to have clearer documentation, or as I said above, "Maybe it would be better to have clearly distinct types for simple-file-like-things and stream-like-things." Maybe there needs to be separate documentation for `eof(::AbstractFile)` and `eof(::AbstractStream)`.
This is a great analysis and we do want to make sure all of this is sane and consistent, but it's a bit too much to bite off for the impending 1.0 release, so based on discussion on triage, we're going to say that I/O blocking behavior is not yet stable and may change in the 1.x series. We will have to be careful to make sure that any changes we make don't break key libraries (e.g. HTTP).
Luckily, HTTP.jl currently uses its own streaming buffer type (`HTTP.FIFOBuffer`), so that should keep it at arm's length from changes in Base until things settle.
Maybe we need something like @oxinabox's https://github.com/oxinabox/InterfaceTesting.jl for IO.
A bit ago I wrote up some notes trying to wrap my brain around the streaming interface in the context of trying to make my SampledSignals.jl stream semantics better match Base. Definitely not as nicely-organized as @samoconnor's work here, but might be useful.
Also relevant is JuliaIO/FileIO.jl#78, where I'd like to support opening up media files in a streaming way, where you end up with a stream of images or audio samples, rather than a stream of bytes.
[edit: premature submission and wrong issue link]
I suspect if we add `WebSocket` to this analysis, it probably would not fare too well 😅
The main difference between the way I designed the SampledSignals streams and Base is that I don't have an `eof` method, and streams will always block on a [read|write] if there's not enough [data|space] available. So reading to the end of a stream looks like:
```julia
const N = 4096 # amount to read each time
while true
    buf = read(stream, N)
    # do stuff with buf
    length(buf) == N || break
end
```
`read!` and `write` both return the number of elements [read|written], which you can check similarly. This works well in the context of a stream that's always flowing until it reaches its final end (like an audio device), but maybe not in the case where you have occasional bits and pieces of unknown size coming in and want to process them in a timely manner (though maybe in that case you just set `N=1`?)
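A rough Base analogue of this short-read pattern is possible today with `readbytes!`, which returns the number of bytes actually read. A sketch using an in-memory stream as a stand-in:

```julia
# Read fixed-size chunks until a short read signals the end of the stream.
function drain(io, chunk)
    buf = Vector{UInt8}(undef, chunk)
    total = 0
    while true
        n = readbytes!(io, buf)      # fills up to length(buf) bytes, returns count
        total += n
        # ... do stuff with view(buf, 1:n) ...
        n == length(buf) || break    # short read => stream exhausted
    end
    return total
end

drain(IOBuffer("some sample data"), 7)  # returns 16
```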
There are three fundamental aspects to a `read*` API call:

- **Allocation** — does the call allocate and return a new buffer, or copy into a buffer provided by the caller?
- **Termination** — what ends the read: a byte count, a delimiter, or `eof()`?
- **Blocking** — does the call wait until the termination condition is met, or return with whatever is available?

A fourth aspect is return type (e.g. `read(io, T)` converts the bytes read to type `T` and `read!(io, array)` fills in a fixed-type array). However, this is largely handled by generic wrapper methods and should not influence the design of the fundamental API too much.
The current API looks like this:

| Function | Allocation | Termination | Blocking |
| --- | --- | --- | --- |
| `unsafe_read(io, p, n)` | copy | `n` bytes | ✅ |
| `read!(io, array)` | copy | `sizeof(array)` bytes | ✅ |
| `read(io, n)` | alloc | `n` bytes | ❌ * |
| `readbytes!(io, b, n)` | copy (with realloc **) | `n` bytes | ❌ * |
| `readuntil(io, delim)` | alloc | delim | ✅ |
| `readavailable(io)` | alloc | 1 byte | ✅ |
| `read(io)` | alloc | `eof()` | ✅ |
| `read(io, String)` | alloc | `eof()` | ✅ |
| `read(io, T)` | alloc | `sizeof(T)` bytes | ✅ |
There are a few gaps and inconsistencies:

- There is no non-blocking version of `unsafe_read`.
- `read!(io, array)` is blocking but `read(io, n)` and `readbytes!(io, b, n)` are not.
- `readavailable` sounds like it should be non-blocking but the spec says to wait for 1 byte.
- \* `read(::IOStream, nb)` and `readbytes!(::IOStream, b, nb)` both wait for `nb` bytes (which contradicts the generic spec). See https://github.com/JuliaLang/julia/issues/17070#issuecomment-423008560
- \*\* `readbytes!(io, b, nb)` sometimes reallocates `b`.

Suggested refinements to better support the various combinations of allocation/termination/blocking:
- Change `readavailable(io)` to be non-blocking.
  - In the typical use pattern (`while !eof(io) process(readavailable(io)) ; end`) it is the job of `eof` to be blocking. It seems odd for `readavailable` to block as well.
- Change `readbytes!(io, b, n)` and `read(io, n)` to be always blocking.
  - i.e. make the generic behaviour match the `all=true` default that the `::IOStream` methods use.
  - The rule becomes: apart from `readavailable`, all `read*` methods are blocking.
- Add `readavailable(io, n)`, `readavailable!(io, b, n)` and `unsafe_readavailable(io, p, n)`.
  - i.e. non-blocking versions of `read(io, n)`, `read!(io, b, n)` and `unsafe_read(io, p, n)`.
  - This would allow the `all=` option of the `::IOStream` methods to be removed.
- Add `wait(io) = (eof(io) && throw(EOFError()); nothing)`:
  - For consistency with `wait(::RawFD)` and `wait(::Channel)`. That `eof(io)` is the way to wait for an IO stream is not immediately obvious.

Another thing: `readbytes!` stands out as an odd name now that `readstring` has been replaced by `read(io, String)`. Why not replace `readbytes!(io, b::AbstractVector{UInt8}, n)` with `read!(io, b::AbstractVector{UInt8}, n)`?
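To make the proposed `readavailable(io, n)` and `wait(io)` concrete, a sketch (not Base API; `readavailable_n` and `wait_io` are placeholder names) of how they could be expressed in terms of existing primitives for in-memory streams:

```julia
# Non-blocking "up to n bytes": take whatever is buffered, capped at n.
readavailable_n(io::IO, n::Integer) = read(io, min(n, bytesavailable(io)))

# The proposed wait(io): return once data is there; throw EOFError at stream end.
wait_io(io::IO) = (eof(io) && throw(EOFError()); nothing)

io = PipeBuffer()
write(io, "Hello")
@show readavailable_n(io, 3)   # 3 of the 5 buffered bytes
@show readavailable_n(io, 99)  # returns immediately with the remaining 2 bytes
```

For a real socket the `min(...)` trick is not enough (data may arrive between the two calls), but it captures the intended contract: never wait, never over-read.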
After the refinements proposed above, we would end up with this:

| Allocation | Termination | Blocking | Function |
| --- | --- | --- | --- |
| alloc | n bytes | ✅ | `read(io, n)`, `read(io, T)` |
| alloc | n bytes | ❌ | `readavailable(io, n)` |
| alloc | eof | ✅ | `read(io)` |
| alloc | eof | ❌ | `readavailable(io)` |
| copy | n bytes | ✅ | `read!(io, b, n)`, `read!(io, array)`, `unsafe_read(io, p, n)` |
| copy | n bytes | ❌ | `readavailable!(io, b, n)`, `unsafe_readavailable(io, p, n)` |
| copy | eof | ✅ | `read!(io, b, typemax(Int))` |
| copy | eof | ❌ | `readavailable!(io, b, typemax(Int))` |
> it's a bit too much to bite off for the impending 1.0 release
@StefanKarpinski where do you see this fitting into the 1.x release plan?
This kind of bug comes up quite frequently: https://github.com/JuliaWeb/MbedTLS.jl/issues/186, where it kind of works but seems a bit glitchy or slow sometimes. The root cause is often that someone wrote some generic code that assumed a read would be non-blocking, but by the time the call makes its way through a few generic `base/io.jl` methods it ends up in an external library read method that is blocking (or that blocks in some situations).
Julia should be a great language for building easily composable pieces of IO machinery, and for building data consumers and producers that work regardless of where data came from (or how it was buffered, encrypted, proxied, cached, etc.). But unless the `Base.IO` API has unambiguous and complete behaviour contracts, it will never work reliably.
I don't know, I'm not really the person to ask. @JeffBezanson, @vtjnash, @Keno, opinions on this?
Awesome work - I think this is a really nice and consistent architecture, and I've also been bitten by the current state of things. I'm doubtful that this would be a 1.x sort of thing though, right? Seems like some of these changes would be breaky enough that they'd need to wait until 2.0.
- Why is `read!(io, array)` listed under "alloc"? Shouldn't that be a "copy" function?
- In `read!(io, b, n)`, do you need the `n`, or do you just pass in a `View` if you want to fill a subset of your array? (I guess it's useful for larger `n`?)
- In `read!(io, b, n)`, what do you do if `n > length(b)`? Do we keep the current behavior of reallocating a larger `b`? I'm guessing yes because otherwise I don't think the `typemax(Int)` versions make sense, but I wanted to make sure. I'd be happy dropping this behavior, never reallocating, and also removing the `n` argument, but perhaps the auto-reallocation comes in handy for some folks.
- Should `unsafe_read` be `unsafe_read!`? (also `unsafe_readavailable!`)
- What's with `read!(io, b, typemax(Int))`? If it's going to be a sentinel value, maybe `-1` is better because it's not a valid length. (Again, this becomes a non-issue if the auto-reallocation behavior is removed.)

> A fourth aspect is return type (e.g. read(io, T) converts the bytes read to type T and read!(io, array) fills in a fixed type array). However, this is largely handled by generic wrapper methods and should not influence the design of the fundamental API too much.
Another asymmetry in the allocating/non-allocating versions is when you want to get a `Vector{T}`, e.g. you can do `read!(io, arr::Vector{Float64})` and it will fill the array, but there's no `read` equivalent. Perhaps there could also be a `read(::IO, ::Type, ::Integer)` method that would return a `Vector`? So in that case it could be defined `read(io::IO, T::Type=UInt8, n::Integer)`.
> In the typical use pattern (`while !eof(io) process(readavailable(io)) ; end`) it is the job of `eof` to be blocking. It seems odd for `readavailable` to block as well.
I agree it's weird for `readavailable` to block if there are 0 bytes available, but TBH it also seems weird for `eof` to block - checking whether a stream has ended seems orthogonal to waiting for data to be available. If anything I'd propose making the `wait(io)` function be the way to wait for data, and have `eof` be nonblocking.
> - Why is read!(io, array) listed under "alloc"? Shouldn't that be a "copy" function?
Yes, corrected above.
> - In read!(io, b, n), do you need the n, or do you just pass in a View if you want to fill a subset of your array?
I think we should keep the `n` (and probably add `copyto!(b, offset, io, n)`).

If the compiler (and GC) becomes smart enough one day to ensure that `View` never causes allocation, then those could become undocumented internal functions and the doc could say "just use a View". But I think it makes sense to define the IO primitives in terms of the simplest possible types.
> - In read!(io, b, n), what do you do if n > length(b)? Do we keep the current behavior of reallocating a larger b?
To me, having a `read!` method that reallocates its argument smells a bit wrong. I would prefer to separate out the auto-sizing by having a special `AbstractArray` type that auto-grows on access, or by just calling `sizehint!(array, bytesavailable(io))` before the read.

I'm not sure that I understand what use case the auto-sizing is supposed to help with.
Consider this scenario:

```julia
buf = Vector{UInt8}(undef, 4096)
while !eof(io)
    readbytes!(io, buf)
    render(my_display, buf)
end
```
I start out reading small chunks of data and achieving low output latency.
This works nicely until my code gets busy doing something else and a few MB end up in the OS input buffers...
Then my next read sees that there is lots of data available and ends up reallocating my buffer to several MB!
Worse, after that every subsequent `readbytes!` call is now blocking to wait for several MB (because the default `nb` is `length(buf)` and `readbytes!` is blocking).
> - Should unsafe_read be unsafe_read!? (also unsafe_readavailable!)
The current `unsafe_read` has no bang. I have no strong opinion on this. The bang convention is not applied very consistently as far as I can see.
> - ... this becomes a non-issue if the auto-reallocation behavior is removed
That would be my preference.
> Perhaps there could also be a read(::IO, ::Type, ::Integer) method that would return a Vector? So in that case it could be defined read(io::IO, T::Type=UInt8, n::Integer).
That sounds pretty sane. I guess the `readavailable` version would read what is available modulo the size of `T`.
> it also seems weird for eof to block - checking whether a stream has ended seems orthogonal to waiting for data to be available. If anything I'd propose making the wait(io) function be the way to wait for data, and have eof be nonblocking.
Yes, I agree. `eof(io)` should be just `!isreadable(io) && bytesavailable(io) == 0`.

...and this brings up the issue of separating open/closed state for the read and write halves of the stream. (`closeread(io)` ensures `isreadable(io) == false`, `closewrite(io)` ensures `iswritable(io) == false`?)
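A toy sketch of the `closeread`/`closewrite` idea (all names hypothetical, not Base API). Here `eof` is keyed to "write half closed and buffer drained", which is how a `closewrite` can signal end-of-stream to the reader, as suggested earlier in the thread:

```julia
# A stream with separately closable read and write halves.
mutable struct HalfClosable
    buf::IOBuffer          # PipeBuffer-style backing store
    readopen::Bool
    writeopen::Bool
end
HalfClosable() = HalfClosable(PipeBuffer(), true, true)

hc_isreadable(io) = io.readopen
hc_iswritable(io) = io.writeopen
hc_closeread(io)  = (io.readopen = false; nothing)
hc_closewrite(io) = (io.writeopen = false; nothing)

hc_write(io, data) = hc_iswritable(io) ? write(io.buf, data) : error("write half closed")
hc_read(io, n)     = hc_isreadable(io) ? read(io.buf, n) : error("read half closed")
hc_eof(io)         = !hc_iswritable(io) && bytesavailable(io.buf) == 0

io = HalfClosable()
hc_write(io, "Hello")
hc_closewrite(io)                # writer signals end of stream
println(String(hc_read(io, 5)))  # prints "Hello": reading still works
println(hc_eof(io))              # prints "true": drained and write half closed
```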
Question: as of 2020-02-11, what is the recommended, general way to read from a stream multiple times into a preallocated array (possibly without extra allocations)? (e.g. if `read` provided a start/end range:)
```julia
buf = Vector{UInt8}(undef, 30)
read!(buf, start=1, end=10)
read!(buf, start=11, end=20)
read!(buf, start=21, end=30)
```
Would slicing with `view` do the job in Julia? In C you would do this with pointer arithmetic:

```c
ssize_t read(int fd, void *buf, size_t count);
size_t fread(void *ptr, size_t size, size_t nmemb, FILE *stream);
```
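For what it's worth, `read!` accepts `AbstractArray` arguments, so a contiguous `view` of a preallocated `Vector{UInt8}` can be filled in place. A sketch with an in-memory stream standing in for a real one:

```julia
# Fill ranges of a preallocated buffer in place using views.
io = IOBuffer(repeat("abcdefghij", 3))   # 30 bytes of sample data
buf = Vector{UInt8}(undef, 30)
read!(io, view(buf, 1:10))               # first 10 bytes
read!(io, view(buf, 11:20))              # next 10
read!(io, view(buf, 21:30))              # last 10
println(String(copy(buf)))               # prints "abcdefghijabcdefghijabcdefghij"
```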