I've recently been helping @quinnj to debug his new HTTP.jl package, in particular a new `FIFOBuffer` type that is intended to behave like the IO buffers in Base (#87, #86, #76, #75, #74). The process of trying to be consistent with Base has highlighted a number of IO behaviour inconsistencies. I've hacked up a script that runs through a sequence of IO operations for various types and produces a Markdown table of the results (see below).
The two main issues are with the blocking behaviour of `read()` and `eof()`.

The spec for `eof()` says: _"If the stream is not yet exhausted, this function will block to wait for more data if necessary, and then return false."_

- `eof()` works per spec for `BufferStream` and `TCPSocket`.
- For `Filesystem.File`, `IOStream`, and `PipeBuffer`, `eof()` does not block to wait for more data. Instead it returns `true` without blocking if there is no data currently available to be read.
- For `IOBuffer` it seems that `eof()` just always returns `true`.

The spec for `read(::IO, ::Int)` says: _"[By default] this function will block repeatedly trying to read all requested bytes, until an error or end-of-file occurs."_

- For `BufferStream` and `TCPSocket`, `read()` behaves as specified.
- For the other types, `read()` seems to just return however many bytes are available at the time and does not block.
- For `IOBuffer`, `read()` always returns an empty array.

Other issues:
- For `BufferStream` and `IOStream`, `isreadable()` returns true after `close()` is called.
- For `BufferStream`, `iswritable()` returns true after `close()`.
- For `BufferStream` and `IOStream`, `read()` after `close()` returns empty data, whereas the other types throw an error.
- For `BufferStream` and `TCPSocket`, `read(io, String)` blocks until the stream is closed (this seems consistent with the blocking behaviour of `read(io, nb)`), however the other types return immediately with a string containing however many bytes are available at the time.
- `mark`/`reset` don't work for `BufferStream`: https://github.com/JuliaLang/julia/issues/24465
- There is no interface for `shutdown(fd, SHUT_WR)` or `uv_shutdown()`. If a TCP server waits for a request to be sent before responding, a Julia client would have to call `close()` to signal end of request, but would then be unable to read the response.
- The lack of `shutdown` is related to the inconsistencies with `isreadable()` and `iswritable()`. It seems like maybe there should be a `closeread()` and `closewrite()` that respectively cause `isreadable()` and `iswritable()` to return false.
- Calling `close` on `BufferStream` causes `eof()` = true and `isopen()` = false, but `iswritable()` and `isreadable()` are both still true, and in fact reads and writes continue to work on the closed stream. It seems that `BufferStream` would benefit from a separate `closewrite()` for signalling `eof()` to the reader.

| type | IOBuffer | PipeBuffer | BufferStream | File (IOStream) | Filesystem |
| --- | --- | --- | --- | --- | --- |
| init | io = IOBuffer() | io = PipeBuffer() | io = BufferStream() | echo Hello > file | echo Hello > file |
| | write(io, "Hello") | write(io, "Hello") | write(io, "Hello") | open("file") | Filesystem. open("file") |
| isreadable() | true | true | true | true | MethodError |
| isopen() | true | true | true | true | true |
| eof() | true❗️ | false | false | false | false |
| position | 5 | 0 | MethodError | 0 | 0 |
| read(io, 5) | ""❗️ | "Hello" | "Hello" | "Hello" | "Hello" |
| eof() | true | true❗️ | _blocked_ ✅ | true | true |
| "Again" | write(io, "Again") | write(io, "Again") | write(io, "Again") | echo Again >> file | echo Again >> file |
| position(io) | 10 | 0 | MethodError | 5 | 5 |
| eof(io) | true❗️ | false | false | false | false |
| read(io, 5) | ""❗️ | "Again" | "Again" | "Again" | "Again" |
| eof(io) | true | true❗️ | _blocked_ ✅ | true | true |
| "Again" | write(io, "Again") | write(io, "Again") | write(io, "Again") | echo Again >> file | echo Again >> file |
| eof(io) | true❗️ | false | false | false | false |
| read(io, String) | ""❗️ | "Again" | _blocked_ ✅ | "Again" | "Again" |
| iswritable(io) | true | true | true | false | MethodError |
| isreadable(io) | true | true | true | true | MethodError |
| close(io); isopen(io) | false | false | false | false | false |
| iswritable(io) | false | false | true❗️ | false | MethodError |
| isreadable(io) | false | false | true❗️ | true❗️ | MethodError |
| eof(io) | true | true | true | true | Base.UVError |
| read(io, 5) | Argument Error | Argument Error | ""❗️ | ""❗️ | Base.UVError |
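The ❗️ rows for `IOBuffer` are easy to reproduce in isolation. A minimal sketch, runnable as-is:

```julia
# Minimal reproduction of the eof() difference flagged in the table above:
# both buffers hold unread data, but only PipeBuffer reports it.
a = IOBuffer()
write(a, "Hello")
b = PipeBuffer()
write(b, "Hello")
@show eof(a)  # true (data was just written, yet eof is already true)
@show eof(b)  # false
```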
Note that `TCPSocket` behaves almost the same as `BufferStream` but without the `isopen()` and `isreadable()` issues:
| type | BufferStream | TCPSocket |
| --- | --- | --- |
| init | io = BufferStream() | srv = accept(); io = connect() |
| | write(io, "Hello") | write(srv, "Hello") |
| isreadable() | true | true |
| isopen() | true | true |
| eof() | false | false |
| position | MethodError | MethodError |
| read(io, 5) | "Hello" | "Hello" |
| eof(io) | blocked | blocked |
| close(io); isopen(io) | true ❗️ | false |
| eof(io) | true | true |
| isreadable(io) | true❗️ | false |
| read(io, 5) | "" | "" |
I'm not an IO guru so have nothing concrete to add, but just have to say wow, really nice analysis.
What do you want `eof()` to do for `File` and `IOBuffer`? Block forever, since there's no telling when additional bytes could be written to the file/buffer?
Hi @stevengj,
I'm not sure what the right interface is, but I am sure that it needs to be well specified and unambiguous so that compatible implementations can be dropped in and "just work".
If `read(io, nb)` was made to block for simple files, callers who wanted to read just what is in the file now would have to do `read(myfile, filesize(myfile))`. To be consistent with the current doc, `eof()` for a simple local file should probably be false/block unless the file is unlinked and there are no other open write file handles on the file.
I understand that a very common use case for simple files is to want to read to the end of what is in the file now. However, the less common use case of reading from a file that is occasionally appended to by a third party needs to be handled too.
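For illustration, the `read(myfile, filesize(myfile))` pattern would look something like this (a sketch using a temporary file; it reads only the bytes present at the time of the call, so it would stay safe even under always-blocking `read(io, nb)` semantics):

```julia
# Sketch: read only what is in the file *now*.
path = tempname()               # throwaway demo file
write(path, "Hello")
open(path) do io
    nb = filesize(io)           # bytes currently in the file
    data = read(io, nb)         # nb bytes are already available, so no waiting
    println(String(data))       # prints "Hello"
end
rm(path)
```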
Consider also that the `IOStream` returned by the simple `open()` function is not always a simple file. See the example below of opening a named pipe (`mkfifo`). In this case `eof() = true` seems wrong because a pipe can be expected to have more data at some future time. Likewise, the `read(io, 6)` call should have blocked. If I'm reading a fixed number of bytes from a named pipe it is probably because there is an agreed protocol where I'm expecting the other end to write that many bytes.
| type | File | NamedPipe |
| --- | --- | --- |
| init | echo Hello > file | mkfifo fifo; echo Hello > fifo |
| | open("file") | open("fifo") |
| typeof(io) | IOStream | IOStream |
| isopen() | true | true |
| eof() | false | false |
| read(io, 5) | "Hello" | "Hello" |
| eof(io) | true | true |
| "Again" | echo Again >> file | echo Again > fifo |
| eof(io) | false | false |
| read(io, 6) | "Again" | "Again" |
| eof(io) | true | true |
Maybe the answer is to have a user-selectable blocking/non-blocking mode (`O_NONBLOCK`).
Maybe it would be better to have clearly distinct types for simple-file-like-things and stream-like-things.
Or maybe it's best to enforce blocking everywhere and encourage the use of tasks and/or patterns like `read(myfile, filesize(myfile))`.
Another alternative would be to have a `WouldBlockError` exception that would be thrown by anything that would block when the "I don't want things to block" option has been set (this would allow writing code that acts as if nothing ever blocks, but not have it deadlock in odd ways when there is a problem).
> For IOBuffer it seems that eof() just always returns true.
What does this mean? `eof` of `IOBuffer` may return `false`:

```julia
julia> buf = IOBuffer(b"foobar");

julia> eof(buf)
false
```
Hi @bicycle1885,

I think the difference is that my test script uses the `IOBuffer()` constructor. So, I guess my finding is that for the default `IOBuffer()` constructor, it seems that `eof()` always returns true for the series of function calls in my test scenario, e.g.:
```julia
julia> io = IOBuffer()
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=0, maxsize=Inf, ptr=1, mark=-1)

julia> eof(io)
true

julia> write(io, "Hello")
5

julia> eof(io)
true

julia> write(io, " World!")
7

julia> eof(io)
true

julia> read(io)
0-element Array{UInt8,1}

julia> String(take!(io))
"Hello World!"
```
`write` changes the offset of a buffer, so it is quite natural that a sequence of `write` operations does not change the return value. However, seek operations (e.g. `seekstart`) will change the return value:
```julia
julia> buf = IOBuffer()
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=0, maxsize=Inf, ptr=1, mark=-1)

julia> write(buf, b"foo")
3

julia> eof(buf)
true

julia> seekstart(buf)
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=3, maxsize=Inf, ptr=1, mark=-1)

julia> eof(buf)
false
```
@bicycle1885 I accept that the behaviour of `IOBuffer()` in isolation could seem "quite natural" in a certain context, i.e. if you expect it to behave like a local disk file. However, as things stand the documentation about how each IO type is supposed to behave is not clear.

The point of posting this issue is to highlight behaviour inconsistencies between the various IO types. I realised that I'd kind of figured out how to use the IO types I was familiar with, but that this was largely a matter of trial and error and copying usage patterns from other code. When it came to trying to debug the HTTP.jl package, and coming up against questions of "how should this function behave", it wasn't a simple matter of "read and follow the spec". The task seemed to come down to figuring out which of the many IO variants in Base was most like what we wanted and emulating it as best we could.
Maybe the solution is just to have clearer documentation, or as I said above, "Maybe it would be better to have clearly distinct types for simple-file-like-things and stream-like-things." Maybe there needs to be separate documentation for `eof(::AbstractFile)` and `eof(::AbstractStream)`.
This is a great analysis and we do want to make sure all of this is sane and consistent, but it's a bit too much to bite off for the impending 1.0 release, so based on discussion on triage, we're going to say that I/O blocking behavior is not yet stable and may change in the 1.x series. We will have to be careful to make sure that any changes we make don't break key libraries (e.g. HTTP).
Luckily, HTTP.jl currently uses its own streaming buffer type (`HTTP.FIFOBuffer`), so that should keep it at arm's length from changes in Base until things settle.
Maybe we need something like @oxinabox's https://github.com/oxinabox/InterfaceTesting.jl for IO.
A bit ago I wrote up some notes trying to wrap my brain around the streaming interface in the context of trying to make my SampledSignals.jl stream semantics better match Base. Definitely not as nicely-organized as @samoconnor's work here, but might be useful.
Also relevant is JuliaIO/FileIO.jl#78, where I'd like to support opening up media files in a streaming way, where you end up with a stream of images or audio samples, rather than a stream of bytes.
[edit: premature submission and wrong issue link]
I suspect if we add `WebSocket` to this analysis, it probably would not fare too well 😅
The main difference between the way I designed the SampledSignals streams and Base is that I don't have an `eof` method, and streams will always block on a [read|write] if there's not enough [data|space] available. So reading to the end of a stream looks like:
```julia
const N = 4096 # amount to read each time
while true
    buf = read(stream, N)
    # do stuff with buf
    length(buf) == N || break
end
```
`read!` and `write` both return the number of elements [read|written], which you can check similarly. This works well in the context of a stream that's always flowing until it reaches its final end (like an audio device), but maybe not in the case where you have occasional bits and pieces of unknown size coming in and want to process them in a timely manner (though maybe in that case you just set `N=1`?)
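A rough Base analogue of this short-read pattern is possible today with `readbytes!`, which returns the number of bytes actually read. A sketch using an in-memory stream as a stand-in:

```julia
# Read fixed-size chunks until a short read signals the end of the stream.
function drain(io, chunk)
    buf = Vector{UInt8}(undef, chunk)
    total = 0
    while true
        n = readbytes!(io, buf)      # fills up to length(buf) bytes, returns count
        total += n
        # ... do stuff with view(buf, 1:n) ...
        n == length(buf) || break    # short read => stream exhausted
    end
    return total
end

drain(IOBuffer("some sample data"), 7)  # returns 16
```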
There are three fundamental aspects to a `read*` API call:

- **Allocation** — does the call allocate and return a new buffer, or copy into a buffer provided by the caller?
- **Termination** — what ends the read: a byte count, a delimiter, or `eof()`?
- **Blocking** — does the call wait until the termination condition is met, or return with whatever is available?

A fourth aspect is return type (e.g. `read(io, T)` converts the bytes read to type `T` and `read!(io, array)` fills in a fixed-type array). However, this is largely handled by generic wrapper methods and should not influence the design of the fundamental API too much.
The current API looks like this:

| Function | Allocation | Termination | Blocking |
| --- | --- | --- | --- |
| `unsafe_read(io, p, n)` | copy | `n` bytes | ✅ |
| `read!(io, array)` | copy | `sizeof(array)` bytes | ✅ |
| `read(io, n)` | alloc | `n` bytes | ❌ * |
| `readbytes!(io, b, n)` | copy (with realloc **) | `n` bytes | ❌ * |
| `readuntil(io, delim)` | alloc | delim | ✅ |
| `readavailable(io)` | alloc | 1 byte | ✅ |
| `read(io)` | alloc | `eof()` | ✅ |
| `read(io, String)` | alloc | `eof()` | ✅ |
| `read(io, T)` | alloc | `sizeof(T)` bytes | ✅ |
There are a few gaps and inconsistencies:

- There is no non-blocking version of `unsafe_read`.
- `read!(io, array)` is blocking but `read(io, n)` and `readbytes!(io, b, n)` are not.
- `readavailable` sounds like it should be non-blocking but the spec says to wait for 1 byte.
- \* `read(::IOStream, nb)` and `readbytes!(::IOStream, b, nb)` both wait for `nb` bytes (which contradicts the generic spec). See https://github.com/JuliaLang/julia/issues/17070#issuecomment-423008560
- \*\* `readbytes!(io, b, nb)` sometimes reallocates `b`.

Suggested refinements to better support the various combinations of allocation/termination/blocking:
- Change `readavailable(io)` to be non-blocking.
  - In the typical use pattern (`while !eof(io) process(readavailable(io)) ; end`) it is the job of `eof` to be blocking. It seems odd for `readavailable` to block as well.
- Change `readbytes!(io, b, n)` and `read(io, n)` to be always blocking.
  - i.e. make the generic behaviour match the `all=true` default that the `::IOStream` methods use.
  - The rule becomes: apart from `readavailable`, all `read*` methods are blocking.
- Add `readavailable(io, n)`, `readavailable!(io, b, n)` and `unsafe_readavailable(io, p, n)`.
  - i.e. non-blocking versions of `read(io, n)`, `read!(io, b, n)` and `unsafe_read(io, p, n)`.
  - This would allow the `all=` option of the `::IOStream` methods to be removed.
- Add `wait(io) = (eof(io) && throw(EOFError()); nothing)`:
  - For consistency with `wait(::RawFD)` and `wait(::Channel)`. That `eof(io)` is the way to wait for an IO stream is not immediately obvious.

Another thing: `readbytes!` stands out as an odd name now that `readstring` has been replaced by `read(io, String)`. Why not replace `readbytes!(io, b::AbstractVector{UInt8}, n)` with `read!(io, b::AbstractVector{UInt8}, n)`?
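To make the proposed `readavailable(io, n)` and `wait(io)` concrete, a sketch (not Base API; `readavailable_n` and `wait_io` are placeholder names) of how they could be expressed in terms of existing primitives for in-memory streams:

```julia
# Non-blocking "up to n bytes": take whatever is buffered, capped at n.
readavailable_n(io::IO, n::Integer) = read(io, min(n, bytesavailable(io)))

# The proposed wait(io): return once data is there; throw EOFError at stream end.
wait_io(io::IO) = (eof(io) && throw(EOFError()); nothing)

io = PipeBuffer()
write(io, "Hello")
@show readavailable_n(io, 3)   # 3 of the 5 buffered bytes
@show readavailable_n(io, 99)  # returns immediately with the remaining 2 bytes
```

For a real socket the `min(...)` trick is not enough (data may arrive between the two calls), but it captures the intended contract: never wait, never over-read.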
After the refinements proposed above, we would end up with this:

| Allocation | Termination | Blocking | Function |
| --- | --- | --- | --- |
| alloc | n bytes | ✅ | `read(io, n)`, `read(io, T)` |
| alloc | n bytes | ❌ | `readavailable(io, n)` |
| alloc | eof | ✅ | `read(io)` |
| alloc | eof | ❌ | `readavailable(io)` |
| copy | n bytes | ✅ | `read!(io, b, n)`, `read!(io, array)`, `unsafe_read(io, p, n)` |
| copy | n bytes | ❌ | `readavailable!(io, b, n)`, `unsafe_readavailable(io, p, n)` |
| copy | eof | ✅ | `read!(io, b, typemax(Int))` |
| copy | eof | ❌ | `readavailable!(io, b, typemax(Int))` |
> it's a bit too much to bite off for the impending 1.0 release
@StefanKarpinski where do you see this fitting into the 1.x release plan?
This kind of bug comes up quite frequently: https://github.com/JuliaWeb/MbedTLS.jl/issues/186, where it kind of works but seems a bit glitchy or slow sometimes. The root cause is often that someone wrote some generic code that assumed a read would be non-blocking, but by the time the call makes its way through a few generic `base/io.jl` methods it ends up in an external library read method that is blocking (or that blocks in some situations).
Julia should be a great language for building easily composable pieces of IO machinery, and for building data consumers and producers that work regardless of where data came from (or how it was buffered, encrypted, proxied, cached, etc.). But unless the `Base.IO` API has unambiguous and complete behaviour contracts, it will never work reliably.
I don't know, I'm not really the person to ask. @JeffBezanson, @vtjnash, @Keno, opinions on this?
Awesome work - I think this is a really nice and consistent architecture, and I've also been bitten by the current state of things. I'm doubtful that this would be a 1.x sort of thing though, right? Seems like some of these changes would be breaky enough that they'd need to wait until 2.0.
- Why is `read!(io, array)` listed under "alloc"? Shouldn't that be a "copy" function?
- In `read!(io, b, n)`, do you need the `n`, or do you just pass in a `View` if you want to fill a subset of your array? (I guess it's useful for larger `n`?)
- In `read!(io, b, n)`, what do you do if `n > length(b)`? Do we keep the current behavior of reallocating a larger `b`? I'm guessing yes because otherwise I don't think the `typemax(Int)` versions make sense, but I wanted to make sure. I'd be happy dropping this behavior, never reallocating, and also removing the `n` argument, but perhaps the auto-reallocation comes in handy for some folks.
- Should `unsafe_read` be `unsafe_read!`? (also `unsafe_readavailable!`)
- What's with `read!(io, b, typemax(Int))`? If it's going to be a sentinel value, maybe `-1` is better because it's not a valid length. (Again, this becomes a non-issue if the auto-reallocation behavior is removed.)

> A fourth aspect is return type (e.g. read(io, T) converts the bytes read to type T and read!(io, array) fills in a fixed type array). However, this is largely handled by generic wrapper methods and should not influence the design of the fundamental API too much.
Another asymmetry in the allocating/non-allocating versions is when you want to get a `Vector{T}`, e.g. you can do `read!(io, arr::Vector{Float64})` and it will fill the array, but there's no `read` equivalent. Perhaps there could also be a `read(::IO, ::Type, ::Integer)` method that would return a `Vector`? So in that case it could be defined `read(io::IO, T::Type=UInt8, n::Integer)`.
> In the typical use pattern (`while !eof(io) process(readavailable(io)) ; end`) it is the job of `eof` to be blocking. It seems odd for `readavailable` to block as well.
I agree it's weird for `readavailable` to block if there are 0 bytes available, but TBH it also seems weird for `eof` to block - checking whether a stream has ended seems orthogonal to waiting for data to be available. If anything I'd propose making the `wait(io)` function be the way to wait for data, and have `eof` be nonblocking.
> - Why is read!(io, array) listed under "alloc"? Shouldn't that be a "copy" function?
Yes, corrected above.
> - In read!(io, b, n), do you need the n, or do you just pass in a View if you want to fill a subset of your array?
I think we should keep the `n` (and probably add `copyto!(b, offset, io, n)`).

If the compiler (and GC) becomes smart enough one day to ensure that `View` never causes allocation, then those could become undocumented internal functions and the doc could say "just use a View". But I think it makes sense to define the IO primitives in terms of the simplest possible types.
> - In read!(io, b, n), what do you do if n > length(b)? Do we keep the current behavior of reallocating a larger b?
To me, having a `read!` method that reallocates its argument smells a bit wrong. I would prefer to separate out the auto-sizing by having a special `AbstractArray` type that auto-grows on access, or by just calling `sizehint!(array, bytesavailable(io))` before the read.

I'm not sure that I understand what use case the auto-sizing is supposed to help with.
Consider this scenario:

```julia
buf = Vector{UInt8}(undef, 4096)
while !eof(io)
    readbytes!(io, buf)
    render(my_display, buf)
end
```
I start out reading small chunks of data and achieving low output latency.
This works nicely until my code gets busy doing something else and a few MB end up in the OS input buffers...
Then my next read sees that there is lots of data available and ends up reallocating my buffer to several MB!
Worse, after that every subsequent `readbytes!` call is now blocking to wait for several MB (because the default `nb` is `length(buf)` and `readbytes!` is blocking).
> - Should unsafe_read be unsafe_read!? (also unsafe_readavailable!)
The current `unsafe_read` has no bang. I have no strong opinion on this. The bang convention is not applied very consistently as far as I can see.
> - ... this becomes a non-issue if the auto-reallocation behavior is removed
That would be my preference.
> Perhaps there could also be a read(::IO, ::Type, ::Integer) method that would return a Vector? So in that case it could be defined read(io::IO, T::Type=UInt8, n::Integer).
That sounds pretty sane. I guess the `readavailable` version would read what is available modulo the size of `T`.
> it also seems weird for eof to block - checking whether a stream has ended seems orthogonal to waiting for data to be available. If anything I'd propose making the wait(io) function be the way to wait for data, and have eof be nonblocking.
Yes, I agree. `eof(io)` should be just `!isreadable(io) && bytesavailable(io) == 0`.

...and this brings up the issue of separating open/closed state for the read and write halves of the stream. (`closeread(io)` ensures `isreadable(io) == false`, `closewrite(io)` ensures `iswritable(io) == false`?)
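A toy sketch of the `closeread`/`closewrite` idea (all names hypothetical, not Base API). Here `eof` is keyed to "write half closed and buffer drained", which is how a `closewrite` can signal end-of-stream to the reader, as suggested earlier in the thread:

```julia
# A stream with separately closable read and write halves.
mutable struct HalfClosable
    buf::IOBuffer          # PipeBuffer-style backing store
    readopen::Bool
    writeopen::Bool
end
HalfClosable() = HalfClosable(PipeBuffer(), true, true)

hc_isreadable(io) = io.readopen
hc_iswritable(io) = io.writeopen
hc_closeread(io)  = (io.readopen = false; nothing)
hc_closewrite(io) = (io.writeopen = false; nothing)

hc_write(io, data) = hc_iswritable(io) ? write(io.buf, data) : error("write half closed")
hc_read(io, n)     = hc_isreadable(io) ? read(io.buf, n) : error("read half closed")
hc_eof(io)         = !hc_iswritable(io) && bytesavailable(io.buf) == 0

io = HalfClosable()
hc_write(io, "Hello")
hc_closewrite(io)                # writer signals end of stream
println(String(hc_read(io, 5)))  # prints "Hello": reading still works
println(hc_eof(io))              # prints "true": drained and write half closed
```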
Question: as of 2020-02-11, what is the recommended, general way to read from a stream multiple times into a preallocated array (possibly without extra allocations)? (e.g. if `read` provided a start/end range:)
```julia
buf = Vector{UInt8}(undef, 30)
read!(buf, start=1, end=10)
read!(buf, start=11, end=20)
read!(buf, start=21, end=30)
```
Would slicing with `view` do the job in Julia? In C you would do this with pointer arithmetic:

```c
ssize_t read(int fd, void *buf, size_t count);
size_t fread(void *ptr, size_t size, size_t nmemb, FILE *stream);
```
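For what it's worth, `read!` accepts `AbstractArray` arguments, so a contiguous `view` of a preallocated `Vector{UInt8}` can be filled in place. A sketch with an in-memory stream standing in for a real one:

```julia
# Fill ranges of a preallocated buffer in place using views.
io = IOBuffer(repeat("abcdefghij", 3))   # 30 bytes of sample data
buf = Vector{UInt8}(undef, 30)
read!(io, view(buf, 1:10))               # first 10 bytes
read!(io, view(buf, 11:20))              # next 10
read!(io, view(buf, 21:30))              # last 10
println(String(copy(buf)))               # prints "abcdefghijabcdefghijabcdefghij"
```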