Julia: Julia 1.3 order of magnitude slowdown in function

Created on 25 Dec 2019  Â·  33Comments  Â·  Source: JuliaLang/julia

I'm noticing an order-of-magnitude slowdown in a custom function to parse "sample list" files (one ASCII value per line) to Float32 arrays. Any ideas what could be causing this?

Offending function here: https://github.com/jpjones76/SeisIO.jl/blob/master/src/Utils/Parsing/streams.jl

Affects my ASCII readers for SLIST and Lennartz SLIST files.

Before 1.3.0, X[i] = stream_float(io, 0x00) was 2-3x faster than v = readline(io); X[i] = parse(Float32, v). Now stream_float is 4x slower than readline + parse. I actually need the old speeds if possible, so any alternate syntax suggestions would be welcomed.

MWE

using Pkg; Pkg.add("BenchmarkTools"); Pkg.add("SeisIO") 
run(`wget https://github.com/jpjones76/SeisIO-TestData/raw/master/SampleFiles/ASCII/0215162000.c00`)
using BenchmarkTools
import SeisIO:stream_float
io = open("0215162000.c00", "r")
nx = 225000
X = zeros(Float32, nx) 
@benchmark (seekstart($io); readline($io); @inbounds for i in 1:$nx; $X[i] = stream_float($io, 0x00); end)

Example outputs follow. I benchmarked this on a Dell laptop running Ubuntu 18.04 (bionic), kernel 5.0.0-37-generic, CPU Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz, 16 GB RAM.

Julia 1.0.5

BenchmarkTools.Trial: 
  memory estimate:  272 bytes
  allocs estimate:  2
  --------------
  minimum time:     25.533 ms (0.00% GC)
  median time:      25.701 ms (0.00% GC)
  mean time:        25.867 ms (0.00% GC)
  maximum time:     30.592 ms (0.00% GC)
  --------------
  samples:          194
  evals/sample:     1

Julia 1.2.0:
```BenchmarkTools.Trial:
memory estimate: 272 bytes
allocs estimate: 2


minimum time: 24.738 ms (0.00% GC)
median time: 24.850 ms (0.00% GC)
mean time: 24.910 ms (0.00% GC)
maximum time: 27.172 ms (0.00% GC)


samples: 201
evals/sample: 1

Julia 1.3.0:

BenchmarkTools.Trial:
memory estimate: 272 bytes
allocs estimate: 2


minimum time: 236.738 ms (0.00% GC)
median time: 237.059 ms (0.00% GC)
mean time: 237.833 ms (0.00% GC)
maximum time: 245.997 ms (0.00% GC)


samples: 22
evals/sample: 1
```

O performance regression

Most helpful comment

@jpjones76, having quite a bit of low-level IO parsing work myself, I'm happy to talk about strategies here (I'm the primary maintainer of Parsers.jl, CSV.jl, JSON3.jl, etc.).

In particular, I've found that to maximize parsing performance, the absolute fastest is to mmap the input file (or Base.read into a Vector{UInt8}), and read byte-by-byte, maintaining your own position variable and manually checking that position < len(input) after incrementing the position. Just dropping down to using your own Vector{UInt8} for parsing has better memory cache-line use, and avoids a whole host of extra checks that occur in simple operations for generic IO objects like position(io), read(io, UInt8), and even eof(io).

Those are just a few ideas/tactics I use to achieve anywhere from 2x to 50x performance gains in Parsers.jl vs. Base.

Happy to discuss more in-depth on the julialang slack (I'm pretty active there in the #data channel) or even do a google hangout if that would be helpful to walk through code, ideas, or strategize how to get even better performance for your code.

All 33 comments

read is now thread safe which requires acquiring a lock so reading things byte by byte has slowed down. You could perhaps try https://github.com/BioJulia/BufferedStreams.jl. I do think it would be good to have a goto solution for this in Base (perhaps there already is and I just don't know about it).

Are you sure that's the problem? My other readers have very little slowdown by comparison; even my other ASCII readers are slowed no more than a factor of two.

Does unsafe_read avoid this slowdown?

Is there any startup flag or other option that can prevent this behavior? Seems like this should be very impactful if it's causing that much slowdown on every file read in every package.

BufferedStreams.jl is not a solution for large files as it doubles their memory use. My colleagues are working with GB size files for things like ambient noise correlations. If I try to buffer the entire file, it eats up all their memory and hangs Linux forcing a hard reboot.

No I am not sure. I just know it has been a slowdown in codes that look similarly. You would have to test it.

BufferedStreams.jl is not a solution for large files as it doubles their memory use. My colleagues are working with GB size files for things like ambient noise correlations. If I try to buffer the entire file, it eats up all their memory and hangs Linux forcing a hard reboot.

Ok but why would you buffer the entire file?? The default in BufferedStreams is 128 KB I think.

I don't think thread safety alone is responsible and would rather not debate the relative merits of using a third-party package to compensate for perceived slowdown in Base when neither of us is certain that said package is the only workaround.

For sake of due diligence, I re-ran my file-read benchmarks from the publication (https://github.com/jpjones76/SeisIO_paper) in Julia 1.0.5 and Julia 1.3.0. My GeoCSV file-format reader uses similar low-level ASCII parsing and is slowed by ~2.1x, but SLIST slowdown is 8.2x. If thread safety was the only problem, they'd be similar. They're off by a factor of four.

I don't think thread safety alone is responsible and would rather not debate the relative merits of using a third-party package to compensate for perceived slowdown in Base when neither of us is certain that said package is the only workaround.

The point is to introduce a buffered stream to actually check and see if this is a dup of https://github.com/JuliaLang/julia/issues/33335 or https://github.com/dancasimiro/WAV.jl/issues/52#issuecomment-527745083. If a buffered stream doesn't restore performance then great, one less thing to check.

What you also can do is to profile the code (https://docs.julialang.org/en/v1/manual/profile/) in the two different versions and see if there is anything that sticks out when comparing the traces.

From what I can see in profiling, everything points to locking and unlocking in eof, skip, read.

I'm still looking into it, but quite frankly, these changes where every low-level read operation is wrapped in a lock macro are ridiculous. They also invalidate the biggest reason for my most recent submitted manuscript, and the three years of work I put into SeisIO.jl before it.

The main point of my publication, and in fact, one of the main points of spending the last three years working as hard as I have on SeisIO.jl, is that Julia allowed me to create the fastest seismic data parsers on the planet. My code forms the basis of ongoing research at half a dozen institutions including Harvard, the US Geological Survey, and CalTech. We can run code in ten minutes that used to take days.

This change utterly trashes my code...and, by extension, our work...and their research.

You've made Julia parsing slower than Python.

You've made Julia parsing slower than R -- slower than _native_ R, let alone those trash wrappers to undocumented C++ that third-year undergrads like to pretend are R packages.

You're even slower than Matlab and Octave, aren't you?

There is no sugarcoating this: the 1.3 approach to thread-safe file I/O may be the worst coding decision I have ever seen.

Is there no adult in the room when development decisions get made? Did it not occur to anyone that that there _might_ still exist scientific data formats that need low-level read functions for structures _other than_ long arrays?

Was it _that_ difficult to make the lock checks optional?

I'm pretty sure I know the answer to that last question, because you sure seem to have no trouble wrapping parts of your low-level functions in a lock macro -- readbytes_all! in iostream.jl has a good example, if @lock_nofail can wrap a control loop adequately it should damn well be able to wrap the entire function. So why not instruct users to do that?

...now you tell me that the solution is to either (a) add a third-party package, to achieve worse performance than Base had six weeks ago, or (b) rewrite every single low-level read operation to use C wrappers?

How is that consistent with the language's design goals, again? Remind me about ease of use, in particular. I'm having trouble remembering your exact wording at this late hour.

...now you tell me that the solution is to either (a) add a third-party package, to achieve something that was part of Base six weeks ago, or (b) rewrite every single low-level read operation to use a C wrapper?

Just to be clear, the suggestion about BufferedStreams.jl was merely a way to check if the new thread-safe IO was the source of the regression since it wasn't clear from the original issue (no MWE or profiling). It wasn't meant to be a "solution" to the issue.

@jpjones76, you are in violation of the community standards. Please moderate your anger in future posts.

I think there are good reasons why I'm having a hard time being constructive about this issue. What am I supposed to tell my colleagues?

If low-level read commands are the cause, it's going to take weeks to recode in a way that restores performance to 1.2 levels.

One ASCII reader is not a huge deal in and of itself. However, this change triples the time required to parse SEED, the worldwide seismic data standard (https://ds.iris.edu/ds/nodes/dmc/data/formats/seed/ , http://www.fdsn.org/pdf/SEEDManual_V2.4.pdf). _That_ is huge. Our data sets are TB to tens of TB each experiment. The low-level structure of SEED comprises variable-length packets whose sizes aren't easily guessed, with no byte index. A simple workaround like reading to an intermediary buffer is unlikely to help.

I think there are good reasons why I'm having a hard time being constructive about this issue. What am I supposed to tell my colleagues?

To use Julia 1.2 until folks have had a chance to come back from Christmas day and take a look at this bug report?

We all understand being distressed by a change that affects your code, but that does not give you license to unload a bunch of personal attacks. So yes, I do expect you to be constructive about this issue. You were doing great at the beginning, keep to that tone and we'll all be much happier.

@jpjones76 while your frustration is understandable, the way you have chosen to express it will only make it difficult for people to engage or to help. Perhaps you can help isolate the issue and be patient while people are on vacation.

I request you to revise your comments to be helpful, since perhaps you made them at a later hour, you may not want to leave them as they are for the future.

I was delighted to read about your application and collaboration but it quickly became unreadable.

We faced the following difficult decision: either (1) cause data corruption when threads are added to code that does file I/O, or (2) slow down small I/O operations. We chose the latter because reliability is important, and because it is rare to do byte-at-a-time I/O (especially when performance matters).

Particularly at large scale, it is often impossible for one system to meet everybody's needs. That is part of the point of Julia --- you are not stuck with whatever is in the base system. If using BufferedStreams.jl fixes this, what's the problem? In fact I suspect there is room for even further performance gains over what you achieved before.

(b) rewrite every single low-level read operation to use C wrappers?

I certainly wouldn't recommend that, since libc's buffered I/O also uses locking and is still significantly slower than ours. But there are other good options. For example if necessary, you can copy the code from iostream.jl from julia 1.2 into your package. A custom buffered I/O type is also worth exploring; you might get performance gains by dropping some of the generality of existing I/O packages.

I do think it would be good to add an option to turn off locking, though I'm not sure what the API should be. Unfortunately we can't add it until v1.5 though.

@jpjones76, having quite a bit of low-level IO parsing work myself, I'm happy to talk about strategies here (I'm the primary maintainer of Parsers.jl, CSV.jl, JSON3.jl, etc.).

In particular, I've found that to maximize parsing performance, the absolute fastest is to mmap the input file (or Base.read into a Vector{UInt8}), and read byte-by-byte, maintaining your own position variable and manually checking that position < len(input) after incrementing the position. Just dropping down to using your own Vector{UInt8} for parsing has better memory cache-line use, and avoids a whole host of extra checks that occur in simple operations for generic IO objects like position(io), read(io, UInt8), and even eof(io).

Those are just a few ideas/tactics I use to achieve anywhere from 2x to 50x performance gains in Parsers.jl vs. Base.

Happy to discuss more in-depth on the julialang slack (I'm pretty active there in the #data channel) or even do a google hangout if that would be helpful to walk through code, ideas, or strategize how to get even better performance for your code.

@JeffBezanson said:

I do think it would be good to add an option to turn off locking, though I'm not sure what the API should be.

I do wonder what the use-case would be?

It seems to me that the only safe situation is where the user controls every line of code they use and knows no threaded IO happens (possible but rare), or the user examines every package and library they use, and reexamines them on upgrades, and ensures that no threaded IO happens (possible but even rarer).

I'm not saying not to add the feature, but do document it with a big red "here be (really sneaky mean) dragons".

What about

struct IOSingleton
    #= fields needed for I/O =#
    threadid::Int
end

function write(s::IOSingleton, b::UInt8)
    iswritable(s) || throw(ArgumentError("write failed, IOStream is not writeable"))
    Threads.threadid() == s.threadid || error("can only be called from thread ", s.threadid, ", got ", Threads.threadid())
    ccall(:ios_putc, Cint, (Cint, Ptr{Cvoid}), b, s.ios)
end

I think I remember trying something like that, and even just checking the thread id had almost as much slowdown (presumably due to the extra load, which is significant when done per-byte).

Seems like inlining + hoisting should solve that for the case where the byte-by-byte IO is being done from an individual function.

@elextr, I'm so glad that you asked about use cases. I do apologize for my tone last week addressing @KristofferC , but I must ask, here: how much scientific data have you worked with, personally? Which disciplines?

In other words, what are the use cases for which thread-safe reads are preferred? Am I correct in thinking that the intended change was primarily to benefit parallel reads of large files?

If so, what percentage of the scientific data that you've worked with uses a file format where this would be a benefit?

I'll discuss the general and specific use cases that the Julia seismology community faces below.

General use case: Seismology

In geophysics, speaking from 22 years of working with seismic data, my colleagues at IRIS can verify that >99% of extant seismic data are divided into short-record file formats, for which it's faster to read without parallel.

In all common geophysical file formats, a descriptive header with many scalar variables must be read before data can be used. Parallel read offers no advantage for file headers. Parallel read may be an advantage when reading a long, contiguous time series from file, but how often are time series recorded in long, contiguous records?

I have a rough sense of the answer to that. I'm not sure you want my guess. Even without a discussion of transmission breaks and data gaps, many formats allow maximum record lengths <32 kB. Thus, the case for single-worker reads in geophysics should be absolutely clear.

How common a problem?

Virtually all seismic data uses short-record file formats, totaling hundreds of PB to tens of EB of data archived publicly and privately over 40 years. A few example data formats are described in my SeisIO documentation, but there are others.

Does any seismic data benefit from these changes?

I know of only one seismic data format where (thread-safe) parallel reads offers a performance advantage: ASDF by Krischer et al. (2016). This is not the ASDF in astrophysics. Seismic ASDF is used by only two research groups; AFAIK, neither operates a seismic network in such a way that monitoring records use it for archival. Thus the relative fraction of data whose reads would be improved by 1.3 could be called "0.0f0", i.e., less than 1/10^7.

Specific use case: SEED

SEED is the worldwide seismic data standard. Files are divided into variable-length records (see pages 16-17 of the linked PDF), wherein each record comprises a fixed-length header (48 bytes), a variable number of blockettes (variable length), and a variable amount of data (variable length).

The problem should already be clear, but here are details, as I hope to prevent any argument about the intractability of reading SEED archives in parallel.

SEED read and buffer

One cannot merely pre-buffer an entire record. Instead, a conceptual schematic of an optimized read process looks like this:
[read 48 bytes] --> [extract # of blockettes, N] --> for each blockette: [get blockette size K in bytes (*); buffer blockette] --> [get length of data] --> [buffer data] --> [process]

The problem

At (*) lies the issue. Blockette size is stored at the start of the blockette itself. Some blockette types are variable-length. Thus, 2-3 reads are necessary to buffer (or skip) each blockette: blockette type, blockette length K, and K-7 bytes to buffer the blockette itself. This must be done for _each_ blockette -- up to 256 per record. So each record can require up to 770 read() calls, _in addition to_ buffering: 1 for the 48-byte fixed header, 768 for the blockettes, 1 for the data.

Workaround results

Switching to BufferedStreams.jl has been benchmarked by our group and shows 50% slowdown from 1.2 read speeds for SEED. This is unacceptable as the data to be read and processed are hundreds of TB. With a slowdown worse than ~30%, it's faster to use the SEED C library; but this also adds more memory overhead than my Julia-language SEED reader. Moreover, we cannot believe that developers would endorse a potential solution that undermines the use of Julia as a language, particularly considering that my reader was 25% faster than the C library until November 26th.

SEED conclusion

In this file format, the order-of-magnitude slowdown introduced by locking and unlocking is not tenable.

Reminder: SEED popularity

This "use case" represents every byte of seismic data available from every FDSN seismic observatory, which, in turn, represents the majority of univariate geophysical data that are available to the public.

I hope this now hints at the magnitude of the problem (no pun intended). In seismology, virtually all data get recorded in formats where parallel read offers no advantage whatsoever.

Just as a fun experiment I did an attempt with the Mmap strategy that @quinnj outlined in https://github.com/JuliaLang/julia/issues/34195#issuecomment-569343440 with something like:

using Mmap
mutable struct MmapReader <: IO
    data::Vector{UInt8}
    position::Int
end
MmapReader(file::String) = MmapReader(Mmap.mmap(file), 1)

Base.eof(io::MmapReader) = io.position > length(io.data)
Base.position(io::MmapReader) = io.position
function Base.read(io::MmapReader, ::Type{UInt8})
    # @inbounds below is not safe in general
    v = @inbounds io.data[io.position]
    io.position += 1
    return v
end
Base.seekstart(io::MmapReader) = io.position = 1
Base.skip(io::MmapReader, i::Int) = io.position += i

and then ran the benchmarks as

using BenchmarkTools
io = MmapReader("0215162000.c00")
nx = 225000
X = zeros(Float32, nx) 
@btime (seekstart($io); @inbounds for i in 1:$nx; $X[i] = stream_float($io, 0x00); end)

with the result

  7.522 ms (0 allocations: 0 bytes)

(note that I removed the readline call and the first line of the file because it wasn't relevant for the experiment).

So it might be a strategy worth exploring if performance is really important.

Just as a fun experiment I did an attempt with the Mmap strategy that @quinnj outlined in #34195 (comment) with something like:

using Mmap
mutable struct MmapReader <: IO
    data::Vector{UInt8}
    position::Int
end
MmapReader(file::String) = MmapReader(Mmap.mmap(file), 1)

Base.eof(io::MmapReader) = io.position > length(io.data)
Base.position(io::MmapReader) = io.position
function Base.read(io::MmapReader, ::Type{UInt8})
    # @inbounds below is not safe in general
    v = @inbounds io.data[io.position]
    io.position += 1
    return v
end
Base.seekstart(io::MmapReader) = io.position = 1
Base.skip(io::MmapReader, i::Int) = io.position += i

and then ran the benchmarks as

using BenchmarkTools
io = MmapReader("0215162000.c00")
nx = 225000
X = zeros(Float32, nx) 
@btime (seekstart($io); @inbounds for i in 1:$nx; $X[i] = stream_float($io, 0x00); end)

with the result

  7.522 ms (0 allocations: 0 bytes)

(note that I removed the readline call and the first line of the file because it wasn't relevant for the experiment).

So it might be a strategy worth exploring if performance is really important.

That's certainly good performance, but I hope it's clear that this wasn't the data format I was talking about in my use case for @elextr at all.

SLIST and its variants are ASCII formats used as safe or fallback options. SEED is the worldwide seismic data standard and is generally not ASCII. They aren't equivalent.

This might be a workaround for the specific format mentioned in the original issue, and I'll happily test it as such, but that doesn't make it a general solution.

@quinnj's and @KristofferC's point is that anything you can do with an IOStream, you can do with a mmapped stream too. With better performance and circumventing the very issue that is worrying you.

@jpjones76 I'm afraid you may have misinterpreted the point of my question, it wasn't to elicit your use-case but to urge careful consideration of adding a general feature (turning off IO locking) of a kind that my experience has shown to be more often a trap than a use.

Unless @JeffBezanson was suggesting that there is a performant way to turn off locking per-stream, then, as I noted, when turning locking off the user needs to ensure no packages, or Julia library code they use does any threaded IO, since the setting will affect all IO not just the specific IO they intended to improve the performance for.

As a hypothetical example, if the AI package you decide to use to analyse your seismic data is multi-threaded, and does IO, then it is likely to fail in mysterious ways if IO locking is turned off to improve your parser.

As to your specific use-case, without having analysed your specific format, even for a binary format memory mapped IO is likely to provide better performance than using low level IO calls, and I have had experience of better performance even if non-mmapable streams are simply fully loaded into memory first. Probably for some of the reasons @quinnj alluded to. So I would recommend not rejecting such approaches without benchmarking them yourself as @KristofferC did for one of your formats.

@quinnj's and @KristofferC's point is that anything you can do with an IOStream, you can do with a mmapped stream too. With better performance and circumventing the very issue that is worrying you.

I mean no offense but are you certain you're comfortable recommending a solution with this amount of hacking to restore the performance of basic functions? In addition to the posted workaround, I'll need to manually define read methods for other bits types. I realize I haven't benchmarked the two solutions side-by-side yet, but I'm not sure how that's less work (or better performance) than wrappers to Julia C functions.

Unless @JeffBezanson was suggesting that there is a performant way to turn off locking per-stream

Yes, of course that's what I meant. The idea of turning off all locks globally would never be up for consideration. I've never heard of such a thing.

are you certain you're comfortable recommending a solution with this amount of hacking

Yes. Why is it "hacking"? Do you want fast byte-at-a-time reads or not? As he said, @quinnj does all sorts of tricks like this to max out performance for CSV reading etc., and the code is robust and widely used.

In Base there are fallbacks for reading basically everything in terms of UInt8, so if a stream supports that everything should work. A few more custom methods might be needed for the best performance though of course.

are you certain you're comfortable recommending a solution with this amount of hacking

Absolutely. "System libraries" must be safe for a wide range of applications, many of which are not envisioned by the designer of the library. In contrast, code tuned for a particular purpose does not have to meet the same standard. One of the biggest advantages of Julia is that you, the user & package developer, are not a second-class citizen: you can do state-of-the-art work without having to dive down into C/C++/fortran/assembly. And the composability of packages (another Julia strength compared to the competition) means that you likely aren't on your own. I don't know Parsers.jl, but it looks like there's a bunch of stuff in there that would likely be useful to someone interested in writing fast parsers.

You seem to have an attraction to C wrappers, but folks in this thread are telling you that's probably the wrong approach. Folks have posted benchmarks (backed by real-world experience) suggesting that, if you really value speed, your goal should not simply be to recover the performance you had with Julia 1.2: that your goal should be to get another 3x performance on top of what you've already achieved. Moreover, they've already provided open-source packages, and kind offers of assistance on forums, to help you do it. What's holding you back?

performant way to turn off locking per-stream

Should be simply h = lock(open("the file")), but looks like that's currently not implemented (defined, just not yet optimized).

comfortable recommending a solution with this amount of hacking

I'm not entirely sure either why he jumped straight to testing a reimplementation of the existing IOBuffer type (although that is a relatively complete implementation of an IO object). That code should behave about the same as the pre-written one:

using Mmap
MmapReader(file::String) = IOBuffer(Mmap.mmap(file))

(I'm not sure I'd call either one hacking, since their implementations are pretty similar)

Summary of the situation. The choice that was faced in Julia 1.3, given that everything in the language was now potentially multithreaded, was between the following two options:

  1. make I/O thread-unsafe by default, making lots of existing code crash without warning, or
  2. make I/O thread-safe by default, making code using byte-at-a-time I/O significantly slower.

The former is not viable: we can't just make existing code broken without warning. The latter option was deemed better since it doesn't involve making existing code crash and byte-at-a-time I/O is not the best way to do really high-performance I/O anyway. It's also possible that some future compiler work could recover much of the performance lost by locking on each I/O operation.

This is the same tradeoff that libc makes: its I/O functions are also locked so as to be thread-safe. People don't generally do byte-oriented I/O in C because the API doesn't encourage it and because its slow. Julia's byte-oriented I/O in 1.2 and earlier is, as you've noticed, faster than C's, which has perhaps encouraged people like yourself to use it. A couple of things should be noted:

  1. Rewriting your code to call C I/O functions directly will not recover the performance you had in Julia 1.2 since C's I/O functions are slower than Julia's in that version (because of locking).

  2. The choice to have I/O be threadsafe and locking by default is not exactly the reckless, idiotic decision you seem to be making it out to be when you say things like "Is there no adult in the room when development decisions get made?" Unless you think the same thing about the C standards committee. Which maybe you do, in which case perhaps you should get into the PL design game yourself. (Be forewarned: you will get abuse on the Internet from people using your work for free when said free work temporarily inconveniences them.)

We were well aware of the fact that this choice could slow down applications that relied on byte-at-a-time I/O. However, it was unclear if any of those applications would be performance critical, especially since byte-oriented I/O is _not_ the best approach for really high performance. Rather than committing to an API that we might not even need, we decided to wait and see if it was necessary or not. Your issue is the first report we've gotten about someone doing byte-oriented I/O where this slowdown has been a problem. We will consider this issue as a request for allowing each stream to optionally not do locking.

As to "Was it that difficult to make the lock checks optional?" Well, yes, it's a lot of work to make each stream support optional locking in a way that makes unlocked streams as fast as they were on 1.2. Even making seemingly trivial changes to Base Julia is quite a lot of work. Making a change like this correct, fast and general is a significant undertaking. If you feel that it's easy enough that you could do it without too much difficulty, please do make a PR for it. I'm sure someone would be willing to help review your work, so long as you're more civil than you have been in this thread.

Regarding your options, the simplest is: don't upgrade. Just keep using Julia ≤ 1.2 — no work is required, just cap the Julia version in your project file. A slight varation on that is to just put a note in the README indicating that Julia 1.3+ is supported by may be slow. If users want better performance, they are advised to use Julia ≤ 1.2. Also no work required. Rewriting your code to call C directly is not really an option as it will not recover the performance of Julia 1.2 since C has the same issue as Julia 1.3 for the same reasons. Implementing a simple unlocked I/O type that supports just the operations you need would be straightforward (as already demonstrated), and would allow you to support 1.3 without a performance loss and with minimal work. If you're willing to do more work, you can get even better performance then you previously had by changing your code to not do byte-oriented I/O. So you have two zero-work options, a minimal-work option with the same performance, and significant-work option that would give even better performance.

I'm not entirely sure either why he jumped straight to testing a reimplementation of the existing IOBuffer type (although that is a relatively complete implementation of an IO object). That code should behave about the same as the pre-written one

To clarify, my work in Parsers.jl wasn't a simple "jump straight to reimplementing a Base structure", but a pretty involved, step-by-step benchmarked journey of maximizing the performance of parsing integers/floats. The issue is that Base.IOBuffer doesn't allow for well-optimized byte-by-byte parsing workflows, for the following reasons:

  • every read(::IOBuffer, UInt8) includes:
  • those extra checks are costly when doing byte-by-byte, since eof needs to be checked and handled appropriately per byte anyway, to select specific parsing behaviors. So when doing byte-by-byte, I'm essentially doing my own eof checks, with no way to avoid them using IOBuffer (unless I write my own readbyte function, which I did for a while)
  • There are also extra costs from all the loads/stores of using IOBuffer in the first place; is it more convenient? Of course. But when the initial input is a mmapped Vector{UInt8} or String anyway, having specific variables for pos and buffer len cuts down on _a lot_ of extra loads/stores; in read(::IOBuffer, UInt8) there are 5 loads/stores alone, and eof has 2, for example.

This is just to clarify that "just using Base.IOBuffer" isn't actually the best solution currently when doing the lowest-level, byte-by-byte parsing workflows that aim to maximize performance. I remember throwing around the idea of having some of those extra checks use @checkbounds perhaps, so you could do @inbounds read(io, UInt8) to avoid them, but never got around to trying anything out, and wasn't sure if I'd be able to avoid the extra loads/stores anyway, so just dropped down to plain byte vectors and manual position/size tracking.

I'm not entirely sure either why he jumped straight to testing a reimplementation of the existing IOBuffer type (although that is a relatively complete implementation of an IO object). That code should behave about the same as the pre-written one:

using Mmap
MmapReader(file::String) = IOBuffer(Mmap.mmap(file))

Oh, I tried that because it was exactly what @KristofferC suggested. You're right that your suggestion is simpler; it might also provide a speed improvement, still testing. If so, I prefer this approach because I need to keep syntax relatively simple; the planned work involves many researchers spanning a wide range of programming experiences.

Your issue is the first report we've gotten about someone doing byte-oriented I/O where this slowdown has been a problem.

We did get another report about this: https://github.com/JuliaLang/julia/pull/32421#issuecomment-535496885, involving WAV.jl. However, as expected that code was already so sub-optimal that we ended up hugely speeding it up independent of the IOStream locks.

Hi everyone,

I want to review the solutions proposed here and the problems I'm having adopting them.

Mmap: the arguments for using this would be greatly strengthened by explanations of how it handles SIGBUS/SIGSEGV and associated risks. For example, what happens in Julia if there's a connection failure during memory-mapped file I/O? How does Julia handle issues like this? I see that Mmap.mmap() in Linux wraps to C mmap via jl_mmap in sys.c; what signal handlers are present? If none, then what are the risks? The chapter in the Julia manual comprises three docstrings and isn't referenced in any other chapter, so any explanations would be reassuring.

BufferedStreams.jl: with apologies, this isn't a viable solution for us. BufferedInputStream() leads to worse performance than 1.2.0 byte-wise reads with every data format tested; best case is ~50% slower. That's untenable when analyzing GB of data. A small buffer relative to file size increases read time; a large buffer increases memory overhead. The additional code needed to adaptively adjust buffer size is a Herculean labor because each data format has its own trade-off curve, which changes when "optional" sections are present in a file.

C wrappers: To be clear, I don't like C wrappers any more than you do, but they appear to safely restore read speeds to 1.2 levels. I'm pretty surprised that the opinions of wrappers are so overwhelmingly negative; my concern is merely that if I tell a colleague to make read calls with C wrappers for performance, I'm going to get a reply like "if all you're doing is wrapping to C, why use Julia?". I can still make a compelling argument based on data processing performance.

Use Julia ≤v1.2.0: I certainly _can_ tell colleagues to do this, but in your opinion, _should_ I? The only way I can convince people to adopt the Julia language is to demonstrate long-term viability; many colleagues have been coding in Fortran since the early 80s. (This sort of ... sociological inertia, I guess you'd call it ... has been very hard to overcome in the past.) I can make a strong case for switching if I show that Julia's performance isn't version-dependent. But that begs the question: what I can I safely tell colleagues about your plans for read() and thread locking?

Finally: are there other solutions that I missed?

PS HDF5 #609 was found during my tests but appears unrelated.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

tkoolen picture tkoolen  Â·  3Comments

i-apellaniz picture i-apellaniz  Â·  3Comments

yurivish picture yurivish  Â·  3Comments

sbromberger picture sbromberger  Â·  3Comments

Keno picture Keno  Â·  3Comments