Julia: Reading very large text files

Created on 28 Oct 2016 · 21 Comments · Source: JuliaLang/julia

Apologies if this is covered previously, but I am unable to find an answer after extensive reading. I did find one reference, discussed below.

Let's say I am reading the human genome chromosome 1, approx 0.25GB unzipped. The zipped file is available here
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr1.fa.gz

In this file the first line is a "header" which I don't care about, and the rest is the sequence. I want the output to be one long string representing the sequence, concatenating all lines after throwing away the first line and removing trailing newline characters. A simple-minded Python function that does this is as follows. On my machine it takes less than 700ms to run.

def readfasta(filename):
    f = open(filename)
    l0 = f.readline()
    l0 = f.readline()[0:-1]
    for l in f:
        l0 += l[0:-1]
    return l0

A direct translation to Julia takes -- well, it doesn't seem to finish in any reasonable time, but it takes 11 seconds on a file 1/200 the size and the time seems to increase more than linearly with file size.

I found this post apparently addressing the matter: https://groups.google.com/forum/#!topic/julia-dev/UDllYRfm64w
The OP said reading the file into an array of strings helped. I tried that, and the following runs in 20 seconds -- still about 30 times slower than python.

function readfasta(filename::String)
    f = open(filename)
    l0 = readline(f)
    strarr = map(x::String->x[1:end-1], readlines(f))
    return join(strarr,"")
end

A comment on that post suggested using IOBuffer, but I am not clear on how to do this (joining strings after dropping the last newline characters) more efficiently than what is being done above. Any help would be very welcome.


All 21 comments

I think an issue is not the best place to ask questions. You can ask on the julia-users group or Stack Overflow. BTW, if you'd like to load sequences from a FASTA file, Bio.jl has a FASTA parser.

I get ~0.8 seconds in python and ~5 seconds in julia, so it seems like there is a performance issue to fix here. I'm not sure why you're seeing 30x slower; what version of julia are you using?

Thanks for the reopen. This is julia 0.5.0 via Ubuntu PPA (staticfloat).

Did you get 5 secs with exactly the julia code I posted? I get 16-24 secs but generally about 20 secs each time I try. Consistently about 600-700ms with the python code I posted.

Answering @bicycle1885 above. Using Bio.seq,

@time for record in open(FASTAReader,"chr1.fa")
           print(length(record.seq))
end

takes 19 seconds to run, similar to my code.

Your Julia seems to be much slower than mine (note that the first call includes JIT compiling time):

julia> @time open(collect, FASTAReader, "chr1.fa");
  2.541710 seconds (1.34 M allocations: 548.084 MB, 9.27% gc time)

julia> @time open(collect, FASTAReader, "chr1.fa");
  1.478964 seconds (81 allocations: 494.553 MB, 15.05% gc time)

Also note that what your Python code does is just reading and concatenating strings while Bio.jl does checking validity and encoding data into BioSequence objects.

As a reference, I ran your Python code on my computer (Python 3.4.1):

In [5]: %time s = readfasta("chr1.fa")
CPU times: user 2.18 s, sys: 199 ms, total: 2.38 s
Wall time: 2.41 s

and your Julia code (Julia 0.5.0):

julia> @time readfasta("chr1.fa");
  4.506265 seconds (19.94 M allocations: 1.942 GB, 23.07% gc time)

I get

julia> @time open(collect, FASTAReader, "chr1.fa");
 20.262316 seconds (25.87 M allocations: 690.482 MB, 0.84% gc time)

julia> @time open(collect, FASTAReader, "chr1.fa");
 21.800200 seconds (24.87 M allocations: 874.410 MB, 1.03% gc time)

julia> @time readfasta("chr1.fa");
 22.814977 seconds (19.94 M allocations: 1.978 GB, 4.46% gc time)

Note the vastly bigger number of allocations for the Bio.jl code compared to what you see. But for my readfasta function there is no major difference in allocations. This is with a fresh update of julia to the latest in the PPA (still 0.5.0), and after a Pkg.update(). Should I be trying a Julia install from github directly?

I installed julia 0.5.1-pre+4 (2016-10-28 19:39 UTC) from github. Pretty much the same timings as above. Baffled why @JeffBezanson sees faster-but-still-slow performance while @bicycle1885 sees performance comparable to python. Can it be something OS-related? This is Ubuntu Xenial, Intel i7-4770 CPU @ 3.40GHz, 32GB RAM. ZFS filesystem, but I tried reading from an ext4 filesystem and the results are the same.

On my laptop (i5-5200U CPU @2.20 GHz, 4GB RAM, Ubuntu Xenial, ZFS) both of these are much faster: the readfasta function takes 6-16 seconds (very variable, and later calls aren't always faster) and the FASTAReader call takes about 5-6 seconds. I don't understand the difference, since it's a slower machine with less RAM running the same OS. Even so, python 2.7.12 takes 700ms on the function I posted.

If I translate my julia function to python (read a list of lines, truncate them, join them) it takes about 10 seconds, not too different from julia. But if I translate my python function to julia it takes a very very long time apparently because concatenating strings is slow in julia.

This is very mysterious: after a reboot (previous uptime had been about 2 months) the exact same readfasta function as above takes 1.8 seconds, 10x faster than before. The Bio.jl function takes 1.8 seconds too, again 10x faster. The python function takes 0.6 seconds, almost unchanged from earlier.

So I am wondering if it is an OS issue, but if so, why was it affecting julia and not python (or, as far as I can tell, any other software that read/wrote large files, such as bedtools)?

@rsidd what happens when you time the result after running gc_enable(false)? I wonder if the performance problem you are seeing has been visited before in #10428

With gc_enable(false) the readfasta time goes down from avg 1.6-1.8 seconds to 1.1-1.2 seconds. If performance slows in the future to the extent I saw previously, I will give it a try and report back.

However, right now with gc_enable(true) I am getting 25% gc time so the speedup I see without gc is reasonable. In the extremely slow numbers I posted earlier, it was 4% gc time.

it takes a very very long time apparently because concatenating strings is slow in julia

Well, in that case it's because that code in julia expresses an O(n^2) algorithm.

If you are using laptops, be sure to be consistent in the power mode of the processor. In my case it took me months to realise that Ubuntu wasn't switching to full power mode when power was plugged back in, but only after a hibernation/resume sequence (and that was a 10-20x performance difference).

Hi. Is Julia using C strings internally to concatenate the data?

It looks a lot like the use of C strings for concatenation of a large amount of text. Since those strings are zero-terminated, they behave like a singly linked list, and concatenation is O(n) when using strcat.

Then doing that in a while loop (concatenating characters one by one) is making it effectively O(n^2) which would explain the gradual degradation in performance.

Actually, it is the sum of 1+2+3+4+5+... = sum(1..n) = n*(n+1)/2 iterations.

I met this issue in C and Delphi in the past. A workaround solution was to use a temporary (on the C stack) local variable as a buffer to concatenate say up to 100 characters and then to concatenate that back to the main stream.

Another much better solution is to use two pointers (start/end of string) and pointer arithmetic for insertion at least and that yields O(1).
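The arithmetic in the comment above can be checked with a small cost model. This is only a sketch: the function names are made up for illustration, it counts characters copied rather than measuring time, and it assumes every concatenation builds a brand-new string (some runtimes, CPython among them in common cases, can avoid that copy).

```python
def chars_copied_naive(chunk_lengths):
    """Characters copied when each concatenation builds a brand-new string."""
    total = 0
    current = 0
    for n in chunk_lengths:
        current += n      # length of the newly built string
        total += current  # every character of it is copied into the new string
    return total


def chars_copied_buffered(chunk_lengths):
    """Characters copied when chunks are appended to a growable buffer
    (reallocation of the buffer is ignored in this model)."""
    return sum(chunk_lengths)


n = 1000
unit_chunks = [1] * n
naive = chars_copied_naive(unit_chunks)        # equals n * (n + 1) / 2
buffered = chars_copied_buffered(unit_chunks)  # equals n
print(naive, buffered)
```

Under this model the naive loop grows quadratically with the number of chunks, while appending to a buffer grows linearly.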

Julia does not use C strings. But the algorithm is still O(n^2) with Julia's string implementation, which is what @JeffBezanson wrote above. The solution is fairly simple: use the sprint function and write to the io object it provides, e.g.

function readfasta(filename::String)
    f = open(filename)
    l0 = readline(f)
    return sprint() do io
        for line in eachline(f)
            print(io, line[1:end-1])
        end
    end
end

When this issue was opened, the default behavior of eachline and readline was to provide lines including the newline, but that's since been changed to automatically remove the newline unless you pass the chomp=false keyword argument, so the new version of this code would simply be:

function readfasta(filename::String)
    f = open(filename)
    l0 = readline(f)
    return sprint() do io
        for line in eachline(f)
            print(io, line)
        end
    end
end

The 0.6 version of Julia also included a much faster String implementation, so it would be great to get some feedback from the OP and others in this thread about whether they still see a performance issue or if this has been addressed now and can be closed.
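For readers following the Python side of the comparison, the buffer-based approach suggested above has a rough Python analogue in io.StringIO. The sketch below is not the code anyone benchmarked in this thread: the name readfasta_buffered and the tiny test file are hypothetical stand-ins, and it assumes a single-record FASTA file like chr1.fa.

```python
import io
import os
import tempfile


def readfasta_buffered(filename):
    """Return the sequence of a single-record FASTA file as one string."""
    buf = io.StringIO()
    with open(filename) as f:
        f.readline()                      # discard the header line
        for line in f:
            buf.write(line.rstrip("\n"))  # append without rebuilding the whole string
    return buf.getvalue()


# Tiny stand-in file (hypothetical data, not chr1.fa).
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
    tmp.write(">chrTest\nACGT\nNNAC\nGT\n")
    path = tmp.name

sequence = readfasta_buffered(path)
os.remove(path)
print(sequence)  # ACGTNNACGT
```

The buffer plays the same role as the io object that sprint provides: each line is written once, and the final string is materialised a single time at the end.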

I found similar problems with the performance of functions that use eachline to iterate over the lines of a large file using Julia 0.6.4 and 0.7.0-beta2.81. Python is a lot faster than Julia in the following toy example. I think that part of the problem is that the eachline function is type unstable.

Example

Input

I've used this large file for the example: https://raw.githubusercontent.com/diegozea/mitos-benchmarks/master/data/PF00089.fasta

I do not perform string concatenation as in the previous example. I'm only using length to count the number of characters in each line (in order to do something inside the loop).

Julia 0.6.4
function countchars(filename)
    open(filename, "r") do fh
        c = 0
        for line in eachline(fh, chomp=false)
            c += length(line)
        end
        c
    end
end
Julia 0.7.0-beta2.81
function countchars(filename)
    open(filename, "r") do fh
        c = 0
        for line in eachline(fh, keep=true)
            c += length(line)
        end
        c
    end
end
Python 2.7.12 & Python 3.5.2
def countchars(filename):
    with open(filename, "r") as fh:
        c = 0
        for line in fh:
            c += len(line)
        return c

Timing

(after compilation in Julia)

Julia 0.6.4
julia> @time countchars("PF00089.fasta")
  0.220063 seconds (1.63 M allocations: 99.055 MiB, 4.79% gc time)
32301307
Julia 0.7.0-beta2.81
julia> @time countchars("PF00089.fasta")
  0.421870 seconds (1.63 M allocations: 57.543 MiB, 0.99% gc time)
32301307
Python 2.7.12
In [3]: %time countchars("PF00089.fasta")
CPU times: user 92.2 ms, sys: 12.3 ms, total: 104 ms
Wall time: 103 ms
Out[3]: 32301307
Python 3.5.2
In [3]: %time countchars("PF00089.fasta")
CPU times: user 154 ms, sys: 32.7 ms, total: 187 ms
Wall time: 174 ms
Out[3]: 32301307
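The Python timings above come from IPython's %time. To repeat the measurement from a plain script, something like the following sketch could be used. The helper best_of and the synthetic file are assumptions made for illustration; the file only imitates the shape of PF00089.fasta, so the absolute numbers will differ from those reported here.

```python
import os
import tempfile
import time


def countchars(filename):
    with open(filename, "r") as fh:
        c = 0
        for line in fh:
            c += len(line)
        return c


def best_of(func, arg, repeats=5):
    """Return (result, fastest wall time in seconds) over several runs."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(arg)
        best = min(best, time.perf_counter() - start)
    return result, best


# Synthetic stand-in: 10,000 lines of 60 characters plus a newline each.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(("A" * 60 + "\n") * 10_000)
    path = tmp.name

total, seconds = best_of(countchars, path)
os.remove(path)
print(total, f"{seconds:.6f} s")
```

Taking the fastest of several runs reduces the influence of disk caching and background load, which matters given how much the timings earlier in this thread varied between machines and reboots.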

Type instability of eachline

Using Traceur and code_warntype we can see that Julia 0.6.4 can't infer the returned type of eachline:

julia> @code_warntype countchars("PF00089.fasta")
Variables:
  #self# <optimized out>
  filename::String
  #5::##5#6

Body:
  begin 
      #5::##5#6 = $(Expr(:new, :(Main.##5#6)))
      SSAValue(2) = "r"
      return $(Expr(:invoke, MethodInstance for open(::##5#6, ::String, ::String), :(Main.open), :(#5), :(filename), SSAValue(2)))
  end::Any

julia> @trace countchars("PF00089.fasta")
(Core.kwfunc)(::Base.#eachline) at boot.jl:237
  returns Any
(Core.kwfunc)(::Type{EachLine}) at boot.jl:237
  returns Any
(Type)(::Array{Any,1}, ::Type{EachLine}, ::IOStream) at :0
  #temp# is assigned as Any at line 0
  ondone is assigned as Function at line 0
  ondone is assigned as Base.##208#211 at line 0
  dynamic dispatch to ((Core.getfield)($(QuoteNode(Core.Box(Base.#call#206))), :contents))(ondone, chomp, , stream) at line 0
  returns Any
(Base.#eachline)(::Array{Any,1}, ::Base.#eachline, ::IOStream) at :0
  #temp# is assigned as Any at line 0
(done)(::EachLine, ::Void) at io.jl:574
  dynamic dispatch to (Base.eof)((Core.getfield)(itr, :stream)) at line 574
(Core.kwfunc)(::Base.#readline) at boot.jl:237
  returns Any
(Base.#readline)(::Array{Any,1}, ::Base.#readline, ::IOStream) at :0
  #temp# is assigned as Any at line 0
(next)(::EachLine, ::Void) at io.jl:578
  dynamic dispatch to ($(QuoteNode(Base.#readline)))($(Expr(:invoke, MethodInstance for vector_any(::Any, ::Vararg{Any,N} where N), :(Base.vector_any), :(:chomp), :((Core.getfield)(itr, :chomp)))), Base.readline, (Core.getfield)(itr, :stream)) at line 578
  returns Tuple{Any,Void}
(#5)(::IOStream) at REPL[6]:3
  line is assigned as Any at line 4
  c is assigned as Int64 at line 3
  c is assigned as Any at line 5
  dynamic dispatch to (Base.eof)((Core.getfield)($(Expr(:invoke, MethodInstance for (::Base.#kw##eachline)(::Array{Any,1}, ::Base.#eachline, ::IOStream), :($(QuoteNode(Base.#eachline))), :($(Expr(:invoke, MethodInstance for vector_any(::Any, ::Vararg{Any,N} where N), :(Base.vector_any), :(:chomp), false))), :(Main.eachline), :fh)), :stream)) at line 4
  dynamic dispatch to ($(QuoteNode(Base.#readline)))($(Expr(:invoke, MethodInstance for vector_any(::Any, ::Vararg{Any,N} where N), :(Base.vector_any), :(:chomp), :((Core.getfield)($(Expr(:invoke, MethodInstance for (::Base.#kw##eachline)(::Array{Any,1}, ::Base.#eachline, ::IOStream), :($(QuoteNode(Base.#eachline))), :($(Expr(:invoke, MethodInstance for vector_any(::Any, ::Vararg{Any,N} where N), :(Base.vector_any), :(:chomp), false))), :(Main.eachline), :fh)), :chomp)))), Base.readline, (Core.getfield)($(Expr(:invoke, MethodInstance for (::Base.#kw##eachline)(::Array{Any,1}, ::Base.#eachline, ::IOStream), :($(QuoteNode(Base.#eachline))), :($(Expr(:invoke, MethodInstance for vector_any(::Any, ::Vararg{Any,N} where N), :(Base.vector_any), :(:chomp), false))), :(Main.eachline), :fh)), :stream)) at line 576
  dynamic dispatch to (Main.length)(line) at line 5
  dynamic dispatch to c + (Main.length)(line) at line 5
  returns Any

Julia's profiler also shows a lot of time being expended in readline and vector_any:

julia> function countchars(filename)
           open(filename, "r") do fh
               c = 0
               for line in eachline(fh, chomp=false)
                   c += length(line)
               end
               c
           end
       end
countchars (generic function with 1 method)

```julia
julia> Profile.print()
313 ./event.jl:73; (::Base.REPL.##1#2{Base.REPL.REPLBackend})()
313 ./REPL.jl:97; macro expansion
313 ./REPL.jl:66; eval_user_input(::Any, ::Base.REPL.REPLBackend)
313 ./boot.jl:235; eval(::Module, ::Any)
313 ./:?; anonymous
313 ./profile.jl:23; macro expansion
313 ./iostream.jl:152; open(::##1#2, ::String, ::String)
212 ./REPL[1]:4; (::##1#2)(::IOStream)
86 ./:0; (::Base.#kw##readline)(::Array{Any,1}, ::Base.#readline, ::IOStream)
1 ./iostream.jl:0; #readline#216(::Bool, ::Function, ::IOStream)
61 ./iostream.jl:234; #readline#216(::Bool, ::Function, ::IOStream)
1 ./essentials.jl:0; vector_any(::Any, ::Vararg{Any,N} where N)
72 ./essentials.jl:331; vector_any(::Any, ::Vararg{Any,N} where N)
5 ./essentials.jl:333; vector_any(::Any, ::Vararg{Any,N} where N)
1 ./essentials.jl:335; vector_any(::Any, ::Vararg{Any,N} where N)
19 ./io.jl:574; done
3 ./iostream.jl:74; eof(::IOStream)
100 ./REPL[1]:5; (::##1#2)(::IOStream)
6 ./strings/string.jl:164; length(::String)
33 ./strings/string.jl:165; length(::String)
```

Using `code_warntype` **in Julia 0.7.0-beta2.81** gives a similar output, so maybe it's the same problem in both versions of Julia:
```julia
julia> @code_warntype countchars("PF00089.fasta")
Body::Any
2 1 ─ %1 = invoke Base.:(#open#304)(:($(QuoteNode(Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}())))::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, open::Function, getfield(Main, Symbol("##7#8"))()::getfield(Main, Symbol("##7#8")), %%filename::String, "r"::Vararg{String,N} where N)::Any
  └──      return %1
```

I hope this information helps to solve the performance issue,

Best regards,

The stream field of the EachLine iterator has type IO:
https://github.com/JuliaLang/julia/blob/cbb643366937a85579cb5291c7b835a8e22ce7cb/base/io.jl#L845-L846
But:

julia> code_warntype(readline, Tuple{IO})
Body::String
429 1 ─      goto 2 if not false                                                                      │╻       #readline#304
    2 ┄ %2 = Base.getfield(%%s, :ios)::Array{UInt8,1}                                                 ││╻       getproperty
    │        Base.sle_int(0, 167772160)                                                               │││╻╷╷╷╷   convert
    │        Base.ifelse(true, 10, 0)                                                                 ││││╻       Type
    │   %5 = π (false, Bool)                                                                          ││
    └──      goto 3 if not %5                                                                         ││
    3 ┄ %7 = :($(Expr(:foreigncall, :(:jl_array_ptr), Ptr{UInt8}, svec(Any), :(:ccall), 1, Core.SSAValue(2))))::Ptr{UInt8}
    │   %8 = Base.bitcast(Ptr{Nothing}, %7)::Ptr{Nothing}                                             │││╻       convert
    │   %9 = :($(Expr(:foreigncall, :(:jl_readuntil), Ref{String}, svec(Ptr{Nothing}, UInt8, UInt8, UInt8), :(:ccall), 4, Core.SSAValue(8), 0x0a, 0x01, 0x02, 0x02, 0x01, 0x0a, Core.SSAValue(2))))::String
    └──      goto 4                                                                                   ││
    4 ─      return %9                                                                                │
Body::Any
369 1 ─ %1 = Base.:(#readline#276)(Base.nothing, false, %%#self#, %%s)::Any                                                   │
    └──      return %1

The second method that cannot be inferred here is this one:
https://github.com/JuliaLang/julia/blob/cbb643366937a85579cb5291c7b835a8e22ce7cb/base/io.jl#L368-L382
The problem here seems to be that two of the readuntil methods (readuntil(io::Base.AbstractPipe, arg::UInt8; kw...) in Base at io.jl:232 and readuntil(this::Base.LibuvStream, c::UInt8; keep) in Base at stream.jl:769) are inferred as Any, and there are too many methods for String(::Any) for inference.
Possible solutions (not mutually exclusive):

  • Provide a type parameter to EachLine for the type of stream.
  • Add ::String type assertions to the String(...) calls.
  • Dig into the problematic readuntil methods and make them inferable (assuming they are actually type-stable).

Can confirm that putting IO as a type parameter on EachLine solves inferrability issues:

Body::Nothing
2 1 ── %1  = invoke Main.open(%%filename::String, "r"::String)::IOStream   │
4 │          invoke Core.kwfunc(Main.eachline::Any)                        │
  │    %3  = Base.slt_int(0, 1)::Bool                                      │╻╷╷╷╷   #eachline
  └───       goto 3 if not %3                                              ││┃│││    isempty
  2 ──       goto 4                                                        │││┃││     iterate
  3 ──       invoke Base.getindex(()::Tuple{}, 1::Int64)                   ││││┃│      iterate
  └───       unreachable                                                   │││││┃       iterate
  4 ──       goto 5                                                        │││││
  5 ──       goto 6                                                        │││╻       iterate
  6 ──       goto 7                                                        │││
  7 ──       invoke Core.kwfunc(Base.EachLine::Any)                        ││╻       #eachline#373
  │    %12 = Base.slt_int(0, 1)::Bool                                      │││╻╷╷╷╷   Type
  └───       goto 9 if not %12                                             ││││┃│││    isempty
  8 ──       goto 10                                                       │││││┃││     iterate
  9 ──       invoke Base.getindex(()::Tuple{}, 1::Int64)                   ││││││┃│      iterate
  └───       unreachable                                                   │││││││┃       iterate
  10 ─       goto 11                                                       │││││││
  11 ─       goto 12                                                       │││││╻       iterate
  12 ─       goto 13                                                       │││││
  13 ─ %20 = new(Base.EachLine{IOStream}, %1, getfield(Base, Symbol("##370#372"))(), true)::Base.EachLine{IOStream}
  └───       goto 14                                                       ││││
  14 ─       goto 15                                                       │││
  15 ─       goto 16                                                       ││
  16 ─ %24 = :(Base.iterate)::Core.Compiler.Const(iterate, false)          │
  │    %25 = invoke %24(%20::Base.EachLine{IOStream}, Base.nothing::Nothing)::Union{Nothing, Tuple{String,Nothing}}
  │    %26 = Core.:(===)(%25, nothing)::Bool                               │
  │    %27 = Base.not_int(%26)::Bool                                       │
  └───       goto 19 if not %27                                            │
  17 ┄ %29 = φ (16 => 0, 18 => %35)::Int64                                 │
  │    %30 = φ (16 => %25, 18 => %36)::Union{Nothing, Tuple{String,Nothing}}
  │    %31 = π (%30, Tuple{String,Nothing})                                │
  │    %32 = Core.getfield(%31, 1)::String                                 │
  │    %33 = Core.getfield(%31, 2)::Core.Compiler.Const(nothing, false)    │
5 │    %34 = invoke Main.length(%32::String)::Int64                        │
  │    %35 = Base.add_int(%29, %34)::Int64                                 │╻       +
  │    %36 = invoke Base.iterate(%20::Base.EachLine{IOStream}, %33::Nothing)::Union{Nothing, Tuple{String,Nothing}}
  │    %37 = Core.:(===)(%36, nothing)::Bool                               │
  │    %38 = Base.not_int(%37)::Bool                                       │
  └───       goto 19 if not %38                                            │
  18 ─       goto 17                                                       │
8 19 ┄ %41 = Base.getfield(%1, :ios)::Array{UInt8,1}                       │╻╷      close
  │    %42 = :($(Expr(:foreigncall, :(:jl_array_ptr), Ptr{UInt8}, svec(Any), :(:ccall), 1, Core.SSAValue(41))))::Ptr{UInt8}
  │    %43 = Base.bitcast(Ptr{Nothing}, %42)::Ptr{Nothing}                 │││╻       convert
  │    %44 = :($(Expr(:foreigncall, :(:ios_close), Nothing, svec(Ptr{Nothing}), :(:ccall), 1, Core.SSAValue(43), Core.SSAValue(41))))::Nothing
  └───       return %44

and improves performance:

Without:

  0.282460 seconds (1.64 M allocations: 57.981 MiB, 1.74% gc time)

With:

0.064163 seconds (1.09 M allocations: 49.241 MiB, 5.91% gc time)

Something seems broken in codegen, since even if we annotate the return type, @profile seems to indicate it's still spending a large percentage of its time at runtime trying to figure out the return type of (Core.tuple)(%48, Base.nothing)::Tuple{String,Nothing} (from code_typed(iterate, (Base.EachLine, Nothing)))

With https://github.com/JuliaLang/julia/pull/28253 (Julia 0.7.0-beta2.126) countchars("PF00089.fasta") takes 0.108174 seconds on my computer instead of 0.405204 seconds in Julia 0.7.0-beta2.8. So Julia is now as fast as Python 2.7.12 (0.100 s) and faster than Python 3.5.2 (0.168 s).
I think the reason we are only comparable with Python 2.7.12 in this benchmark is that we are measuring the time spent in length(::String) (https://discourse.julialang.org/t/performance-of-length-string/12672):

julia> Profile.@profile countchars("PF00089.fasta")
32301307

julia> Profile.print()
2   ./io.jl:896; iterate(::Base.EachLine{IOStream}, ::Nothing)
139 ./task.jl:262; (::getfield(REPL, Symbol("##28#29")){REPL.REPLBackend})()
 139 /home/zea/bin/Julia/usr/share/julia/stdlib/v0.7/REPL/src/REPL.jl:119; macro expansion
  139 /home/zea/bin/Julia/usr/share/julia/stdlib/v0.7/REPL/src/REPL.jl:87; eval_user_input(::Any, ::REPL.REPLBackend)
   139 ./boot.jl:319; eval(::Module, ::Any)
    139 /home/zea/bin/Julia/usr/share/julia/stdlib/v0.7/Profile/src/Profile.jl:27; top-level scope
     139 ./REPL[1]:2; countchars
      139 ./iostream.jl:367; open
       139 ./iostream.jl:369; #open#304(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::getfield(Main, Symbol("##3#4"...
        139 ./REPL[1]:5; (::getfield(Main, Symbol("##3#4")))(::IOStream)
         4  ./io.jl:896; iterate(::Base.EachLine{IOStream}, ::Nothing)
          4 ./iostream.jl:193; eof
           3 ./pointer.jl:66; unsafe_convert
            3 ./pointer.jl:65; unsafe_convert
         80 ./io.jl:897; iterate(::Base.EachLine{IOStream}, ::Nothing)
          1  ./boot.jl:321; kwfunc(::Any)
          73 ./none:0; #readline
           73 ./iostream.jl:433; #readline#306
         53 ./strings/string.jl:269; length(::String)
          1  ./int.jl:428; length
          13 ./pointer.jl:0; length
          1  ./strings/string.jl:273; length
           1 ./strings/string.jl:88; codeunit
            1 ./gcutils.jl:87; macro expansion
             1 ./pointer.jl:105; unsafe_load
              1 ./pointer.jl:105; unsafe_load
          10 ./strings/string.jl:276; length
           10 ./int.jl:53; +
          18 ./strings/string.jl:277; length
          10 ./strings/string.jl:278; length
           10 ./strings/string.jl:88; codeunit
            10 ./gcutils.jl:87; macro expansion
             10 ./pointer.jl:105; unsafe_load
              10 ./pointer.jl:105; unsafe_load
         1  ./strings/string.jl:276; length(::String)
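The profile above attributes a large share of the samples to length(::String). A likely explanation, stated here as background rather than something measured in this thread, is that Julia stores a String as UTF-8 and length counts characters, so it has to scan the bytes, whereas the byte count is known up front. The sketch below illustrates that difference in Python on raw UTF-8 bytes; count_codepoints is a hypothetical helper, not a library function.

```python
def count_codepoints(data: bytes) -> int:
    """Count characters in UTF-8 data by skipping continuation bytes (0b10xxxxxx)."""
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


# ASCII letters, one 2-byte character, one 3-byte character, and a newline.
text = "ACGT\u00e9\u4e2d\n"
encoded = text.encode("utf-8")

n_bytes = len(encoded)               # constant-time size lookup
n_chars = count_codepoints(encoded)  # linear scan over every byte
print(n_bytes, n_chars)  # 10 7
```

For a pure line-iteration benchmark, counting bytes or lines avoids this scan, which is what the countlines comparison that follows does.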

Counting lines instead of chars to not call length gives:

Julia 0.7.0-beta2.126
julia> function countlines(filename)
           open(filename, "r") do fh
               c = 0
               for line in eachline(fh, keep=true)
                   c += 1
               end
               c
           end
       end
julia> @time countlines("PF00089.fasta")
  0.059442 seconds (1.09 M allocations: 49.240 MiB, 6.86% gc time)
544104
Python
def countlines(filename):
    with open(filename, "r") as fh:
        c = 0
        for line in fh:
            c += 1
        return c
Python 2.7.12
In [2]: %time countlines("PF00089.fasta")
CPU times: user 56.8 ms, sys: 16.5 ms, total: 73.3 ms
Wall time: 72.7 ms
Out[2]: 544104
Python 3.5.2
In [5]: %time countlines("PF00089.fasta")
CPU times: user 123 ms, sys: 4.26 ms, total: 128 ms
Wall time: 126 ms
Out[5]: 544104

So, we are really iterating lines in a file faster than Python :)

In [2]: %time s = readfasta("chr1.fa")
CPU times: user 2.05 s, sys: 235 ms, total: 2.29 s
Wall time: 2.29 s
julia> @time readfasta("chr1.fa");
  2.090155 seconds (14.96 M allocations: 1.269 GiB, 19.81% gc time)

And using slightly more idiomatic code

julia> function readfasta2(filename::String)
           io = IOBuffer()
           f = open(filename)
           l0 = readline(f)
           for line in eachline(f)
               print(io, chomp(line))
           end
           return String(take!(io))
       end

julia> @time readfasta2("chr1.fa");
  1.012338 seconds (14.96 M allocations: 865.145 MiB)

I don't think there is anything specific that warrants this issue to be kept open? Feel free to reopen if you disagree.
