Apologies if this is covered previously, but I am unable to find an answer after extensive reading. I did find one reference, discussed below.
Let's say I am reading the human genome chromosome 1, approx 0.25GB unzipped. The zipped file is available here
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr1.fa.gz
In this file the first line is a "header" which I don't care about, and the rest is the sequence. I want the output to be one long string representing the sequence, concatenating all lines after throwing away the first line and removing trailing newline characters. A simple-minded Python function that does this is as follows. On my machine it takes less than 700ms to run.
```python
def readfasta(filename):
    f = open(filename)
    l0 = f.readline()
    l0 = f.readline()[0:-1]
    for l in f:
        l0 += l[0:-1]
    return l0
```
A direct translation to Julia takes -- well, it doesn't seem to finish in any reasonable time, but it takes 11 seconds on a file 1/200 the size and the time seems to increase more than linearly with file size.
I found this post apparently addressing the matter: https://groups.google.com/forum/#!topic/julia-dev/UDllYRfm64w
The OP said reading the file into an array of strings helped. I tried that, and the following runs in 20 seconds -- still about 30 times slower than python.
```julia
function readfasta(filename::String)
    f = open(filename)
    l0 = readline(f)
    strarr = map(x::String->x[1:end-1], readlines(f))
    return join(strarr,"")
end
```
A comment on that post suggested using IOBuffer, but I am not clear on how to do this (joining strings after dropping the last newline characters) more efficiently than what is being done above. Any help would be very welcome.
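For reference, the `IOBuffer` approach that comment alludes to can be sketched as follows. This is a sketch, not code from the thread: it assumes the same file layout (one header line followed by sequence lines), and `readfasta_buf` is an illustrative name.

```julia
# Sketch of the IOBuffer idea: append every line into one growable
# in-memory buffer and materialize a single String at the end, instead
# of concatenating immutable strings in a loop.
function readfasta_buf(filename)
    io = IOBuffer()
    open(filename) do f
        readline(f)                 # discard the header line
        for line in eachline(f)
            print(io, chomp(line))  # chomp drops a trailing newline, if any
        end
    end
    return String(take!(io))        # one final copy out of the buffer
end
```

Each append is amortized O(1) per byte, versus the quadratic copying of repeated string concatenation.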
I think an issue is not the best place to ask questions. You can ask in the julia-users group or on Stack Overflow. BTW, if you'd like to load sequences from a FASTA file, Bio.jl has a FASTA parser.
I get ~0.8 seconds in python and ~5 seconds in julia, so it seems like there is a performance issue to fix here. I'm not sure why you're seeing 30x slower; what version of julia are you using?
Thanks for the reopen. This is julia 0.5.0 via Ubuntu PPA (staticfloat).
Did you get 5 secs with exactly the julia code I posted? I get 16-24 secs but generally about 20 secs each time I try. Consistently about 600-700ms with the python code I posted.
Answering @bicycle1885 above. Using Bio.Seq,
```julia
@time for record in open(FASTAReader,"chr1.fa")
    print(length(record.seq))
end
```
takes 19 seconds to run, similar to my code.
Your Julia seems to be much slower than mine (note that the first call includes JIT compiling time):
```julia
julia> @time open(collect, FASTAReader, "chr1.fa");
  2.541710 seconds (1.34 M allocations: 548.084 MB, 9.27% gc time)

julia> @time open(collect, FASTAReader, "chr1.fa");
  1.478964 seconds (81 allocations: 494.553 MB, 15.05% gc time)
```
Also note that your Python code just reads and concatenates strings, while Bio.jl also checks validity and encodes the data into `BioSequence` objects.
As a reference, I ran your Python code on my computer (Python 3.4.1):
```
In [5]: %time s = readfasta("chr1.fa")
CPU times: user 2.18 s, sys: 199 ms, total: 2.38 s
Wall time: 2.41 s
```
and your Julia code (Julia 0.5.0):
```julia
julia> @time readfasta("chr1.fa");
  4.506265 seconds (19.94 M allocations: 1.942 GB, 23.07% gc time)
```
I get
```julia
julia> @time open(collect, FASTAReader, "chr1.fa");
 20.262316 seconds (25.87 M allocations: 690.482 MB, 0.84% gc time)

julia> @time open(collect, FASTAReader, "chr1.fa");
 21.800200 seconds (24.87 M allocations: 874.410 MB, 1.03% gc time)

julia> @time readfasta("chr1.fa");
 22.814977 seconds (19.94 M allocations: 1.978 GB, 4.46% gc time)
```
Note the vastly bigger number of allocations for the Bio.jl code compared to what you see. But for my readfasta function there is no major difference in allocations. This is with a fresh update of julia to the latest in the PPA (still 0.5.0), and after a Pkg.update(). Should I be trying a Julia install from github directly?
I installed julia 0.5.1-pre+4 (2016-10-28 19:39 UTC) from github. Pretty much the same timings as above. Baffled why @JeffBezanson sees faster-but-still-slow performance while @bicycle1885 sees performance comparable to python. Can it be something OS-related? This is Ubuntu Xenial, Intel i7-4770 CPU @ 3.40GHz, 32GB RAM. ZFS filesystem, but I tried reading from an ext4 filesystem and the results are the same.
On my laptop (i5-5200U CPU @2.20 GHz, 4GB RAM, Ubuntu Xenial, ZFS) both of these are much faster: the readfasta function takes 6-16 seconds (very variable, and later calls aren't always faster) and the FASTAReader call takes about 5-6 seconds. I don't understand the difference, since it's a slower machine with less RAM running the same OS. Even so, python 2.7.12 takes 700ms on the function I posted.
If I translate my julia function to python (read a list of lines, truncate them, join them) it takes about 10 seconds, not too different from julia. But if I translate my python function to julia it takes a very very long time apparently because concatenating strings is slow in julia.
This is very mysterious: after a reboot (previous uptime had been about 2 months) the exact same readfasta function as above takes 1.8 seconds, 10x faster than before. The Bio.jl function takes 1.8 seconds too, again 10x faster. The python function takes 0.6 seconds, almost unchanged from earlier.
So I am wondering if it is an OS issue, but if so, why was it affecting julia and not python (or, as far as I can tell, any other software that read/wrote large files, such as bedtools)?
@rsidd what happens when you time the result after running `gc_enable(false)`? I wonder if the performance problem you are seeing has been visited before in #10428
With gc_enable(false) the readfasta time goes down from avg 1.6-1.8 seconds to 1.1-1.2 seconds. If performance slows in the future to the extent I saw previously, I will give it a try and report back.
However, right now with gc_enable(true) I am getting 25% gc time so the speedup I see without gc is reasonable. In the extremely slow numbers I posted earlier, it was 4% gc time.
> it takes a very very long time apparently because concatenating strings is slow in julia
Well, in that case it's because that code in julia expresses an O(n^2) algorithm.
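To make the quadratic behavior concrete, here is a minimal sketch contrasting the two patterns (function names are illustrative, not from the thread):

```julia
# Naive pattern: each *= allocates a fresh string and copies everything
# accumulated so far, so n appends cost O(n^2) bytes of copying.
function concat_naive(lines)
    s = ""
    for l in lines
        s *= l
    end
    return s
end

# Buffered pattern: appends go into a growable byte buffer, O(n) total.
function concat_buffered(lines)
    io = IOBuffer()
    for l in lines
        print(io, l)
    end
    return String(take!(io))
end
```

Both return identical strings; on an input the size of chromosome 1, the first pattern is hopeless while the second finishes in roughly linear time.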
If you are using laptops, be sure to be consistent about the processor's power mode. In my case it took me months to realise that Ubuntu wasn't returning to full power mode when power was plugged back in, but only after a hibernate/resume cycle (and that was a 10-20x performance difference).
Hi. Is Julia using C strings internally to concatenate the data?
It looks a lot like the use of C strings for concatenating a large amount of text. Since those strings are zero-terminated, they behave like a singly linked list, and concatenation with strcat is O(n).
Doing that in a loop (concatenating piece by piece) then makes it effectively O(n^2), which would explain the gradual degradation in performance.
Actually, it is the sum of 1+2+3+4+5+... = sum(1..n) = n*(n+1)/2 iterations.
I ran into this issue in C and Delphi in the past. One workaround was to use a temporary local buffer (on the C stack) to accumulate, say, up to 100 characters, and then concatenate that back to the main string.
A much better solution is to keep two pointers (to the start and end of the string) and use pointer arithmetic for insertion, which makes each append O(1).
Julia does not use C strings. But the algorithm is still O(n^2) with Julia's string implementation, which is what @JeffBezanson wrote above. The solution is fairly simple: use the `sprint` function and write to the `io` object it provides, e.g.
```julia
function readfasta(filename::String)
    f = open(filename)
    l0 = readline(f)
    return sprint() do io
        for line in eachline(f)
            print(io, line[1:end-1])
        end
    end
end
```
When this issue was opened, the default behavior of `eachline` and `readline` was to provide lines including the newline, but that has since been changed to automatically remove the newline unless you pass the `chomp=false` keyword argument, so the new version of this code would simply be:
```julia
function readfasta(filename::String)
    f = open(filename)
    l0 = readline(f)
    return sprint() do io
        for line in eachline(f)
            print(io, line)
        end
    end
end
```
The 0.6 version of Julia also included a much faster `String` implementation, so it would be great to get some feedback from the OP and others in this thread about whether they still see a performance issue, or whether this has been addressed now and the issue can be closed.
I found similar problems with the performance of functions that use `eachline` to iterate over the lines of a large file, using Julia 0.6.4 and 0.7.0-beta2.81. Python is a lot faster than Julia in the following toy example. I think that part of the problem is that the `eachline` function is type unstable.
I've used this large file for the example: https://raw.githubusercontent.com/diegozea/mitos-benchmarks/master/data/PF00089.fasta
I do not perform string concatenation as in the previous example. I'm only using `length` to count the number of characters in each line (in order to do something inside the loop).
Julia 0.6.4 (`chomp=false`):
```julia
function countchars(filename)
    open(filename, "r") do fh
        c = 0
        for line in eachline(fh, chomp=false)
            c += length(line)
        end
        c
    end
end
```
Julia 0.7.0-beta2.81 (where the keyword is `keep=true`):
```julia
function countchars(filename)
    open(filename, "r") do fh
        c = 0
        for line in eachline(fh, keep=true)
            c += length(line)
        end
        c
    end
end
```
```python
def countchars(filename):
    with open(filename, "r") as fh:
        c = 0
        for line in fh:
            c += len(line)
        return c
```
(after compilation in Julia)
```julia
julia> @time countchars("PF00089.fasta")
  0.220063 seconds (1.63 M allocations: 99.055 MiB, 4.79% gc time)
32301307

julia> @time countchars("PF00089.fasta")
  0.421870 seconds (1.63 M allocations: 57.543 MiB, 0.99% gc time)
32301307
```
```
In [3]: %time countchars("PF00089.fasta")
CPU times: user 92.2 ms, sys: 12.3 ms, total: 104 ms
Wall time: 103 ms
Out[3]: 32301307

In [3]: %time countchars("PF00089.fasta")
CPU times: user 154 ms, sys: 32.7 ms, total: 187 ms
Wall time: 174 ms
Out[3]: 32301307
```
Using Traceur and `code_warntype` we can see that Julia 0.6.4 can't infer the return type of `eachline`:
```julia
julia> @code_warntype countchars("PF00089.fasta")
Variables:
  #self# <optimized out>
  filename::String
  #5::##5#6

Body:
  begin
      #5::##5#6 = $(Expr(:new, :(Main.##5#6)))
      SSAValue(2) = "r"
      return $(Expr(:invoke, MethodInstance for open(::##5#6, ::String, ::String), :(Main.open), :(#5), :(filename), SSAValue(2)))
  end::Any
```
```julia
julia> @trace countchars("PF00089.fasta")
(Core.kwfunc)(::Base.#eachline) at boot.jl:237
  returns Any
(Core.kwfunc)(::Type{EachLine}) at boot.jl:237
  returns Any
(Type)(::Array{Any,1}, ::Type{EachLine}, ::IOStream) at :0
  #temp# is assigned as Any at line 0
  ondone is assigned as Function at line 0
  ondone is assigned as Base.##208#211 at line 0
  dynamic dispatch to ((Core.getfield)($(QuoteNode(Core.Box(Base.#call#206))), :contents))(ondone, chomp, , stream) at line 0
  returns Any
(Base.#eachline)(::Array{Any,1}, ::Base.#eachline, ::IOStream) at :0
  #temp# is assigned as Any at line 0
(done)(::EachLine, ::Void) at io.jl:574
  dynamic dispatch to (Base.eof)((Core.getfield)(itr, :stream)) at line 574
(Core.kwfunc)(::Base.#readline) at boot.jl:237
  returns Any
(Base.#readline)(::Array{Any,1}, ::Base.#readline, ::IOStream) at :0
  #temp# is assigned as Any at line 0
(next)(::EachLine, ::Void) at io.jl:578
  dynamic dispatch to ($(QuoteNode(Base.#readline)))($(Expr(:invoke, MethodInstance for vector_any(::Any, ::Vararg{Any,N} where N), :(Base.vector_any), :(:chomp), :((Core.getfield)(itr, :chomp)))), Base.readline, (Core.getfield)(itr, :stream)) at line 578
  returns Tuple{Any,Void}
(#5)(::IOStream) at REPL[6]:3
  line is assigned as Any at line 4
  c is assigned as Int64 at line 3
  c is assigned as Any at line 5
  dynamic dispatch to (Base.eof)((Core.getfield)($(Expr(:invoke, MethodInstance for (::Base.#kw##eachline)(::Array{Any,1}, ::Base.#eachline, ::IOStream), :($(QuoteNode(Base.#eachline))), :($(Expr(:invoke, MethodInstance for vector_any(::Any, ::Vararg{Any,N} where N), :(Base.vector_any), :(:chomp), false))), :(Main.eachline), :fh)), :stream)) at line 4
  dynamic dispatch to ($(QuoteNode(Base.#readline)))($(Expr(:invoke, MethodInstance for vector_any(::Any, ::Vararg{Any,N} where N), :(Base.vector_any), :(:chomp), :((Core.getfield)($(Expr(:invoke, MethodInstance for (::Base.#kw##eachline)(::Array{Any,1}, ::Base.#eachline, ::IOStream), :($(QuoteNode(Base.#eachline))), :($(Expr(:invoke, MethodInstance for vector_any(::Any, ::Vararg{Any,N} where N), :(Base.vector_any), :(:chomp), false))), :(Main.eachline), :fh)), :chomp)))), Base.readline, (Core.getfield)($(Expr(:invoke, MethodInstance for (::Base.#kw##eachline)(::Array{Any,1}, ::Base.#eachline, ::IOStream), :($(QuoteNode(Base.#eachline))), :($(Expr(:invoke, MethodInstance for vector_any(::Any, ::Vararg{Any,N} where N), :(Base.vector_any), :(:chomp), false))), :(Main.eachline), :fh)), :stream)) at line 576
  dynamic dispatch to (Main.length)(line) at line 5
  dynamic dispatch to c + (Main.length)(line) at line 5
  returns Any
```
Julia's profiler also shows a lot of time being spent in `readline` and `vector_any`:
```julia
julia> function countchars(filename)
           open(filename, "r") do fh
               c = 0
               for line in eachline(fh, chomp=false)
                   c += length(line)
               end
               c
           end
       end
countchars (generic function with 1 method)
```
```julia
julia> Profile.print()
313 ./event.jl:73; (::Base.REPL.##1#2{Base.REPL.REPLBackend})()
313 ./REPL.jl:97; macro expansion
313 ./REPL.jl:66; eval_user_input(::Any, ::Base.REPL.REPLBackend)
313 ./boot.jl:235; eval(::Module, ::Any)
313 ./
313 ./profile.jl:23; macro expansion
313 ./iostream.jl:152; open(::##1#2, ::String, ::String)
212 ./REPL[1]:4; (::##1#2)(::IOStream)
86 ./
1 ./iostream.jl:0; #readline#216(::Bool, ::Function, ::IOStream)
61 ./iostream.jl:234; #readline#216(::Bool, ::Function, ::IOStream)
1 ./essentials.jl:0; vector_any(::Any, ::Vararg{Any,N} where N)
72 ./essentials.jl:331; vector_any(::Any, ::Vararg{Any,N} where N)
5 ./essentials.jl:333; vector_any(::Any, ::Vararg{Any,N} where N)
1 ./essentials.jl:335; vector_any(::Any, ::Vararg{Any,N} where N)
19 ./io.jl:574; done
3 ./iostream.jl:74; eof(::IOStream)
100 ./REPL[1]:5; (::##1#2)(::IOStream)
6 ./strings/string.jl:164; length(::String)
 33 ./strings/string.jl:165; length(::String)
```
Using `code_warntype` **in Julia 0.7.0-beta2.81** gives a similar output, so maybe it's the same problem in both versions of Julia:
```julia
julia> @code_warntype countchars("PF00089.fasta")
Body::Any
2 1 ─ %1 = invoke Base.:(#open#304)(:($(QuoteNode(Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}())))::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, open::Function, getfield(Main, Symbol("##7#8"))()::getfield(Main, Symbol("##7#8")), %%filename::String, "r"::Vararg{String,N} where N)::Any
└──       return %1
```
I hope this information helps to solve the performance issue,
Best regards,
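As an aside, when the type of an iterator cannot be inferred in the outer function, a standard mitigation is a function barrier: move the hot loop into an inner function, which Julia then specializes on the concrete type observed at runtime. A sketch using the 0.7 `keep=true` spelling; `countchars_barrier` and `_count` are illustrative names, not code from the thread:

```julia
# The outer call may see eachline(...) typed as Any, but once _count is
# invoked, Julia compiles it for the runtime type of the iterator, so
# the loop body runs with concrete types.
function countchars_barrier(filename)
    open(filename, "r") do fh
        _count(eachline(fh, keep=true))
    end
end

function _count(itr)
    c = 0
    for line in itr
        c += length(line)
    end
    return c
end
```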
The `stream` field of the `EachLine` iterator has type `IO`:
https://github.com/JuliaLang/julia/blob/cbb643366937a85579cb5291c7b835a8e22ce7cb/base/io.jl#L845-L846
But:
```julia
julia> code_warntype(readline, Tuple{IO})
Body::String
429 1 ─      goto 2 if not false
    2 ─ %2 = Base.getfield(%%s, :ios)::Array{UInt8,1}
    │        Base.sle_int(0, 167772160)
    │        Base.ifelse(true, 10, 0)
    │   %5 = φ (false, Bool)
    └──      goto 3 if not %5
    3 ─ %7 = :($(Expr(:foreigncall, :(:jl_array_ptr), Ptr{UInt8}, svec(Any), :(:ccall), 1, Core.SSAValue(2))))::Ptr{UInt8}
    │   %8 = Base.bitcast(Ptr{Nothing}, %7)::Ptr{Nothing}
    │   %9 = :($(Expr(:foreigncall, :(:jl_readuntil), Ref{String}, svec(Ptr{Nothing}, UInt8, UInt8, UInt8), :(:ccall), 4, Core.SSAValue(8), 0x0a, 0x01, 0x02, 0x02, 0x01, 0x0a, Core.SSAValue(2))))::String
    └──      goto 4
    4 ─      return %9

Body::Any
369 1 ─ %1 = Base.:(#readline#276)(Base.nothing, false, %%#self#, %%s)::Any
    └──      return %1
```
The second method that cannot be inferred here is this one:
https://github.com/JuliaLang/julia/blob/cbb643366937a85579cb5291c7b835a8e22ce7cb/base/io.jl#L368-L382
The problem here seems to be that two of the `readuntil` methods (`readuntil(io::Base.AbstractPipe, arg::UInt8; kw...)` in Base at io.jl:232 and `readuntil(this::Base.LibuvStream, c::UInt8; keep)` in Base at stream.jl:769) are inferred as `Any`, and there are too many methods for `String(::Any)` for inference.
Possible solutions (not mutually exclusive):
- Add a type parameter to `EachLine` for the type of `stream`.
- Add `::String` type assertions to the `String(...)` calls.
- Make the `readuntil` methods inferable (assuming they are actually type-stable).

Can confirm that putting `IO` as a type parameter on `EachLine` solves inferrability issues:
```julia
Body::Nothing
2 1 ── %1  = invoke Main.open(%%filename::String, "r"::String)::IOStream
4 │          invoke Core.kwfunc(Main.eachline::Any)
  │    %3  = Base.slt_int(0, 1)::Bool
  └───       goto 3 if not %3
  2 ──       goto 4
  3 ──       invoke Base.getindex(()::Tuple{}, 1::Int64)
  └───       unreachable
  4 ──       goto 5
  5 ──       goto 6
  6 ──       goto 7
  7 ──       invoke Core.kwfunc(Base.EachLine::Any)
  │    %12 = Base.slt_int(0, 1)::Bool
  └───       goto 9 if not %12
  8 ──       goto 10
  9 ──       invoke Base.getindex(()::Tuple{}, 1::Int64)
  └───       unreachable
  10 ─       goto 11
  11 ─       goto 12
  12 ─       goto 13
  13 ─ %20 = new(Base.EachLine{IOStream}, %1, getfield(Base, Symbol("##370#372"))(), true)::Base.EachLine{IOStream}
  └───       goto 14
  14 ─       goto 15
  15 ─       goto 16
  16 ─ %24 = :(Base.iterate)::Core.Compiler.Const(iterate, false)
  │    %25 = invoke %24(%20::Base.EachLine{IOStream}, Base.nothing::Nothing)::Union{Nothing, Tuple{String,Nothing}}
  │    %26 = Core.:(===)(%25, nothing)::Bool
  │    %27 = Base.not_int(%26)::Bool
  └───       goto 19 if not %27
  17 ─ %29 = φ (16 => 0, 18 => %35)::Int64
  │    %30 = φ (16 => %25, 18 => %36)::Union{Nothing, Tuple{String,Nothing}}
  │    %31 = π (%30, Tuple{String,Nothing})
  │    %32 = Core.getfield(%31, 1)::String
  │    %33 = Core.getfield(%31, 2)::Core.Compiler.Const(nothing, false)
5 │    %34 = invoke Main.length(%32::String)::Int64
  │    %35 = Base.add_int(%29, %34)::Int64
  │    %36 = invoke Base.iterate(%20::Base.EachLine{IOStream}, %33::Nothing)::Union{Nothing, Tuple{String,Nothing}}
  │    %37 = Core.:(===)(%36, nothing)::Bool
  │    %38 = Base.not_int(%37)::Bool
  └───       goto 19 if not %38
  18 ─       goto 17
8 19 ─ %41 = Base.getfield(%1, :ios)::Array{UInt8,1}
  │    %42 = :($(Expr(:foreigncall, :(:jl_array_ptr), Ptr{UInt8}, svec(Any), :(:ccall), 1, Core.SSAValue(41))))::Ptr{UInt8}
  │    %43 = Base.bitcast(Ptr{Nothing}, %42)::Ptr{Nothing}
  │    %44 = :($(Expr(:foreigncall, :(:ios_close), Nothing, svec(Ptr{Nothing}), :(:ccall), 1, Core.SSAValue(43), Core.SSAValue(41))))::Nothing
  └───       return %44
```
and improves performance:
Without:
```
0.282460 seconds (1.64 M allocations: 57.981 MiB, 1.74% gc time)
```
With:
```
0.064163 seconds (1.09 M allocations: 49.241 MiB, 5.91% gc time)
```
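The fix being measured here, parameterizing the iterator on its stream type, can be sketched with a standalone iterator. This is an illustrative reimplementation, not the actual Base patch; `TypedEachLine` and its fields are assumed names:

```julia
# With a concrete stream type S, readline(itr.stream) is statically
# dispatched and its String return type is inferable, unlike a field
# declared with the abstract type IO.
struct TypedEachLine{S<:IO}
    stream::S
    keep::Bool
end

Base.IteratorSize(::Type{<:TypedEachLine}) = Base.SizeUnknown()
Base.eltype(::Type{<:TypedEachLine}) = String

function Base.iterate(itr::TypedEachLine, state=nothing)
    eof(itr.stream) && return nothing
    return readline(itr.stream, keep=itr.keep), nothing
end
```

Because `eltype` is known to be `String`, the accumulator in a loop over this iterator stays concretely typed.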
Something seems broken in codegen: even if we annotate the return type, `@profile` seems to indicate it is still spending a large percentage of its time at runtime trying to figure out the return type of `(Core.tuple)(%48, Base.nothing)::Tuple{String,Nothing}` (from `code_typed(iterate, (Base.EachLine, Nothing))`).
With https://github.com/JuliaLang/julia/pull/28253 (Julia 0.7.0-beta2.126), `countchars("PF00089.fasta")` takes 0.108174 seconds on my computer instead of 0.405204 seconds in Julia 0.7.0-beta2.8. So now Julia is as fast as Python 2.7.12 (0.100 s) and faster than Python 3.5.2 (0.168 s).
I think the reason why we are only comparable with Python 2.7.12 in this benchmark is that we are measuring the time spent in `length(::String)` (https://discourse.julialang.org/t/performance-of-length-string/12672):
```julia
julia> Profile.@profile countchars("PF00089.fasta")
32301307

julia> Profile.print()
2 ./io.jl:896; iterate(::Base.EachLine{IOStream}, ::Nothing)
139 ./task.jl:262; (::getfield(REPL, Symbol("##28#29")){REPL.REPLBackend})()
139 /home/zea/bin/Julia/usr/share/julia/stdlib/v0.7/REPL/src/REPL.jl:119; macro expansion
139 /home/zea/bin/Julia/usr/share/julia/stdlib/v0.7/REPL/src/REPL.jl:87; eval_user_input(::Any, ::REPL.REPLBackend)
139 ./boot.jl:319; eval(::Module, ::Any)
139 /home/zea/bin/Julia/usr/share/julia/stdlib/v0.7/Profile/src/Profile.jl:27; top-level scope
139 ./REPL[1]:2; countchars
139 ./iostream.jl:367; open
139 ./iostream.jl:369; #open#304(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::getfield(Main, Symbol("##3#4"...
139 ./REPL[1]:5; (::getfield(Main, Symbol("##3#4")))(::IOStream)
4 ./io.jl:896; iterate(::Base.EachLine{IOStream}, ::Nothing)
4 ./iostream.jl:193; eof
3 ./pointer.jl:66; unsafe_convert
3 ./pointer.jl:65; unsafe_convert
80 ./io.jl:897; iterate(::Base.EachLine{IOStream}, ::Nothing)
1 ./boot.jl:321; kwfunc(::Any)
73 ./none:0; #readline
73 ./iostream.jl:433; #readline#306
53 ./strings/string.jl:269; length(::String)
1 ./int.jl:428; length
13 ./pointer.jl:0; length
1 ./strings/string.jl:273; length
1 ./strings/string.jl:88; codeunit
1 ./gcutils.jl:87; macro expansion
1 ./pointer.jl:105; unsafe_load
1 ./pointer.jl:105; unsafe_load
10 ./strings/string.jl:276; length
10 ./int.jl:53; +
18 ./strings/string.jl:277; length
10 ./strings/string.jl:278; length
10 ./strings/string.jl:88; codeunit
10 ./gcutils.jl:87; macro expansion
10 ./pointer.jl:105; unsafe_load
10 ./pointer.jl:105; unsafe_load
1 ./strings/string.jl:276; length(::String)
```
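If the data is known to be pure ASCII, the Unicode-aware scan in `length(::String)` can be avoided entirely by counting code units, which is O(1) per string. A sketch; the function name is illustrative, and `ncodeunits` requires Julia 0.7 or later:

```julia
# ncodeunits returns the byte length of a string in O(1); for pure
# ASCII data this equals length(line), without scanning each string
# for multi-byte UTF-8 sequences.
function countchars_bytes(filename)
    open(filename, "r") do fh
        c = 0
        for line in eachline(fh, keep=true)
            c += ncodeunits(line)
        end
        return c
    end
end
```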
Counting lines instead of chars, to avoid calling `length`, gives:
```julia
julia> function countlines(filename)
           open(filename, "r") do fh
               c = 0
               for line in eachline(fh, keep=true)
                   c += 1
               end
               c
           end
       end

julia> @time countlines("PF00089.fasta")
  0.059442 seconds (1.09 M allocations: 49.240 MiB, 6.86% gc time)
544104
```
```python
def countlines(filename):
    with open(filename, "r") as fh:
        c = 0
        for line in fh:
            c += 1
        return c
```
```
In [2]: %time countlines("PF00089.fasta")
CPU times: user 56.8 ms, sys: 16.5 ms, total: 73.3 ms
Wall time: 72.7 ms
Out[2]: 544104

In [5]: %time countlines("PF00089.fasta")
CPU times: user 123 ms, sys: 4.26 ms, total: 128 ms
Wall time: 126 ms
Out[5]: 544104
```
So, we are really iterating lines in a file faster than Python :)
```
In [2]: %time s = readfasta("chr1.fa")
CPU times: user 2.05 s, sys: 235 ms, total: 2.29 s
Wall time: 2.29 s
```
```julia
julia> @time readfasta("chr1.fa");
  2.090155 seconds (14.96 M allocations: 1.269 GiB, 19.81% gc time)
```
And using a bit more idiomatic code:
```julia
julia> function readfasta2(filename::String)
           io = IOBuffer()
           f = open(filename)
           l0 = readline(f)
           for line in eachline(f)
               print(io, chomp(line))
           end
           return String(take!(io))
       end

julia> @time readfasta2("chr1.fa");
  1.012338 seconds (14.96 M allocations: 865.145 MiB)
```
I don't think there is anything specific that warrants this issue to be kept open? Feel free to reopen if you disagree.