Julia: `readdlm` dictionary parsing default behavior

Created on 10 Jun 2016  Â·  19Comments  Â·  Source: JuliaLang/julia

julia> english = readdlm(IOBuffer(readstring("/usr/share/dict/words")),  ' ')
235886×1 Array{Any,2}:
 "A"
 "a"
 "aa"
 "aal"
 "aalii"
 "aam"
 "Aani"
 "aardvark"
 "aardwolf"
 "Aaron"
 "Aaronic"
 "Aaronical"
 "Aaronite"
 "Aaronitic"
 "Aaru"
 "Ab"
 "aba"
 "Ababdeh"
 "Ababua"
 "abac"
 "abaca"
 "abacate"
 "abacay"
 â‹®
 "zymoscope"
 "zymosimeter"
 "zymosis"
 "zymosterol"
 "zymosthenic"
 "zymotechnic"
 "zymotechnical"
 "zymotechnics"
 "zymotechny"
 "zymotic"
 "zymotically"
 "zymotize"
 "zymotoxic"
 "zymurgy"
 "Zyrenian"
 "Zyrian"
 "Zyryan"
 "zythem"
 "Zythia"
 "zythum"
 "Zyzomys"
 "Zyzzogeton"

julia> english_lower = map(lowercase, english);
ERROR: MethodError: no method matching lowercase(::Bool)
 in collect_to!(::Array{String,2}, ::Base.Generator{Array{Any,2},Base.#lowercase}, ::Int64, ::Int64) at ./array.jl:267
 in _collect(::Array{Any,2}, ::Base.Generator{Array{Any,2},Base.#lowercase}, ::Base.EltypeUnknown, ::Base.HasShape) at ./array.jl:249
 in map(::Function, ::Array{Any,2}) at ./abstractarray.jl:1131
 in eval(::Module, ::Any) at ./boot.jl:225
 in macro expansion at ./REPL.jl:92 [inlined]
 in (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:46

julia> Set(typeof(x) for x in english)
Set(Any[Bool,SubString{String},Float64])

julia> english_str = map(strip, split(readstring("/usr/share/dict/words")));

julia> foreach(english_str) do x
       try
       parse(Float64, x)
       println(x)
       end
       end
infinity
Nan
nan
O bug collections

Most helpful comment

Even if the code that's currently in base is going to be improved, there's not much reason for it to remain in this repository and coupled to the release schedule of the core language and compiler.

All 19 comments

What did you expect from can_parse_float? The words that can be parsed as Float64s are already Float64s

julia> filter(t -> isa(t, Float64), english)
3-element Array{Any,1}:
 Inf
 NaN
 NaN

and the second argument in parse should be a string.

julia> parse(Float64, NaN)
ERROR: MethodError: no method matching parse(::Type{Float64}, ::Float64)
Closest candidates are:
  parse{T<:AbstractFloat}(::Type{T<:AbstractFloat}, ::AbstractString)
 in eval(::Module, ::Any) at ./boot.jl:225
 in macro expansion at ./REPL.jl:92 [inlined]
 in (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:46

Yes, I am not thinking this early. I updated the original comment. I feel that this should return a list of strings by default (would be what I would expect) but this is debatable because Inf / NaN / true / false are in the dictionary.

I think every other DLM reader I've encountered will first check if a column is entirely boolean/numeric, and if not will just return strings for each cell. Returning a hybrid of both is weird.

Solution: delete readdlm.

@tanmaykm said he is working on a better readdlm. And deleting it is not a solution, unless you have something better to offer.

How is deleting it not a solution? Both DataFrames and CSV.jl already offer better solutions to this.

Even if the code that's currently in base is going to be improved, there's not much reason for it to remain in this repository and coupled to the release schedule of the core language and compiler.

💯 CSV parsing is complicated enough that it should be moved out of the base language.

Yes, CSV parsing is complicated, but dlm parsing is not. The current Julia repo has a bunch of other things that make things convenient for users of such systems - and I don't see why reading simple delimited files is the thing that has to be booted. Why do I have to install a package to load a simple file?

This implementation is a nightmare and it's awful to maintain when we change things in Base.

The fact that we have readdlm in base leads to also having readcsv in base, which leads to users, naïvely (the fools!) actually trying to use it to read CSV files, only to be told, "don't do that, you should use DataFrames or CSV". That is a ludicrous situation.

I think we should have readdlm only, and have it only support simple
delimited files, not csv, and remove the readcsv function from Base.

readcsv is only 2 lines, and I am fine with removing them to avoid the confusion.

readcsv(io; opts...)          = readdlm(io, ','; opts...)
readcsv(io, T::Type; opts...) = readdlm(io, ',', T; opts...)

The implementation is a nightmare for good reason. I think it will simplify over time as the compiler gets better, and by being strict about the files that it accepts.

+1 for removing. Apart from threads by people complaining about poor performance and lack of options of this function, Julia has been criticized for having a too big standard library. readdlm is a typical case as it's not even in a submodule which could eventually become a separate package.

Yes, I think we should deprecate readcsv right away, and plan to move readdlm into the "standard library" when that process happens, hopefully early in the 0.6 cycle.

reading in simple homogeneous (typed) text files is really handy esp. for testing so delegating this job to CSV.jl or similar package seems like overkill. I think that readdlm should be similar to Numpy's loadtxt. It handles the simple cases _reliably_ by specifying the input type up front and punting on any input that is hard. I would rather have something simple that works reliably than something super complicated and magical that does not.

I would be fine with having something simple, fast and reliable in Base; that is not what we have atm.

Was this page helpful?
0 / 5 - 0 ratings