Julia: `readdlm` dictionary parsing default behavior

Created on 10 Jun 2016 · 19Comments · Source: JuliaLang/julia

julia> english = readdlm(IOBuffer(readstring("/usr/share/dict/words")),  ' ')
235886×1 Array{Any,2}:
 "A"
 "a"
 "aa"
 "aal"
 "aalii"
 "aam"
 "Aani"
 "aardvark"
 "aardwolf"
 "Aaron"
 "Aaronic"
 "Aaronical"
 "Aaronite"
 "Aaronitic"
 "Aaru"
 "Ab"
 "aba"
 "Ababdeh"
 "Ababua"
 "abac"
 "abaca"
 "abacate"
 "abacay"
 ⋮
 "zymoscope"
 "zymosimeter"
 "zymosis"
 "zymosterol"
 "zymosthenic"
 "zymotechnic"
 "zymotechnical"
 "zymotechnics"
 "zymotechny"
 "zymotic"
 "zymotically"
 "zymotize"
 "zymotoxic"
 "zymurgy"
 "Zyrenian"
 "Zyrian"
 "Zyryan"
 "zythem"
 "Zythia"
 "zythum"
 "Zyzomys"
 "Zyzzogeton"

julia> english_lower = map(lowercase, english);
ERROR: MethodError: no method matching lowercase(::Bool)
 in collect_to!(::Array{String,2}, ::Base.Generator{Array{Any,2},Base.#lowercase}, ::Int64, ::Int64) at ./array.jl:267
 in _collect(::Array{Any,2}, ::Base.Generator{Array{Any,2},Base.#lowercase}, ::Base.EltypeUnknown, ::Base.HasShape) at ./array.jl:249
 in map(::Function, ::Array{Any,2}) at ./abstractarray.jl:1131
 in eval(::Module, ::Any) at ./boot.jl:225
 in macro expansion at ./REPL.jl:92 [inlined]
 in (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:46

julia> Set(typeof(x) for x in english)
Set(Any[Bool,SubString{String},Float64])

julia> english_str = map(strip, split(readstring("/usr/share/dict/words")));

julia> foreach(english_str) do x
       try
       parse(Float64, x)
       println(x)
       end
       end
infinity
Nan
nan

O bug collections

Source

jakebolewski

Most helpful comment

Even if the code that's currently in base is going to be improved, there's not much reason for it to remain in this repository and coupled to the release schedule of the core language and compiler.

tkelman on 10 Jun 2016

👍3

All 19 comments

What did you expect from can_parse_float? The words that can be parsed as Float64s are already Float64s

julia> filter(t -> isa(t, Float64), english)
3-element Array{Any,1}:
 Inf
 NaN
 NaN

and the second argument in parse should be a string.

julia> parse(Float64, NaN)
ERROR: MethodError: no method matching parse(::Type{Float64}, ::Float64)
Closest candidates are:
  parse{T<:AbstractFloat}(::Type{T<:AbstractFloat}, ::AbstractString)
 in eval(::Module, ::Any) at ./boot.jl:225
 in macro expansion at ./REPL.jl:92 [inlined]
 in (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:46

andreasnoack on 10 Jun 2016

Yes, I am not thinking this early. I updated the original comment. I feel that this should return a list of strings by default (would be what I would expect) but this is debatable because Inf / NaN / true / false are in the dictionary.

jakebolewski on 10 Jun 2016

I think every other DLM reader I've encountered will first check if a column is entirely boolean/numeric, and if not will just return strings for each cell. Returning a hybrid of both is weird.

malmaud on 10 Jun 2016

👍3

Solution: delete readdlm.

StefanKarpinski on 10 Jun 2016

👎1 👍1

malmaud on 10 Jun 2016

😄2

@tanmaykm said he is working on a better readdlm. And deleting it is not a solution, unless you have something better to offer.

ViralBShah on 10 Jun 2016

How is deleting it not a solution? Both DataFrames and CSV.jl already offer better solutions to this.

StefanKarpinski on 10 Jun 2016

Even if the code that's currently in base is going to be improved, there's not much reason for it to remain in this repository and coupled to the release schedule of the core language and compiler.

tkelman on 10 Jun 2016

👍3

💯 CSV parsing is complicated enough that it should be moved out of the base language.

StefanKarpinski on 10 Jun 2016

Yes, CSV parsing is complicated, but dlm parsing is not. The current Julia repo has a bunch of other things that make things convenient for users of such systems - and I don't see why reading simple delimited files is the thing that has to be booted. Why do I have to install a package to load a simple file?

ViralBShah on 10 Jun 2016

This implementation is a nightmare and it's awful to maintain when we change things in Base.

StefanKarpinski on 10 Jun 2016

The fact that we have readdlm in base leads to also having readcsv in base, which leads to users, naïvely (the fools!) actually trying to use it to read CSV files, only to be told, "don't do that, you should use DataFrames or CSV". That is a ludicrous situation.

StefanKarpinski on 10 Jun 2016

I think we should have readdlm only, and have it only support simple
delimited files, not csv, and remove the readcsv function from Base.

JeffBezanson on 10 Jun 2016

readcsv is only 2 lines, and I am fine with removing them to avoid the confusion.

readcsv(io; opts...)          = readdlm(io, ','; opts...)
readcsv(io, T::Type; opts...) = readdlm(io, ',', T; opts...)

ViralBShah on 10 Jun 2016

The implementation is a nightmare for good reason. I think it will simplify over time as the compiler gets better, and by being strict about the files that it accepts.

ViralBShah on 10 Jun 2016

+1 for removing. Apart from threads by people complaining about poor performance and lack of options of this function, Julia has been criticized for having a too big standard library. readdlm is a typical case as it's not even in a submodule which could eventually become a separate package.

nalimilan on 10 Jun 2016

Yes, I think we should deprecate readcsv right away, and plan to move readdlm into the "standard library" when that process happens, hopefully early in the 0.6 cycle.

JeffBezanson on 10 Jun 2016

reading in simple homogeneous (typed) text files is really handy esp. for testing so delegating this job to CSV.jl or similar package seems like overkill. I think that readdlm should be similar to Numpy's loadtxt. It handles the simple cases _reliably_ by specifying the input type up front and punting on any input that is hard. I would rather have something simple that works reliably than something super complicated and magical that does not.

jakebolewski on 15 Jun 2016

I would be fine with having something simple, fast and reliable in Base; that is not what we have atm.

StefanKarpinski on 15 Jun 2016

Was this page helpful?

0 / 5 - 0 ratings