It has been defined in CSV.jl but now is deprecated. Do we need this function or:
CSV.File(...) |> DataFrame
is enough.
Please comment here to have a record of the dicussion.
CSV.read is pleasantly concise and the name matches other comparable software more closely, e.g. read.csv in R, read_csv in Pandas, etc. DataFrame(CSV.File(...)) is just not discoverable for users coming from other software. I understand the desire for separation of concerns and making CSV less opinionated about the particular tabular structure it uses, but this really seems like a huge blow to usability to me. Not to mention that omitting the ! in DataFrame!(CSV.File(...)), which is quite easy to do, will make unnecessary copies...
Not to mention that omitting the
!inDataFrame!(CSV.File(...)), which is quite easy to do, will make unnecessary copies...
Most of the time you want to omit !. Without it the resulting DataFrame has read-only columns.
The problem with doing a type piracy over CSV.read in DataFrames.jl is that some other package supporting Tables.jl interface could wish to use the same name.
@quinnj - maybe the solution is to define CSV.read in CSV.jl in the following way:
CSV.read(sink, file, args...; kwargs...) = CSV.File(file, args...; kwargs...) |> sink
In such a case CSV.read would be discoverable for the users and they just would learn that they need to specify sink?
Most of the time you want to omit !. Without it the resulting DataFrame has read-only columns.
This isn't accurate as of the 0.7 CSV.jl release; with 0.7, you get back fully mutable arrays.
This isn't accurate
Ah - it is plainly wrong :(. I have not tested this enough yet.
So now indeed using DataFrames! should be preferred if I understand things correctly. You only get SentinelArray if you have a non-PooledArray result that has missing values and still SentinelArray is resizeable and mutable. Is this correct?
I don't see the need to privledge CSV.jl over any other Table type.
Its nice and clear DataFame(table) works for any Table,
and CSV.File is the way to get a table from a CSV.
And there are ways to get tables from other sources that are similar.
Its not like its much extra typing, CSV has a very short package name.
maybe the solution is to define CSV.read in CSV.jl in the following way:
CSV.read(sink, file, args...; kwargs...) = CSV.File(file, args...; kwargs...) |> sink
CSV.read(DataFrame, file) is quite nice and explicit. Another option could be to simply have:
CSV.read(file; kwargs...) = CSV.File(file; kwargs...)
The idea is that CSV.File(file) is a perfectly valid table already (you can iterate over rows, access columns with getproperty), and can be converted to any other table if needed. CSV.read is IMO a more easily discoverable name than CSV.File, so it could be the public interface.
I think having both CSV.read(sink, file) and CSV.read(file) would be best for CSV. If DataFrames then wanted to be more performant they could override CSV.read as:
CSV.read(::Type{DataFrame}, file; kwargs...) = DataFrame!(CSV.File(file; kwargs))
to match Base.read it would want to be CSV.read(file, sink)
Very torn on this one - on the one hand I'm fully in the camp that says DataFrame(CSV.File(...)) is great because it introduces users to a better way of thinking about what CSV can offer, and how independent and composable things in Julia are, but on the other hand I already fear the time I'll spend on Discourse, SO, Slack and Zulip to argue with people claiming "Julia will never be a serious language until it offers a CSV.read function"...
On balance it therefore seems to me that the CSV.read(file, sink) option is the best of both worlds, as it exposes a public API that meets user expectations while at the same time getting people to explicitly think about the concept of a sink and what they actually need in terms of postprocessing.
I'm also kinda torn. Coming from pandas, I initially thought it was sort of absurd to remove/change CSV.read.
I've come around though, @bkamins has convinced me how valuable a concise API is.
That said CSV |> DataFrame is so commonplace (imo) that it should take very few characters, otherwise it runs the risk of being newcomer and REPL unfriendly.
CSV.read(file, sink) definitely seems to be the move here.
PR up! https://github.com/JuliaData/CSV.jl/pull/687. Feel free to comment there on implementation, or here on any other questions/concerns.
FYI, CSV.read now gives a deprecation that users need to do CSV.read(input, DataFrame) explicitly. Depending on when #1764 is implemented, we could consider taking a dependency on the DataFramesCore package and make that the default sink again.
Most helpful comment
CSV.read(DataFrame, file)is quite nice and explicit. Another option could be to simply have:The idea is that
CSV.File(file)is a perfectly valid table already (you can iterate over rows, access columns withgetproperty), and can be converted to any other table if needed.CSV.readis IMO a more easily discoverable name thanCSV.File, so it could be the public interface.