DataFrames.jl: Do we need CSV.read?

Created on 29 Jun 2020 · 12 Comments · Source: JuliaData/DataFrames.jl

It was defined in CSV.jl but is now deprecated. Do we need this function, or is:

CSV.File(...) |> DataFrame

enough?

Please comment here so that we have a record of the discussion.

decision

bkamins

Most helpful comment

maybe the solution is to define CSV.read in CSV.jl in the following way:
CSV.read(sink, file, args...; kwargs...) = CSV.File(file, args...; kwargs...) |> sink

CSV.read(DataFrame, file) is quite nice and explicit. Another option could be to simply have:

CSV.read(file; kwargs...) = CSV.File(file; kwargs...)

The idea is that CSV.File(file) is a perfectly valid table already (you can iterate over rows, access columns with getproperty), and can be converted to any other table if needed. CSV.read is IMO a more easily discoverable name than CSV.File, so it could be the public interface.
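As a minimal sketch of the point above, assuming CSV.jl and DataFrames.jl are installed: the in-memory data and the column names `a` and `b` are made up for illustration, and the snippet only shows that the object returned by CSV.File can be used as a table before any conversion takes place.

```julia
using CSV, DataFrames  # assumes both packages are installed

# Illustrative in-memory CSV data, so no file on disk is needed.
data = "a,b\n1,x\n2,y\n"

f = CSV.File(IOBuffer(data))

# Iterate over rows; each row exposes its fields as properties.
for row in f
    println(row.a, " => ", row.b)
end

# Access a whole column with getproperty.
col_a = f.a

# Convert to another table type only when it is actually needed.
df = DataFrame(f)
```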

piever on 30 Jun 2020

👍8

All 12 comments

CSV.read is pleasantly concise and the name matches other comparable software more closely, e.g. read.csv in R, read_csv in Pandas, etc. DataFrame(CSV.File(...)) is just not discoverable for users coming from other software. I understand the desire for separation of concerns and making CSV less opinionated about the particular tabular structure it uses, but this really seems like a huge blow to usability to me. Not to mention that omitting the ! in DataFrame!(CSV.File(...)), which is quite easy to do, will make unnecessary copies...

ararslan on 29 Jun 2020

👍2

Not to mention that omitting the ! in DataFrame!(CSV.File(...)), which is quite easy to do, will make unnecessary copies...

Most of the time you want to omit !. Without it the resulting DataFrame has read-only columns.

The problem with committing type piracy on CSV.read in DataFrames.jl is that some other package supporting the Tables.jl interface might want to use the same name.

@quinnj - maybe the solution is to define CSV.read in CSV.jl in the following way:

CSV.read(sink, file, args...; kwargs...) = CSV.File(file, args...; kwargs...) |> sink

In that case CSV.read would be discoverable for users, and they would just learn that they need to specify a sink.
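A rough sketch of the sink-first idea, assuming CSV.jl, DataFrames.jl and Tables.jl are installed. The helper name `read_into` is hypothetical and is used here so that nothing in CSV.jl is overwritten; it is not the definition that CSV.jl ships.

```julia
using CSV, DataFrames, Tables  # assumes these packages are installed

# Hypothetical helper, named read_into so it does not overwrite CSV.read.
# It parses the source and pipes the resulting table into the sink.
read_into(sink, source; kwargs...) = CSV.File(source; kwargs...) |> sink

# Illustrative in-memory data.
data = "a,b\n1,2\n3,4\n"

# Any callable that accepts a Tables.jl table can act as the sink.
df   = read_into(DataFrame, IOBuffer(data))
cols = read_into(Tables.columntable, IOBuffer(data))
rows = read_into(Tables.rowtable, IOBuffer(data))
```

Because the sink is just a callable, the same helper serves DataFrame and any other Tables.jl-compatible constructor without either package depending on the other.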

bkamins on 29 Jun 2020

Most of the time you want to omit !. Without it the resulting DataFrame has read-only columns.

This isn't accurate as of the 0.7 CSV.jl release; with 0.7, you get back fully mutable arrays.

quinnj on 29 Jun 2020

This isn't accurate

Ah - it is plainly wrong :(. I have not tested this enough yet.

So now using DataFrame! should indeed be preferred, if I understand things correctly. You only get a SentinelArray if you have a non-PooledArray result that has missing values, and even then SentinelArray is resizable and mutable. Is this correct?

bkamins on 29 Jun 2020

I don't see the need to privilege CSV.jl over any other Table type.
It's nice and clear that DataFrame(table) works for any Table,
and CSV.File is the way to get a table from a CSV.
And there are similar ways to get tables from other sources.

It's not like it's much extra typing; CSV has a very short package name.

oxinabox on 29 Jun 2020

👍2

maybe the solution is to define CSV.read in CSV.jl in the following way:
CSV.read(sink, file, args...; kwargs...) = CSV.File(file, args...; kwargs...) |> sink

CSV.read(DataFrame, file) is quite nice and explicit. Another option could be to simply have:

CSV.read(file; kwargs...) = CSV.File(file; kwargs...)

piever on 30 Jun 2020

👍8

I think having both CSV.read(sink, file) and CSV.read(file) would be best for CSV. If DataFrames then wanted to be more performant they could override CSV.read as:

CSV.read(::Type{DataFrame}, file; kwargs...) = DataFrame!(CSV.File(file; kwargs...))

iamed2 on 30 Jun 2020

👍2

To match Base.read, it would need to be CSV.read(file, sink).

oxinabox on 30 Jun 2020

👍2

Very torn on this one - on the one hand I'm fully in the camp that says DataFrame(CSV.File(...)) is great because it introduces users to a better way of thinking about what CSV can offer, and how independent and composable things in Julia are, but on the other hand I already fear the time I'll spend on Discourse, SO, Slack and Zulip to argue with people claiming "Julia will never be a serious language until it offers a CSV.read function"...

On balance it therefore seems to me that the CSV.read(file, sink) option is the best of both worlds, as it exposes a public API that meets user expectations while at the same time getting people to explicitly think about the concept of a sink and what they actually need in terms of postprocessing.

nilshg on 30 Jun 2020

👍2

I'm also kinda torn. Coming from pandas, I initially thought it was sort of absurd to remove/change CSV.read.

I've come around, though: @bkamins has convinced me how valuable a concise API is.

That said, CSV |> DataFrame is so commonplace (imo) that it should take very few characters; otherwise it runs the risk of being unfriendly to newcomers and to REPL use.

CSV.read(file, sink) definitely seems to be the move here.

anandijain on 10 Jul 2020

PR up! https://github.com/JuliaData/CSV.jl/pull/687. Feel free to comment there on implementation, or here on any other questions/concerns.

quinnj on 10 Jul 2020

❤2

FYI, CSV.read now emits a deprecation warning telling users to call CSV.read(input, DataFrame) explicitly. Depending on when #1764 is implemented, we could consider taking a dependency on the DataFramesCore package and making that the default sink again.
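For reference, a short sketch of the explicit form next to the piping form from the original question, assuming a CSV.jl release that accepts the sink as the second argument and that DataFrames.jl is installed. The column names and values are invented for illustration.

```julia
using CSV, DataFrames  # assumes both packages are installed

# Illustrative in-memory data; a file path works in the same position.
data = "id,score\n1,0.5\n2,0.75\n"

# Explicit sink as the second argument.
df = CSV.read(IOBuffer(data), DataFrame)

# The equivalent piping form from the original question.
df2 = CSV.File(IOBuffer(data)) |> DataFrame
```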

quinnj on 28 Jul 2020
