Dataframes.jl: Add a wrapper type for passing named tuples to functions when transforming

Created on 17 Feb 2020  ·  11 Comments  ·  Source: JuliaData/DataFrames.jl

See:
https://github.com/JuliaData/DataFrames.jl/issues/1935#issuecomment-586967550
for a discussion.

This applies to: combine, by, select, transform and filter.

non-breaking

All 11 comments

Continuing the discussion from #1935, it's still ambiguous to me how this interacts with broadcasted selection .=> transform pairs.

Consider a DataFrame with an :age column and a large number of measurement columns (for example, let's say [:x, :y, :z]). Perhaps you want to see what the correlation is between age and each other column.

# knowing what they're called in advance, you can do
Splat.(:age, [:x, :y, :z]) .=> cor

# if instead, we don't know what the measurement columns are 
# named, we might use a `Regex` or `Not`
Splat.(:age, Not(:age)) .=> cor

This raises some ambiguity. Syntactically, you can't broadcast Splat (without a specialized method) because Not(:age) doesn't have a length and doesn't yet know what columns it's selecting. Similar ambiguity arises if you're trying to work with pairwise combinations of columns.

I don't think all of these cases necessarily need to be accommodated, as such a niche use case might not warrant syntactic shorthand, but I raise it so we can consider where to draw the line.
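For reference, here is a minimal sketch of one way to sidestep the ambiguity by resolving the column set explicitly before building the pairs. It does not use the proposed Splat wrapper at all; it assumes a DataFrames.jl version in which a vector of columns is auto-splatted into positional arguments and in which names(df, Not(:age)) is available. The data and the cor_age_ output names are made up for illustration.

```julia
using DataFrames, Statistics

# Made-up data: an :age column plus a few measurement columns.
df = DataFrame(age = [1.0, 2.0, 3.0, 4.0],
               x   = [2.0, 4.0, 6.0, 8.0],
               y   = [4.0, 3.0, 2.0, 1.0],
               z   = [1.0, 3.0, 2.0, 5.0])

# Resolve the selector against a concrete data frame first,
# so we know which columns Not(:age) actually refers to.
others = names(df, Not(:age))

# One `[age, col] => cor => output_name` specification per measurement column.
specs = [["age", c] => cor => "cor_age_" * c for c in others]

result = combine(df, specs)
```

The comprehension plays the role that broadcasting over Not(:age) would have played, at the cost of needing the data frame in hand when the specifications are built.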

We could add broadcasting to Splat in the future, but I do not find it a crucial functionality for now. In such complex cases I think it is cleaner to just write:

by(df, :col) do sdf
    whatever_you_need_to_do_with_your_sdf
end

We have decided to auto-splat by default (to lower compilation cost).
The actions are the following:

  1. in this issue we should decide what wrapper we want to use for the old behavior (i.e. to make a function pass a NamedTuple), my initial idea is NT, but it does not seem very nice, maybe Table is a better name.
  2. I will update https://github.com/JuliaData/DataFrames.jl/pull/2091 where I will propose the API for selecting columns for auto-splatting (the difference is that we will allow duplicates in auto-splatting, which was not allowed for NamedTuple API)
  3. Then I will update https://github.com/JuliaData/DataFrames.jl/pull/2080
  4. Then we have to clean up by/combine API (this is the hard part, as we have to go through deprecation period here)
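To make the two calling conventions concrete, here is a rough sketch in plain Julia, without DataFrames.jl. The helpers call_splatted and call_as_namedtuple are hypothetical names used only for illustration; they are not part of any API discussed here.

```julia
# A named tuple of vectors stands in for the selected columns.
table = (a = [1, 2, 3], b = [10, 20, 30])

# Auto-splat: each selected column becomes one positional argument.
call_splatted(f, tbl, cols) = f((getproperty(tbl, c) for c in cols)...)

# Wrapper style: the selected columns arrive as a single NamedTuple.
call_as_namedtuple(f, tbl, cols) =
    f(NamedTuple{Tuple(cols)}(Tuple(getproperty(tbl, c) for c in cols)))

s1 = call_splatted((a, b) -> sum(a .* b), table, (:a, :b))
s2 = call_as_namedtuple(nt -> sum(nt.a .* nt.b), table, (:a, :b))

# Selecting the same column twice is fine when splatting positionally.
same = call_splatted((u, v) -> u == v, table, (:a, :a))
```

The last line hints at why duplicates can be allowed only in the auto-splatting case: a NamedTuple cannot carry two fields with the same name.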

In order to resolve this we need to:

  • settle on the name of the wrapper that signals passing of a NamedTuple; is NT OK?
  • after #2080 is merged, combine has to be updated

We have settled on auto-splatting.

So now the only thing to track is that at some point in the future we should add an NT wrapper to allow passing a named tuple to a function instead of auto-splatting.

NT is probably not a great name, so the crucial thing is to come up with a good one.

I can't think of a better name than NT but am definitely looking forward to this functionality.

An added benefit is that passing a named tuple of vectors means that any function written for generic Tables will work when we pass a named tuple. People can write generic functions using Query or TableOperations and it will work.

This is a good point! Maybe we should settle for NT if no one has a better name (it seems pretty natural and is short).

NT sounds really too obscure. We never use acronyms like this. WithNames? Anything with an explicit name would be better IMHO.

So maybe AsTable?

@nalimilan - this is the last pending decision for 1.0 release. All else is decided and implemented - just waiting for reviews and merging.

So do we want to go for the AsTable name (or do you prefer WithNames)? I prefer AsTable, as it is a bit shorter and the user does not actually care whether the table is a named tuple or something else. We pass a NamedTuple for performance.

Alternatively, we can decide to make a 0.21 release without this feature and discuss it after the release (but possibly before 1.0; we will probably have several months to think about it).

What is your opinion on this?

AsTable sounds good.
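For readers landing here later, a small sketch of how the two styles compare, assuming a DataFrames.jl release that ships the wrapper under the AsTable name agreed above. The data frame and the :dot output name are made up for illustration.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [10, 20, 30])

# Default behavior: the selected columns are auto-splatted
# and passed to the function as positional arguments.
r1 = combine(df, [:a, :b] => ((a, b) -> sum(a .* b)) => :dot)

# With AsTable: the function receives one NamedTuple of column vectors,
# so columns are accessed by name rather than by position.
r2 = combine(df, AsTable([:a, :b]) => (nt -> sum(nt.a .* nt.b)) => :dot)
```

Both calls produce the same one-row result; the AsTable form is the one to reach for when the function is written against a table-like object rather than a fixed number of arguments.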
