Dataframes.jl: `GroupKey` and `DataFrameRow` to `NamedTuple`

Created on 29 Jun 2020  ยท  32Comments  ยท  Source: JuliaData/DataFrames.jl

https://juliadata.github.io/DataFrames.jl/stable/lib/types/#DataFrames.DataFrameRow mentions using the copy function to convert a row to a NamedTuple, and it just calls NamedTuple(::DataFrameRow)

To convert a GroupKey into a NamedTuple, there is no copy method, but there is NamedTuple(key::GroupKey)

This strikes me as inconsistent, but perhaps there's a reason. My instinct is that the documentation should just suggest using the NamedTuple(::T) method in both cases.

breaking doc feature

All 32 comments

The reason of the design is that DataFrameRow is a view, so copy should de-reference it (like in other views in Base).
The documentation uses copy to highlight the fact that DataFrameRow should be essentially treated as-if it were a NamedTuple except that it is a view.

So there are two action points to be decided:

  1. we can update the docs to recommend NamedTuple instead of copy. I understand you think it would be better?
  2. Actually we could add copy to GroupKey. This is a new addition so not API is present, as such additions are considered when needed. Are there some purposes where you would you need it?

we can update the docs to recommend NamedTuple instead of copy. I understand you think it would be better?

Yes, that feels more obvious to me.

Actually we could add copy to GroupKey. This is a new addition so not API is present, as such additions are considered when needed. Are there some purposes where you would you need it?

The main surprise for me was:

df = DataFrame(
  a = repeat([1, 2, 3, 4], outer=[2]),
  b = repeat([2, 1], outer=[4]),
  c = 1:8)

gd = groupby(df, :a)

@show haskey(gd, (a=1,))
@show in((a=1,), keys(gd))

producing the inconsistent

haskey(gd, (a = 1,)) = true
(a = 1,) in keys(gd) = false

And it's because

@show keys(gd) .== Ref((a=1,))
@show NamedTuple.(keys(gd)) .== Ref((a=1,))

keys(gd) .== Ref((a = 1,)) = Bool[0, 0, 0, 0]
NamedTuple.(keys(gd)) .== Ref((a = 1,)) = Bool[1, 0, 0, 0]

So the journey started with discovering GroupKey: (a = 1,) doesn't equal (a=1,). So I figured I should convert it to a NamedTuple. I had previously learned to use copy for a DataFrameRow. I didn't see acopy(::GroupKey), but I foundNamedTuple(::GroupKey), and then I also foundNamedTuple(::DataFrameRow)`.

In my opinion, the docs should recommend NamedTuple instead of copy, and also ensure that ==(::NamedTuple, ::GroupKey) and vice-versa work so that haskey(df, k) == in(k, keys(df))

OK. I will split it to three PRs as this is independent:

  1. documentation for DataFrameRow
  2. adding Tuple, NamedTuple, Vector, and Array methods
  3. adding == and isequal (for this I have written down some notes to think about below)

Indeed we can add == and isequal definitions.

Intuitively I think that in should use isequal test and use a fast path in case of GroupKeys. Or maybe even we should have == use two-valued logic here, but this is disputable. Not sure what is best (for sure this is intuitive to work this way but doing NamedTuple.(::GroupKeys) will use the standard definition of == so it will lead to an inconsistency). But this can be worked out later

There is also a decision if we should do the same with Tuple form and Dict in the future (see https://github.com/JuliaData/DataFrames.jl/pull/2281). But I think not. We should treat NamedTuple as a canonical representation.

Thanks!

On a related note, I think having equality between GroupKey: (a = 1,) and (a=1,) means that they should also hash to the same value, too.

The question around isequal vs == has to do with what should happen with missing and NaN, right?

Yes - these are the considerations and they ca get tricky if we want to be efficient at the same time.
That is why I left it for later. Still we can get a problem that gk1 == gk1 is true, but Tuple(gk1) == Tuple(gk1) is missing.
On the other hand if we have two GroupedDataFrames that are identical (but on copied data frames) gk1 == gd2 will be false even if NamedTuple(gk1) == NamedTuple(gk2) is true (we take keys referring to the same group).

In short - some rules have to be defined to what extent we want to have a consistency. Feel free to comment on what you feel would be good if you have an opinion.

In #2308 I cover the stuff that does not require such decisions.

I don't think == should differ from its behavior for (named) tuples. If you want to skip missing values use isequal like everywhere.

I understand what you say, but in gk1 == gk1 we know that even if there is a missing inside (representing unknown value) actually it must be the same as it is the same gk1, so it is not possible that these two objects would differ after the missing got eventually measured.

This might be an elucidating example

julia> in(NaN, keys(Dict(NaN=>:a)))
true

julia> in(NaN, collect(keys(Dict(NaN=>:a))))
false

julia> in(missing, keys(Dict(missing=>:a)))
true

julia> in(missing, collect(keys(Dict(missing=>:a))))
missing

The behavior is not handled in == in other parts of the ecosystem, but in collection of keys. So I agree with @nalimilan that the behavior for == should be consistent with NamedTuples, and instead what should be defined is the appropriate in(::GroupKey, ::GroupKeys) and in(::NamedTuple, ::GroupKeys)

Yes, I know it sounds absurd but I'd favor consistency here. If in is defined to use isequal like for AbstractDict (which is the most natural analogy here), we don't really care about == anyway.

Points 1. and 2. from my list are done. So we are left with:

the appropriate in(::GroupKey, ::GroupKeys) and in(::NamedTuple, ::GroupKeys)

i.e. we would make sure that in works correctly, but leave == and isequal as they are now. Right?

No I wasn't opposed to adding == and isequal methods. Indeed it sounds logical that copy(k::GroupKey) == k, which calls for equality between GroupKey and NamedTuple. I only objected to defining in in a manner which isn't consistent with == or isequal.

So to conclude, is this what we want?

  1. direct == and isequal between NamedTuple and GroupKey should effectively convert GroupKey to NamedTuple and fallback to the defined comparison.
  2. direct == and isequal between Tuple and GroupKey should effectively convert GroupKey to Tuple and fallback to the defined comparison.
  3. == and isequal between GroupKey and GroupKey should check the group index in GroupedDataFrame (so it is always conclusive in particular)
  4. in for GroupKeys should have a special implementation that follows keyset rule (i.e. using haskey)
  5. I would then add haskey to GroupKeys that will use isequal for tests.

If yes then @goretkin - let me please know if you want to go forward with this PR. If not I can implement it without a problem.

Makes sense. There's one difficulty with transitivity of equality between Tuple and GroupKey, though: two GroupKey objects with different names but same values would compare equal to a tuple, while they wouldn't compare equal to each other. Also, named tuples and tuples don't compare equal. Implementation-wise, since isequal has to be consistent with hash, we can't define isequal so that GroupKey can compare equal to both tuples and named tuple: hash has to be consistent with one type or the other.

OK - so we should error on Tuple when using ==, isequal to make sure users use only what is properly handled, i.e. NamedTuple (this will avoid mistakes from user's side).

Still haskey and in probably can be supported for Tuple as this is not problematic right?

I'm not aware of any case of == throwing errors, so I'd say no. Otherwise GroupKey might be unusable with some code that expects it to always work.

haskey is OK I'd say, but in would be more problematic since it's tied to the definition of isequal. The inconsistency between haskey and in for tuples isn't great but probably better than the inconsistency between isequal and in.

So for me to be clear. Can you please summarize the rules for Tuple in ==, isequal, haskey and in that you propose here? Thank you.

No methods for ==, isequal and in (equivalent to always false), but keep the current method for haskey.

OK - clear. Just to be sure:

current method for haskey.

AFAICT we do not have defined it yet, but I understand what we want to do is to do isequal comparison to all fields of GroupKey and if all pass as true then return true. Right?

Yes. At least that would be consistent with the fact that we allow tuples in getindex.

I have to go back and read more carefully. For now:


  1. > Indeed it sounds logical that copy(k::GroupKey) == k

It seems logical, but I don't think this should be a requirement

julia> x = [1, NaN]; copy(x) == x
false

2.
I don't understand what role Tuple should play in this conversation. How I see it is that it's not relevant.


  1. I think there should be a method
    in(::NamedTuple, ::GroupKeys)
    consistent with
    haskey(::GroupDataFrame, ::NamedTuple)

I don't understand what role Tuple should play in this conversation.

The role is that we allow to index GroupedDataFrame with Tuple.

In general: yes - please think of it carefully ๐Ÿ˜„, there is no rush - and it is easy to make a wrong decision, that we would regret later. Thank you!

It seems logical, but I don't think this should be a requirement

Yes, my language was a bit sloppy. What I meant is that copy(k::GroupKey) == k should at least return true in some cases (namely when names and elements are ==).

I think there should be a method
in(::NamedTuple, ::GroupKeys)
consistent with
haskey(::GroupDataFrame, ::NamedTuple)

Yes, we all agree on that AFAICT.

The role is that we allow to index GroupedDataFrame with Tuple.

Ah, okay, I missed that. Thanks for explaining. So then I understand now that this is a question of making getindex(g::GroupedDataFrame, x) and haskey(g::GroupedDataFrame, x) consistent.

Take the definition

function haskey2(c, x)
    try
        c[x]
        return true
    catch
    end
    return false
end

Is it correct to say that currently
haskey2(gd::GroupDataFrame, k::Tuple) could be true while haskey(gd::GroupDataFrame, k::Tuple) would be false?

No. They will be the same currently.

To make things precise. Currently for GroupDataFrame we useDict internally with Tuple as keys, and all behaviour is inherited from this implementation approach.

I understand that we are talking how to transfer these rules to GroupKeys object. And the conclusion is that it should behave the same as Dict that is currently stored inside a GroupedDataFrame. The cost of this is that isequal will be false for two different keys even if haskey passes for both of them when they point to the same group, but we agreed that this is acceptable.

So we can even define something like (probably with better method signatures):

haskey(gks::GroupKeys, x) = haskey(parent(gks), x)
haskey(gks::GroupKeys, x::Groupkey) = parent(gks) == parent(gk)
in(x, gks::GroupKeys) = haskey(gks, x)

and to not touch definitions of == and isequal.

haskey(gks::GroupKeys, x::Groupkey) = parent(gks) == parent(gk)

I guess you meant to use ===? One issue is that one GroupKey for a given GroupedDataFrame could be equal to a GroupKey for another GroupedDataFrame. In that case since haskey(NamedTuple(gk1), gd2) will be true, it would make sense for haskey(gk1, gd2) to be true as well.

in(x, gks::GroupKeys) = haskey(gks, x)

So as discussed above this would ensure in and haskey are consistent even for x::Tuple, but in would then be inconsistent with isequal. Not sure which one is best...

Ah - these things are tricky. Thinking of it I meant haskey(gks::GroupKeys, x::GroupKey) = parent(gks) === parent(gk), so GroupKey should be tied to index into the GroupedDataFrame it was created from. I think that having haskey(gk1, gd2) to be possible to return true makes sense, but I think it would do more harm than good (i.e. when someone writes something like this this is likely to be an error, also with == we have a problem with missing).

in would then be inconsistent with isequal

This is what we currently have with haskey for GroupedDataFrame. I think we do not have to isequal test to pass, as we "overextend" indexing and haskey anyway here by allowing different types to be treated equally (this is not allowed in Base because there most likely haskey is associated with collections that are writable, but GroupedDataFrame is read-only so it is a less problem I think).

I think that having haskey(gk1, gd2) to be possible to return true makes sense, but I think it would do more harm than good (i.e. when someone writes something like this this is likely to be an error, also with == we have a problem with missing).

We could throw an error and see whether people complain (probably not).

This is what we currently have with haskey for GroupedDataFrame.

What do you mean? Currently for tuples we can have this:

julia> haskey(gd, (1,))
true

julia> (1,) in keys(gd)
false

We could throw an error and see whether people complain

I think it is OK to throw an error - this is probably going to catch some bugs (and it is easy enough to convert to NamedTuple to have interop between two GroupedDataFrames).

What do you mean?

Exactly, and we would have both of them return true or both return false, but what I mean is:

julia> df = DataFrame(g=1)
1ร—1 DataFrame
โ”‚ Row โ”‚ g     โ”‚
โ”‚     โ”‚ Int64 โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 1   โ”‚ 1     โ”‚

julia> gdf = groupby(df, :g)
GroupedDataFrame with 1 group based on key: g
First Group (1 row): g = 1
โ”‚ Row โ”‚ g     โ”‚
โ”‚     โ”‚ Int64 โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 1   โ”‚ 1     โ”‚

julia> haskey(gdf, (1,))
true

julia> haskey(gdf, (g=1,))
true

julia> haskey(gdf, keys(gdf)[1])
true

julia> isequal(keys(gdf)[1], (1,))
false

julia> isequal(keys(gdf)[1], (g=1,))
false

So for GroupedDataFrame we already have that haskey is inconsistent with isequal and this is to be expected.

From a docsting of in the relevant part is I think:

Some collections follow a slightly different definition. (...) To test for the presence of a key in a dictionary, use haskey or k in keys(dict).

So what Base actually requires is that haskey and in are giving the same result and this is what I think we should ensure. On the other hand - Base allows collections some flexibility wrt. == and isequal handling observing that it is not guaranteed and different collections can do it a bit differently. At least this is how I would interpret it :).

So what Base actually requires is that haskey and in are giving the same result and this is what I think we should ensure. On the other hand - Base allows collections some flexibility wrt. == and isequal handling observing that it is not guaranteed and different collections can do it a bit differently. At least this is how I would interpret it :).

"Requires" may be a bit too strong. What the docstring says is that inwith some collections like dicts use isequal instead of ==. This matches what haskey does for dicts. But it doesn't say that other collections should do the same.

Anyway, yeah, it's true that consistency with dicts is a worthy goal. Maybe that's more important than consistency between in and isequal.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

bkamins picture bkamins  ยท  8Comments

cossio picture cossio  ยท  5Comments

mattBrzezinski picture mattBrzezinski  ยท  5Comments

bkamins picture bkamins  ยท  8Comments

CameronBieganek picture CameronBieganek  ยท  6Comments