https://juliadata.github.io/DataFrames.jl/stable/lib/types/#DataFrames.DataFrameRow mentions using the copy function to convert a row to a NamedTuple, and it just calls NamedTuple(::DataFrameRow)
To convert a GroupKey into a NamedTuple, there is no copy method, but there is NamedTuple(key::GroupKey)
This strikes me as inconsistent, but perhaps there's a reason. My instinct is that the documentation should just suggest using the NamedTuple(::T) method in both cases.
The reason of the design is that DataFrameRow is a view, so copy should de-reference it (like in other views in Base).
The documentation uses copy to highlight the fact that DataFrameRow should be essentially treated as-if it were a NamedTuple except that it is a view.
So there are two action points to be decided:
NamedTuple instead of copy. I understand you think it would be better?copy to GroupKey. This is a new addition so not API is present, as such additions are considered when needed. Are there some purposes where you would you need it?we can update the docs to recommend
NamedTupleinstead ofcopy. I understand you think it would be better?
Yes, that feels more obvious to me.
Actually we could add copy to GroupKey. This is a new addition so not API is present, as such additions are considered when needed. Are there some purposes where you would you need it?
The main surprise for me was:
df = DataFrame(
a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = 1:8)
gd = groupby(df, :a)
@show haskey(gd, (a=1,))
@show in((a=1,), keys(gd))
producing the inconsistent
haskey(gd, (a = 1,)) = true
(a = 1,) in keys(gd) = false
And it's because
@show keys(gd) .== Ref((a=1,))
@show NamedTuple.(keys(gd)) .== Ref((a=1,))
keys(gd) .== Ref((a = 1,)) = Bool[0, 0, 0, 0]
NamedTuple.(keys(gd)) .== Ref((a = 1,)) = Bool[1, 0, 0, 0]
So the journey started with discovering GroupKey: (a = 1,) doesn't equal (a=1,). So I figured I should convert it to a NamedTuple. I had previously learned to use copy for a DataFrameRow. I didn't see acopy(::GroupKey), but I foundNamedTuple(::GroupKey), and then I also foundNamedTuple(::DataFrameRow)`.
In my opinion, the docs should recommend NamedTuple instead of copy, and also ensure that ==(::NamedTuple, ::GroupKey) and vice-versa work so that haskey(df, k) == in(k, keys(df))
OK. I will split it to three PRs as this is independent:
DataFrameRowTuple, NamedTuple, Vector, and Array methods== and isequal (for this I have written down some notes to think about below)Indeed we can add == and isequal definitions.
Intuitively I think that in should use isequal test and use a fast path in case of GroupKeys. Or maybe even we should have == use two-valued logic here, but this is disputable. Not sure what is best (for sure this is intuitive to work this way but doing NamedTuple.(::GroupKeys) will use the standard definition of == so it will lead to an inconsistency). But this can be worked out later
There is also a decision if we should do the same with Tuple form and Dict in the future (see https://github.com/JuliaData/DataFrames.jl/pull/2281). But I think not. We should treat NamedTuple as a canonical representation.
Thanks!
On a related note, I think having equality between GroupKey: (a = 1,) and (a=1,) means that they should also hash to the same value, too.
The question around isequal vs == has to do with what should happen with missing and NaN, right?
Yes - these are the considerations and they ca get tricky if we want to be efficient at the same time.
That is why I left it for later. Still we can get a problem that gk1 == gk1 is true, but Tuple(gk1) == Tuple(gk1) is missing.
On the other hand if we have two GroupedDataFrames that are identical (but on copied data frames) gk1 == gd2 will be false even if NamedTuple(gk1) == NamedTuple(gk2) is true (we take keys referring to the same group).
In short - some rules have to be defined to what extent we want to have a consistency. Feel free to comment on what you feel would be good if you have an opinion.
In #2308 I cover the stuff that does not require such decisions.
I don't think == should differ from its behavior for (named) tuples. If you want to skip missing values use isequal like everywhere.
I understand what you say, but in gk1 == gk1 we know that even if there is a missing inside (representing unknown value) actually it must be the same as it is the same gk1, so it is not possible that these two objects would differ after the missing got eventually measured.
This might be an elucidating example
julia> in(NaN, keys(Dict(NaN=>:a)))
true
julia> in(NaN, collect(keys(Dict(NaN=>:a))))
false
julia> in(missing, keys(Dict(missing=>:a)))
true
julia> in(missing, collect(keys(Dict(missing=>:a))))
missing
The behavior is not handled in == in other parts of the ecosystem, but in collection of keys. So I agree with @nalimilan that the behavior for == should be consistent with NamedTuples, and instead what should be defined is the appropriate in(::GroupKey, ::GroupKeys) and in(::NamedTuple, ::GroupKeys)
Yes, I know it sounds absurd but I'd favor consistency here. If in is defined to use isequal like for AbstractDict (which is the most natural analogy here), we don't really care about == anyway.
Points 1. and 2. from my list are done. So we are left with:
the appropriate
in(::GroupKey, ::GroupKeys)andin(::NamedTuple, ::GroupKeys)
i.e. we would make sure that in works correctly, but leave == and isequal as they are now. Right?
No I wasn't opposed to adding == and isequal methods. Indeed it sounds logical that copy(k::GroupKey) == k, which calls for equality between GroupKey and NamedTuple. I only objected to defining in in a manner which isn't consistent with == or isequal.
So to conclude, is this what we want?
== and isequal between NamedTuple and GroupKey should effectively convert GroupKey to NamedTuple and fallback to the defined comparison.== and isequal between Tuple and GroupKey should effectively convert GroupKey to Tuple and fallback to the defined comparison.== and isequal between GroupKey and GroupKey should check the group index in GroupedDataFrame (so it is always conclusive in particular)in for GroupKeys should have a special implementation that follows keyset rule (i.e. using haskey)haskey to GroupKeys that will use isequal for tests.If yes then @goretkin - let me please know if you want to go forward with this PR. If not I can implement it without a problem.
Makes sense. There's one difficulty with transitivity of equality between Tuple and GroupKey, though: two GroupKey objects with different names but same values would compare equal to a tuple, while they wouldn't compare equal to each other. Also, named tuples and tuples don't compare equal. Implementation-wise, since isequal has to be consistent with hash, we can't define isequal so that GroupKey can compare equal to both tuples and named tuple: hash has to be consistent with one type or the other.
OK - so we should error on Tuple when using ==, isequal to make sure users use only what is properly handled, i.e. NamedTuple (this will avoid mistakes from user's side).
Still haskey and in probably can be supported for Tuple as this is not problematic right?
I'm not aware of any case of == throwing errors, so I'd say no. Otherwise GroupKey might be unusable with some code that expects it to always work.
haskey is OK I'd say, but in would be more problematic since it's tied to the definition of isequal. The inconsistency between haskey and in for tuples isn't great but probably better than the inconsistency between isequal and in.
So for me to be clear. Can you please summarize the rules for Tuple in ==, isequal, haskey and in that you propose here? Thank you.
No methods for ==, isequal and in (equivalent to always false), but keep the current method for haskey.
OK - clear. Just to be sure:
current method for
haskey.
AFAICT we do not have defined it yet, but I understand what we want to do is to do isequal comparison to all fields of GroupKey and if all pass as true then return true. Right?
Yes. At least that would be consistent with the fact that we allow tuples in getindex.
I have to go back and read more carefully. For now:
copy(k::GroupKey) == kIt seems logical, but I don't think this should be a requirement
julia> x = [1, NaN]; copy(x) == x
false
2.
I don't understand what role Tuple should play in this conversation. How I see it is that it's not relevant.
in(::NamedTuple, ::GroupKeys)haskey(::GroupDataFrame, ::NamedTuple)I don't understand what role
Tupleshould play in this conversation.
The role is that we allow to index GroupedDataFrame with Tuple.
In general: yes - please think of it carefully ๐, there is no rush - and it is easy to make a wrong decision, that we would regret later. Thank you!
It seems logical, but I don't think this should be a requirement
Yes, my language was a bit sloppy. What I meant is that copy(k::GroupKey) == k should at least return true in some cases (namely when names and elements are ==).
I think there should be a method
in(::NamedTuple, ::GroupKeys)
consistent with
haskey(::GroupDataFrame, ::NamedTuple)
Yes, we all agree on that AFAICT.
The role is that we allow to index
GroupedDataFramewithTuple.
Ah, okay, I missed that. Thanks for explaining. So then I understand now that this is a question of making getindex(g::GroupedDataFrame, x) and haskey(g::GroupedDataFrame, x) consistent.
Take the definition
function haskey2(c, x)
try
c[x]
return true
catch
end
return false
end
Is it correct to say that currently
haskey2(gd::GroupDataFrame, k::Tuple) could be true while haskey(gd::GroupDataFrame, k::Tuple) would be false?
No. They will be the same currently.
To make things precise. Currently for GroupDataFrame we useDict internally with Tuple as keys, and all behaviour is inherited from this implementation approach.
I understand that we are talking how to transfer these rules to GroupKeys object. And the conclusion is that it should behave the same as Dict that is currently stored inside a GroupedDataFrame. The cost of this is that isequal will be false for two different keys even if haskey passes for both of them when they point to the same group, but we agreed that this is acceptable.
So we can even define something like (probably with better method signatures):
haskey(gks::GroupKeys, x) = haskey(parent(gks), x)
haskey(gks::GroupKeys, x::Groupkey) = parent(gks) == parent(gk)
in(x, gks::GroupKeys) = haskey(gks, x)
and to not touch definitions of == and isequal.
haskey(gks::GroupKeys, x::Groupkey) = parent(gks) == parent(gk)
I guess you meant to use ===? One issue is that one GroupKey for a given GroupedDataFrame could be equal to a GroupKey for another GroupedDataFrame. In that case since haskey(NamedTuple(gk1), gd2) will be true, it would make sense for haskey(gk1, gd2) to be true as well.
in(x, gks::GroupKeys) = haskey(gks, x)
So as discussed above this would ensure in and haskey are consistent even for x::Tuple, but in would then be inconsistent with isequal. Not sure which one is best...
Ah - these things are tricky. Thinking of it I meant haskey(gks::GroupKeys, x::GroupKey) = parent(gks) === parent(gk), so GroupKey should be tied to index into the GroupedDataFrame it was created from. I think that having haskey(gk1, gd2) to be possible to return true makes sense, but I think it would do more harm than good (i.e. when someone writes something like this this is likely to be an error, also with == we have a problem with missing).
inwould then be inconsistent withisequal
This is what we currently have with haskey for GroupedDataFrame. I think we do not have to isequal test to pass, as we "overextend" indexing and haskey anyway here by allowing different types to be treated equally (this is not allowed in Base because there most likely haskey is associated with collections that are writable, but GroupedDataFrame is read-only so it is a less problem I think).
I think that having
haskey(gk1, gd2)to be possible to returntruemakes sense, but I think it would do more harm than good (i.e. when someone writes something like this this is likely to be an error, also with==we have a problem withmissing).
We could throw an error and see whether people complain (probably not).
This is what we currently have with
haskeyforGroupedDataFrame.
What do you mean? Currently for tuples we can have this:
julia> haskey(gd, (1,))
true
julia> (1,) in keys(gd)
false
We could throw an error and see whether people complain
I think it is OK to throw an error - this is probably going to catch some bugs (and it is easy enough to convert to NamedTuple to have interop between two GroupedDataFrames).
What do you mean?
Exactly, and we would have both of them return true or both return false, but what I mean is:
julia> df = DataFrame(g=1)
1ร1 DataFrame
โ Row โ g โ
โ โ Int64 โ
โโโโโโโผโโโโโโโโค
โ 1 โ 1 โ
julia> gdf = groupby(df, :g)
GroupedDataFrame with 1 group based on key: g
First Group (1 row): g = 1
โ Row โ g โ
โ โ Int64 โ
โโโโโโโผโโโโโโโโค
โ 1 โ 1 โ
julia> haskey(gdf, (1,))
true
julia> haskey(gdf, (g=1,))
true
julia> haskey(gdf, keys(gdf)[1])
true
julia> isequal(keys(gdf)[1], (1,))
false
julia> isequal(keys(gdf)[1], (g=1,))
false
So for GroupedDataFrame we already have that haskey is inconsistent with isequal and this is to be expected.
From a docsting of in the relevant part is I think:
Some collections follow a slightly different definition. (...) To test for the presence of a key in a dictionary, use
haskeyork in keys(dict).
So what Base actually requires is that haskey and in are giving the same result and this is what I think we should ensure. On the other hand - Base allows collections some flexibility wrt. == and isequal handling observing that it is not guaranteed and different collections can do it a bit differently. At least this is how I would interpret it :).
So what Base actually requires is that
haskeyandinare giving the same result and this is what I think we should ensure. On the other hand - Base allows collections some flexibility wrt.==andisequalhandling observing that it is not guaranteed and different collections can do it a bit differently. At least this is how I would interpret it :).
"Requires" may be a bit too strong. What the docstring says is that inwith some collections like dicts use isequal instead of ==. This matches what haskey does for dicts. But it doesn't say that other collections should do the same.
Anyway, yeah, it's true that consistency with dicts is a worthy goal. Maybe that's more important than consistency between in and isequal.