DataFrames.jl: public API for accessing row number of `DataFrameRow`

One corner case I just thought of for this:

julia> df = DataFrame(a = [1, 1, 2,2], b = rand(4));

julia> gd = groupby(df, "a");

julia> t = gd[2][1,:];

julia> getfield(t, :row)
3

pdeffebach on 22 Jul 2020

The corner case is that currently row always points at the parent DataFrame, not at the row of the SubDataFrame. This is for performance reasons (no intermediate index calculation is required). DataFrameRow is not even aware of the existence of SubDataFrame. Since we have not exposed row in the past, this was not a problem. However, if we were to add this functionality, maybe this should be changed.

bkamins on 22 Jul 2020

Note that the behavior of unrolling to the true parent is the same as in Base:

julia> x = rand(3,4)
3×4 Array{Float64,2}:
 0.50024   0.728962   0.0994949  0.989262
 0.644985  0.185798   0.695228   0.260861

julia> x = reshape(1:16, 4, 4)
4×4 reshape(::UnitRange{Int64}, 4, 4) with eltype Int64:
 1  5   9  13
 2  6  10  14
 3  7  11  15
 4  8  12  16

julia> y = view(x, 2:3, 2:3)
2×2 view(reshape(::UnitRange{Int64}, 4, 4), 2:3, 2:3) with eltype Int64:
 6  10
 7  11

julia> parent(y)
4×4 reshape(::UnitRange{Int64}, 4, 4) with eltype Int64:
 1  5   9  13
 2  6  10  14
 3  7  11  15
 4  8  12  16

julia> parentindices(y)
(2:3, 2:3)

julia> z = view(y, 2:2, 2:2)
1×1 view(reshape(::UnitRange{Int64}, 4, 4), 3:3, 3:3) with eltype Int64:
 11

julia> parent(z)
4×4 reshape(::UnitRange{Int64}, 4, 4) with eltype Int64:
 1  5   9  13
 2  6  10  14
 3  7  11  15
 4  8  12  16

julia> parentindices(z)
(3:3, 3:3)

bkamins on 23 Jul 2020

This also reminded me that you can use parentindices to get row and column indices from the parent:

julia> df = DataFrame(x)
4×4 DataFrame
│ Row │ x1    │ x2    │ x3    │ x4    │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 5     │ 9     │ 13    │
│ 2   │ 2     │ 6     │ 10    │ 14    │
│ 3   │ 3     │ 7     │ 11    │ 15    │
│ 4   │ 4     │ 8     │ 12    │ 16    │

julia> dfy = view(df, 2:3, 2:3)
2×2 SubDataFrame
│ Row │ x2    │ x3    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 6     │ 10    │
│ 2   │ 7     │ 11    │

julia> parent(dfy)
4×4 DataFrame
│ Row │ x1    │ x2    │ x3    │ x4    │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 5     │ 9     │ 13    │
│ 2   │ 2     │ 6     │ 10    │ 14    │
│ 3   │ 3     │ 7     │ 11    │ 15    │
│ 4   │ 4     │ 8     │ 12    │ 16    │

julia> parentindices(dfy)
(2:3, 2:3)

julia> r = dfy[2, :]
DataFrameRow
│ Row │ x2    │ x3    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 3   │ 7     │ 11    │

julia> parent(r)
4×4 DataFrame
│ Row │ x1    │ x2    │ x3    │ x4    │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 5     │ 9     │ 13    │
│ 2   │ 2     │ 6     │ 10    │ 14    │
│ 3   │ 3     │ 7     │ 11    │ 15    │
│ 4   │ 4     │ 8     │ 12    │ 16    │

julia> parentindices(r)
(3, 2:3)

So, in summary, the answer to your question is to use parentindices and take its first element.

However, as in Base, you get a row index in the source DataFrame, which is probably not what you wanted. Is this correct?

(if yes - then please comment and we can think of changing the design)

bkamins on 23 Jul 2020

For concreteness, this is the current behavior when filtering a SubDataFrame using row numbers.

julia> df = DataFrame(a = [1, 1, 2, 2], b = rand(4));

julia> gd = groupby(df, :a);

julia> combine(gd) do sdf
           filter(r -> getfield(r, :row) == 1, sdf)
       end
1×2 DataFrame
│ Row │ a     │ b       │
│     │ Int64 │ Float64 │
├─────┼───────┼─────────┤
│ 1   │ 1     │ 0.45063 │

pdeffebach on 24 Jul 2020

This is expected now (apart from the fact that field access is not part of the public API and rather parentindices(r)[1] should be used). My question is - do we want to change it?
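A minimal sketch of the public-API route mentioned above: reading the parent row index of a row taken from a group with parentindices instead of getfield. The column values are made up for illustration and fixed rather than random.

```julia
using DataFrames

# Illustrative data; b is fixed so the result is reproducible.
df = DataFrame(a = [1, 1, 2, 2], b = [0.1, 0.2, 0.3, 0.4])
gd = groupby(df, :a)

# First row of the second group (a SubDataFrame covering parent rows 3 and 4).
r = gd[2][1, :]

# Public counterpart of getfield(r, :row): the row index in the parent DataFrame.
parent_row = parentindices(r)[1]
println(parent_row)  # 3
```

Note that the index refers to df, not to the group, which is exactly the corner case raised at the top of the thread.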

bkamins on 24 Jul 2020

row_number is used in SQL and dplyr

matthieugomez on 26 Jul 2020

So the approach would be to add another field in DataFrameRow that would remember this value and add an exported method rownumber that would return it. Do you think we should also store the reference to the direct parent of DataFrameRow (rather than the true parent - as we currently do)?

bkamins on 26 Jul 2020

As long as we don't provide any public function to access the direct parent, storing it doesn't make sense. So the question is really: should parent(row) return the direct parent? That sounds logical, though it adds yet another field. But maybe it doesn't really matter for performance since DataFrameRow is quite slow already due to type instability.

nalimilan on 27 Jul 2020

> parent(row) return the direct parent

It should not, as it is not consistent with Base. The design that we have now with parent and parentindices is made to match what Base does.

> it adds yet another field. But maybe it doesn't really matter for performance since DataFrameRow is quite slow already due to type instability.

This was my thinking, and that is why I asked, as we would have to add another field to return rownumber anyway.
Though we would need another function name for this, e.g. rowsource?

bkamins on 27 Jul 2020

Just to add: parent in Base guarantees not to return a view, and this is the invariant we keep.

bkamins on 27 Jul 2020

Also, I would add rowsource only if we feel it is needed. I can see the need for rownumber, but maybe when someone works with a DataFrameRow it is never necessary to know rowsource, as it is either obvious or not required?
The reason is that rownumber is an Int, while rowsource would add a parameter to DataFrameRow (leading to additional compilation passes in some cases, and we already have 2 parameters there). Of course this is a minor issue, and if you feel rowsource is needed I will add it.

bkamins on 27 Jul 2020

This wouldn't work in transform(df, AsTable(:) => ByRow(row) => "rownum"), right? Since that is a Tables.namedtupleiterator rather than a DataFrameRow

W.r.t. rowsource: if you are taking a row from a SubDataFrame, I can't imagine a use-case where you need to know the row number of the original data frame.

pdeffebach on 27 Jul 2020

> This wouldn't work in transform(df, AsTable(:) => ByRow(row) => "rownum"), right?

Right

> I can't imagine a use-case where you need to know the row number of the original data frame.

It is needed if you are writing code that should be fast and want to avoid the indirection of computing the actual row number in the true source. This is exactly how it is used now internally (and I guess this is the use case in Base for view in general).

The question is whether you see a use case for rowsource?

bkamins on 27 Jul 2020

Sorry, I'm not quite sure what rowsource returns. The SubDataFrame?

No, I don't see a use-case for that. I see a use case for rownumber for sure. But the fact that it wouldn't work in piping makes me wonder if it's worth introducing an inconsistency.

pdeffebach on 27 Jul 2020

rowsource is not present yet. If we introduced it, it would return the direct data frame that the DataFrameRow was created from.

There is no inconsistency. We will have parentindices which work like in Base and parent that also will work like in Base.

Then rownumber and rowsource would be new functions just in DataFrames.jl.

It is not useful in piping, but it is useful e.g. with eachrow (rownumber for sure; rowsource not that much as when you write eachrow(df) you are likely to know df :))
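As a hedged illustration of the eachrow use mentioned above, the sketch below assumes rownumber ends up exported under that name and accepts a DataFrameRow, as proposed in this thread. The data and the labels variable are made up for the example.

```julia
using DataFrames

# Illustrative data only.
df = DataFrame(name = ["a", "b", "c"])

# rownumber is assumed to be available as proposed in this thread;
# it gives the position of each row in the data frame being iterated.
labels = [string(rownumber(r), ":", r.name) for r in eachrow(df)]
println(labels)  # ["1:a", "2:b", "3:c"]
```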

bkamins on 27 Jul 2020

okay. My votes are thus no to rowsource, yes to rownumber and maybe yes to some future hack to get it to work for piping.

pdeffebach on 27 Jul 2020

Yeah, rowsource sounds a little too strange to me. My vote is for rownumber as well.

ExpandingMan on 27 Jul 2020

Agreed, rownumber is the only function which is likely to be useful.

We could also provide a special object to get the row number without creating a DataFrameRow, e.g. transform(gd, RowNumber() => (n -> n ./ length(n)) => :rank).

nalimilan on 27 Jul 2020

Yes, then RowNumber() should be in general allowed in things like [:x1, RowNumber()]

bkamins on 27 Jul 2020

Apart from RowNumber(), which we keep track of in https://github.com/JuliaData/DataFrames.jl/issues/2328, is the conclusion here that we just add rownumber? (We should add it before 1.0 as it will change the memory layout of DataFrameRow.)

bkamins on 5 Aug 2020

> we should add it before 1.0 as it will change memory layout of DataFrameRow.

I don't think we should treat internal fields as part of the public API. Otherwise our life is going to be quite difficult after 1.0. :-)

nalimilan on 6 Aug 2020

Agreed - still I would prefer to break such things as rarely as possible. Anyway - this seems to be a simple PR. Do we want to add rownumber function?

bkamins on 6 Aug 2020

see https://github.com/JuliaData/DataFrames.jl/pull/2356

bkamins on 7 Aug 2020

Something I realized is that with this addition, having parent(row) return the original parent doesn't make a lot of sense (apart from ensuring consistency with SubArray). It could be misleading and bug-prone if people call parent(row) and rownumber(row) expecting these to be the same. That's even more dangerous due to the fact that the discrepancy will only appear when using SubDataFrame, which isn't the most commonly tested pattern.

Maybe it would be better to accept the (limited) inconsistency with SubArray, and have parent(row) return the direct parent (storing it in an additional field)? SubArray doesn't have the same problem as we do, as it's less frequent to ask what's the index of a value in the direct parent; also there performance matters more than for DataFrameRow, which is slow anyway.
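The discrepancy described above can be sketched as follows. This assumes rownumber behaves as proposed in this thread (row number in the data frame the row was taken from), while parent and parentindices keep their current Base-like behavior. The data is illustrative.

```julia
using DataFrames

# Illustrative data only.
df = DataFrame(x = [10, 20, 30, 40])
sdf = view(df, 3:4, :)   # SubDataFrame covering parent rows 3 and 4
r = sdf[1, :]            # first row of the view, i.e. parent row 3

n_direct = rownumber(r)          # position within sdf
n_parent = parentindices(r)[1]   # position within df

# Combining parent(r) with rownumber(r) silently picks a different row.
wrong = parent(r)[n_direct, :x]
right = parent(r)[n_parent, :x]

println((n_direct, n_parent))  # (1, 3)
println((wrong, right, r.x))   # (10, 30, 30)
```

The mismatch only shows up because r was taken from a SubDataFrame; for a row taken directly from df both numbers coincide.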

nalimilan on 8 Aug 2020

This was my concern originally and that is why:

- I have added clear examples in the PR showing that this is not the case (a small help but still ...)
- I have asked above if we want to be able to get a parent of DataFrameRow. I would not change how parent works as it is combined with parentindices (which refers to the DataFrame where data is actually stored). So maybe we should add another function like rowsource (the name I proposed above) that would return a direct parent.

I am not sure what is best (fortunately this feature is not pressing to add, and we can think about the best option).

(incidentally - I would expect parent and parentindices to work exactly as they work now even if I did not know DataFrames.jl - but probably I am in a minority, because probably most users do not think too much about what parent does in Base :smile:)

bkamins on 9 Aug 2020

Let's go with that, but then we should probably avoid advertising parent, as it's only useful for low-level code. If we find ourselves recommending parent (as we did lately to get back the original data frame after applying transform on a SubDataFrame subset of rows), we should introduce a function like rowsource. Though it would be nice to find a more general name that also makes sense for SubDataFrame.

This doesn't have to block adding rownumber though.

nalimilan on 9 Aug 2020

So let us move forward with #2356 as it is unaffected.

The comment about parent is extremely relevant, I think. We should be very clear (and add more documentation on it) that parent always goes down to the DataFrame (actually in transform! on SubDataFrame this is what people will want to do anyway), not to the direct source of the view from which it was created. I would discuss here what the best name is for a function that returns that direct source.

Maybe sourcedataframe? Not very short, but explicit, and I believe that it will not be required often. To support this we would have to add new fields to SubDataFrame and DataFrameRow though. So the question is whether we believe there is a high risk that people call parent(dataframerow) and then use rownumber(dataframerow) on it instead of parentindices(dataframerow)[1]?

bkamins on 9 Aug 2020

> actually in transform! on SubDataFrame this is what people will want to do anyway

Why? Imagine a function which calls sdf = filter(f, adf,view=true); transform!(sdf ...) on its input adf::AbstractDataFrame and does something with adf after that. If the code was written in piping style instead, and called parent instead of reusing adf, it would get the wrong object if adf happened to be a SubDataFrame (which is the caller's decision).

> Maybe sourcedataframe? Not very short, but explicit, and I believe that it will not be required often. To support this we would have to add new fields to SubDataFrame and DataFrameRow though. So the question is whether we believe there is a high risk that people call parent(dataframerow) and then use rownumber(dataframerow) on it instead of parentindices(dataframerow)[1]?

I'm not sure the risk is high, I just think we should stop advertising parent because it doesn't do what people probably expect it to do. Regarding the name, I'm not a fan of sourcedataframe as it repeats the type in the function name. But it's hard to find a good name that wouldn't be quite broad...
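A small sketch of the scenario with a function receiving adf: when the caller passes a SubDataFrame, parent of a view created from it skips adf and returns the underlying DataFrame. The data and the names df, adf and sdf are illustrative.

```julia
using DataFrames

# Illustrative data only.
df = DataFrame(id = 1:6)
adf = view(df, 2:5, :)   # the caller decides to pass a SubDataFrame

# A view created inside the function, as in filter(f, adf, view=true).
sdf = filter(:id => isodd, adf, view = true)

println(parent(sdf) === df)    # true: parent unrolls to the true parent
println(parent(sdf) === adf)   # false: the direct source is not recoverable this way
println(parentindices(sdf)[1]) # row indices in df, not in adf
```

Code written in piping style that calls parent(sdf) expecting adf would therefore operate on all six rows of df.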

nalimilan on 10 Aug 2020

Indeed in the piping workflow for view it might be good to be able to get a direct source.

Maybe let us call it viewsource? It is general, but does not seem likely to be used elsewhere? Also it would apply both to DataFrameRow and SubDataFrame (which are both technically views).

Finally - for completeness - the question is if we see any use of an equivalent of parentindices but related to source, which could be called e.g. sourceindices. I would prefer not to store it, as it would be expensive to compute in some cases, but let us think, if it would be of any use?

bkamins on 10 Aug 2020

> Indeed in the piping workflow for view it might be good to be able to get a direct source.
>
> Maybe let us call it viewsource? It is general, but does not seem likely to be used elsewhere? Also it would apply both to DataFrameRow and SubDataFrame (which are both technically views).

It would be useful to see in which cases this function is most likely to be used, to ensure that the name is intuitive in that context.

> Finally - for completeness - the question is if we see any use of an equivalent of parentindices but related to source, which could be called e.g. sourceindices. I would prefer not to store it, as it would be expensive to compute in some cases, but let us think, if it would be of any use?

Ah, right. It's annoying because we can't compute it after the fact like with GroupedDataFrame fields. Hopefully it's not needed in practice...

nalimilan on 11 Aug 2020

> this function is most likely to be used

the only real use is in piping after e.g. transform I think.

> can't compute it after the fact

we can compute it dynamically each time on request if needed (but it will be expensive also)

bkamins on 11 Aug 2020

> we can compute it dynamically each time on request if needed (but it will be expensive also)

Ah, OK, so we could compute it the first time it's used and store the result. Anyway we don't need to implement that, it's just theoretically possible.

nalimilan on 11 Aug 2020

Yes we could, as assuming we store the source we have two mappings:

source to parent
this object to parent

and thus it is theoretically possible to compute the mapping between this object and source using this :).
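A rough sketch of that composition, under stated assumptions: sourceindices does not exist, the source rows are assumed to be unique, and the source itself is assumed to be known (here it is simply kept in the variable src, since views do not store it). Only parentindices is an existing function.

```julia
using DataFrames

# Illustrative data only.
df = DataFrame(id = 1:6)
src = view(df, [2, 4, 6], :)   # the direct source
obj = view(src, [3, 1], :)     # "this object", created from src

src_rows = parentindices(src)[1]   # mapping: source to parent
obj_rows = parentindices(obj)[1]   # mapping: this object to parent

# Hypothetical "sourceindices": for each row of obj, its position in src.
# A linear search per row, which is why this is expensive in general.
source_rows = [findfirst(==(i), src_rows) for i in obj_rows]
println(source_rows)  # [3, 1]
```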

bkamins on 11 Aug 2020

But I would not store it, as it would mean views would have to become mutable structs, and we do not want this.

bkamins on 11 Aug 2020

When #2356 is merged this issue will be closed. Let us keep discussing adding viewsource and sourceindices functions in #2371 as this is a separate issue.

bkamins on 16 Aug 2020
