DataFrames.jl: public API for accessing row number of `DataFrameRow`

Created on 22 Jul 2020 · 36 Comments · Source: JuliaData/DataFrames.jl

Internally this is called row, but this symbol is probably far too common to export. rownumber seems like one possibility. I feel like it should probably be something from Base, but I'm not sure what; there isn't already an indexof or anything quite like that, I think.

feature

All 36 comments

One corner case I just thought of for this:

julia> df = DataFrame(a = [1, 1, 2, 2], b = rand(4));

julia> gd = groupby(df, "a");

julia> t = gd[2][1,:];

julia> getfield(t, :row)
3

The corner case is that currently row always points at the parent DataFrame, not at the row of the SubDataFrame. This is for performance reasons (no intermediate index calculation is required). DataFrameRow is not even aware of the existence of SubDataFrame. Since we have not exposed row in the past, this was not a problem. However, if we were to add this functionality, maybe this should be changed.

Note that the behavior of unrolling to the true parent is the same as in Base:

julia> x = rand(3,4)
3×4 Array{Float64,2}:
 0.50024   0.728962   0.0994949  0.989262
 0.644985  0.185798   0.695228   0.260861
julia> x = reshape(1:16, 4, 4)
4×4 reshape(::UnitRange{Int64}, 4, 4) with eltype Int64:
 1  5   9  13
 2  6  10  14
 3  7  11  15
 4  8  12  16

julia> y = view(x, 2:3, 2:3)
2×2 view(reshape(::UnitRange{Int64}, 4, 4), 2:3, 2:3) with eltype Int64:
 6  10
 7  11

julia> parent(y)
4×4 reshape(::UnitRange{Int64}, 4, 4) with eltype Int64:
 1  5   9  13
 2  6  10  14
 3  7  11  15
 4  8  12  16

julia> parentindices(y)
(2:3, 2:3)

julia> z = view(y, 2:2, 2:2)
1×1 view(reshape(::UnitRange{Int64}, 4, 4), 3:3, 3:3) with eltype Int64:
 11

julia> parent(z)
4×4 reshape(::UnitRange{Int64}, 4, 4) with eltype Int64:
 1  5   9  13
 2  6  10  14
 3  7  11  15
 4  8  12  16

julia> parentindices(z)
(3:3, 3:3)

This also reminded me that you can use parentindices to get row and column indices from the parent:

julia> df = DataFrame(x)
4×4 DataFrame
│ Row │ x1    │ x2    │ x3    │ x4    │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 5     │ 9     │ 13    │
│ 2   │ 2     │ 6     │ 10    │ 14    │
│ 3   │ 3     │ 7     │ 11    │ 15    │
│ 4   │ 4     │ 8     │ 12    │ 16    │

julia> dfy = view(df, 2:3, 2:3)
2×2 SubDataFrame
│ Row │ x2    │ x3    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 6     │ 10    │
│ 2   │ 7     │ 11    │

julia> parent(dfy)
4×4 DataFrame
│ Row │ x1    │ x2    │ x3    │ x4    │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 5     │ 9     │ 13    │
│ 2   │ 2     │ 6     │ 10    │ 14    │
│ 3   │ 3     │ 7     │ 11    │ 15    │
│ 4   │ 4     │ 8     │ 12    │ 16    │

julia> parentindices(dfy)
(2:3, 2:3)

julia> r = dfy[2, :]
DataFrameRow
│ Row │ x2    │ x3    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 3   │ 7     │ 11    │

julia> parent(r)
4×4 DataFrame
│ Row │ x1    │ x2    │ x3    │ x4    │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 5     │ 9     │ 13    │
│ 2   │ 2     │ 6     │ 10    │ 14    │
│ 3   │ 3     │ 7     │ 11    │ 15    │
│ 4   │ 4     │ 8     │ 12    │ 16    │

julia> parentindices(r)
(3, 2:3)

So in summary - the answer to your question is to use parentindices and take its first element, which is the row index.

However, as in Base, you get a row index in the source DataFrame, which is probably not what you wanted - is this correct?

(if yes - then please comment and we can think of changing the design)

For concreteness, this is the current behavior when filtering a SubDataFrame by row:

julia> df = DataFrame(a = [1, 1, 2, 2], b = rand(4));

julia> gd = groupby(df, :a);

julia> combine(gd) do sdf
           filter(r -> getfield(r, :row) == 1, sdf)
       end
1×2 DataFrame
│ Row │ a     │ b       │
│     │ Int64 │ Float64 │
├─────┼───────┼─────────┤
│ 1   │ 1     │ 0.45063 │

This is expected now (apart from the fact that field access is not part of the public API and rather parentindices(r)[1] should be used). My question is - do we want to change it?
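As a minimal sketch of the public-API spelling mentioned above, the same filter can be written with parentindices(r)[1] instead of the field access. The values in column b are made up for illustration (the original used rand(4)), and the sketch assumes a DataFrames.jl version in which combine over a GroupedDataFrame accepts a do-block as in the session above.

```julia
using DataFrames

# Illustrative data; fixed values instead of rand(4) so the result is reproducible.
df = DataFrame(a = [1, 1, 2, 2], b = [0.1, 0.2, 0.3, 0.4])
gd = groupby(df, :a)

# parentindices(r)[1] is the row index in the underlying DataFrame (df),
# not the position of the row inside the group's SubDataFrame.
res = combine(gd) do sdf
    filter(r -> parentindices(r)[1] == 1, sdf)
end
```

Only the row whose index in df is 1 passes the predicate, so the second group contributes nothing, which matches the one-row result shown above.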

row_number is used in SQL and dplyr

So the approach would be to add another field in DataFrameRow that would remember this value and add an exported method rownumber that would return it. Do you think we should also store the reference to the direct parent of DataFrameRow (rather than the true parent - as we currently do)?
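To make the proposal concrete, here is a toy sketch of the idea of storing one extra integer in the row object. ToyTable, ToyView, ToyRow, toyrow and parentrow are hypothetical names invented for this illustration; this is not the DataFrames.jl implementation, only the shape of the design being discussed.

```julia
# Hypothetical toy types, standing in for DataFrame, SubDataFrame and DataFrameRow.
struct ToyTable
    col::Vector{Int}
end

struct ToyView
    parent::ToyTable
    rows::Vector{Int}   # indices into the true parent
end

struct ToyRow
    parent::ToyTable    # always the true parent, never a view
    row::Int            # index in the true parent (what is stored today)
    rownumber::Int      # index in the object the row was taken from (the new field)
end

# Taking a row from the table itself: both indices coincide.
toyrow(t::ToyTable, i::Int) = ToyRow(t, i, i)
# Taking a row from a view: remember both the parent index and the view-local index.
toyrow(v::ToyView, i::Int) = ToyRow(v.parent, v.rows[i], i)

rownumber(r::ToyRow) = r.rownumber
parentrow(r::ToyRow) = r.row

t = ToyTable([10, 20, 30, 40])
v = ToyView(t, [3, 4])
r = toyrow(v, 1)
```

Here rownumber(r) is the position within v, while parentrow(r) still points into t, so no index calculation is needed on access.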

As long as we don't provide any public function to access the direct parent, storing it doesn't make sense. So the question is really: should parent(row) return the direct parent? That sounds logical, though it adds yet another field. But maybe it doesn't really matter for performance since DataFrameRow is quite slow already due to type instability.

parent(row) return the direct parent

It should not, as it is not consistent with Base. The design that we have now with parent and parentindices is made to match what Base does.

it adds yet another field. But maybe it doesn't really matter for performance since DataFrameRow is quite slow already due to type instability.

This was my thinking and that is why I have asked as we would have to add another field to return rownumber anyway.
Though we would need another function name for this, e.g. rowsource?

Just to add: parent in Base guarantees not to return a view, and this is the invariant we keep.

also I would add rowsource only if we feel it is needed. I can see the need for rownumber, but maybe when someone works with DataFrameRow it is never needed to know rowsource as it is either obvious or not required?
The reason is that rownumber is an Int, whereas rowsource would add a type parameter to DataFrameRow (leading to additional compilation passes in some cases, and we already have 2 parameters there). Of course this is a minor issue, and if you feel rowsource is needed I will add it.

This wouldn't work in transform(df, AsTable(:) => ByRow(row) => "rownum"), right? Since that is a Tables.namedtupleiterator rather than a DataFrameRow

w.r.t rowsource if you are taking a row from a SubDataFrame I can't imagine a use-case where you need to know the row number of the original data frame.

This wouldn't work in transform(df, AsTable(:) => ByRow(row) => "rownum"), right?

Right

I can't imagine a use-case where you need to know the row number of the original data frame.

It is needed if you are writing code that should be fast and want to avoid the indirection of computing the actual row number in the true source. This is exactly how it is used now internally (and I guess this is the use case in Base for view in general).

The question is if you see a usecase of rowsource?

Sorry, I'm not quite sure what rowsource returns. The SubDataFrame?

No, I don't see a use-case for that. I see a use case for rownumber for sure. But the fact that it wouldn't work in piping makes me wonder if it's worth introducing an inconsistency.

rowsource is not present yet. If we introduced it, it would return the direct data frame that the DataFrameRow was created from.

There is no inconsistency. We will have parentindices which work like in Base and parent that also will work like in Base.

Then rownumber and rowsource would be new functions just in DataFrames.jl.

It is not useful in piping, but it is useful e.g. with eachrow (rownumber for sure; rowsource not that much as when you write eachrow(df) you are likely to know df :))
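A small sketch of the eachrow use case, assuming a DataFrames.jl version in which rownumber has been added and exported as proposed in this thread (the data is made up for illustration):

```julia
using DataFrames

df = DataFrame(x = [10, 20, 30, 40])

# Keep the values from odd-numbered rows, using the row number of each DataFrameRow.
odd_x = [r.x for r in eachrow(df) if isodd(rownumber(r))]
```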

okay. My votes are thus no to rowsource, yes to rownumber and maybe yes to some future hack to get it to work for piping.

Yeah, rowsource sounds a little too strange to me. My vote is for rownumber as well.

Agreed, rownumber is the only function which is likely to be useful.

We could also provide a special object to get the row number without creating a DataFrameRow, e.g. transform(gd, RowNumber() => (n -> n ./ length(n)) => :rank).

Yes, then RowNumber() should be in general allowed in things like [:x1, RowNumber()]

Apart from RowNumber(), which we keep track of in https://github.com/JuliaData/DataFrames.jl/issues/2328, is the conclusion here that we just add rownumber? (We should add it before 1.0, as it will change the memory layout of DataFrameRow.)

we should add it before 1.0 as it will change memory layout of DataFrameRow.

I don't think we should treat internal fields as part of the public API. Otherwise our life is going to be quite difficult after 1.0. :-)

Agreed - still I would prefer to break such things as rarely as possible. Anyway - this seems to be a simple PR. Do we want to add rownumber function?

Something I realized is that with this addition, having parent(row) return the original parent doesn't make a lot of sense (apart from ensuring consistency with SubArray). It could be misleading and bug-prone if people call parent(row) and rownumber(row) expecting these to be the same. That's even more dangerous due to the fact that the discrepancy will only appear when using SubDataFrame, which isn't the most commonly tested pattern.
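To illustrate the discrepancy, here is a sketch that assumes rownumber behaves as proposed in this thread (returning the row number in the object the row was taken from) and is exported by the installed DataFrames.jl version. The data is made up.

```julia
using DataFrames

df = DataFrame(a = [1, 1, 2, 2], b = [0.1, 0.2, 0.3, 0.4])
sdf = view(df, 3:4, :)   # SubDataFrame covering rows 3 and 4 of df
r = sdf[1, :]            # DataFrameRow taken from the view

n_direct = rownumber(r)          # position within sdf
n_parent = parentindices(r)[1]   # position within parent(r), i.e. within df
same_parent = parent(r) === df   # parent unrolls to the underlying DataFrame
```

Indexing parent(r) with rownumber(r) would therefore pick a different row than r itself (row 1 of df instead of row 3), and the mismatch only shows up when the row comes from a SubDataFrame.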

Maybe it would be better to accept the (limited) inconsistency with SubArray, and have parent(row) return the direct parent (storing it in an additional field)? SubArray doesn't have the same problem as we do, as it's less frequent to ask what's the index of a value in the direct parent; also there performance matters more than for DataFrameRow, which is slow anyway.

This was my concern originally and that is why:

  • I have added in the PR clear examples showing that this is not the case (a small help but still ...)
  • I have asked above if we want to be able to get a parent of DataFrameRow. I would not change how parent works as it is combined with parentindices (which refers to the DataFrame where data is actually stored). So maybe we should add another function like rowsource (the name I proposed above) that would return a direct parent.

I am not sure what is best (fortunately this feature is not pressing to add, and we can think about the best option).

(incidentally - I would expect parent and parentindices to work exactly as they work now even if I did not know DataFrames.jl - but probably I am in a minority, because probably most users do not think too much about what parent does in Base :smile:)

Let's go with that, but then we should probably avoid advertising parent, as it's only useful for low-level code. If we find ourselves recommending parent (as we did lately to get back the original data frame after applying transform on a SubDataFrame subset of rows), we should introduce a function like rowsource. Though it would be nice to find a more general name that also makes sense for SubDataFrame.

This doesn't have to block adding rownumber though.

So let us move forward with #2356 as it is unaffected.

The comment about parent is extremely relevant, I think. We should be very clear (and add more documentation on it) that parent always goes down to the DataFrame (actually in transform! on SubDataFrame this is what people will want to do anyway) rather than to the direct source from which the view was created. I would discuss what the best name for a function returning the latter is here.

Maybe sourcedataframe? Not very short, but explicit, and I believe that it will not be required often. To support this we would have to add new fields to SubDataFrame and DataFrameRow though. So the question is whether we believe there is a high risk that people call parent(dataframerow) and then try to use rownumber(dataframerow) on it instead of parentindices(dataframerow)[1].

actually in transform! on SubDataFrame this is what people will want to do anyway

Why? Imagine a function which calls sdf = filter(f, adf, view=true); transform!(sdf ...) on its input adf::AbstractDataFrame and does something with adf after that. If the code was written in piping style instead, and called parent instead of reusing adf, it would get the wrong object if adf happened to be a SubDataFrame (which is the caller's decision).
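A minimal sketch of that pitfall, leaving out the transform! step and assuming a DataFrames.jl version in which filter accepts view=true. The function name recover_input is hypothetical.

```julia
using DataFrames

# Hypothetical helper: filters its input as a view, then tries to get the
# input back via parent instead of reusing adf.
function recover_input(adf::AbstractDataFrame)
    sdf = filter(r -> r.x > 1, adf, view=true)
    return parent(sdf)   # the true parent, which is not necessarily adf
end

df = DataFrame(x = 1:4)
adf = view(df, 2:4, :)   # the caller decides to pass a SubDataFrame

from_df = recover_input(df)     # input was a DataFrame
from_view = recover_input(adf)  # input was a SubDataFrame
```

When the input is a DataFrame, parent hands back the input itself. When the input is already a view, parent skips over it and returns df, so code that goes on to work with the result silently operates on more rows than it was given.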

Maybe sourcedataframe? Not very short, but explicit, and I believe that it will not be required often. To support this we would have to add new fields to SubDataFrame and DataFrameRow though. So the question is whether we believe there is a high risk that people call parent(dataframerow) and then try to use rownumber(dataframerow) on it instead of parentindices(dataframerow)[1].

I'm not sure the risk is high; I just think we should stop advertising parent because it doesn't do what people probably expect it to do. Regarding the name, I'm not a fan of sourcedataframe as it repeats the type in the function name. But it's hard to find a good name that wouldn't be quite broad...

Indeed in the piping workflow for view it might be good to be able to get a direct source.

Maybe let us call it viewsource? It is general, but does not seem likely to be used elsewhere? Also it would apply both to DataFrameRow and SubDataFrame (which are both technically views).

Finally - for completeness - the question is if we see any use of an equivalent of parentindices but related to source, which could be called e.g. sourceindices. I would prefer not to store it, as it would be expensive to compute in some cases, but let us think, if it would be of any use?

Indeed in the piping workflow for view it might be good to be able to get a direct source.

Maybe let us call it viewsource? It is general, but does not seem likely to be used elsewhere? Also it would apply both to DataFrameRow and SubDataFrame (which are both technically views).

It would be useful to see in which cases this function is most likely to be used, to ensure that the name is intuitive in that context.

Finally - for completeness - the question is if we see any use of an equivalent of parentindices but related to source, which could be called e.g. sourceindices. I would prefer not to store it, as it would be expensive to compute in some cases, but let us think, if it would be of any use?

Ah, right. It's annoying because we can't compute it after the fact like with GroupedDataFrame fields. Hopefully it's not needed in practice...

this function is most likely to be used

the only real use is in piping after e.g. transform I think.

can't compute it after the fact

we can compute it dynamically each time on request if needed (but it will be expensive also)

we can compute it dynamically each time on request if needed (but it will be expensive also)

Ah, OK, so we could compute it the first time it's used and store the result. Anyway we don't need to implement that, it's just theoretically possible.

Yes we could, as assuming we store the source we have two mappings:

  1. source to parent
  2. this object to parent

and thus it is theoretically possible to compute the mapping between this object and source using this :).
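A rough sketch of how the two mappings could be combined, using plain index vectors. The function name sourceindices_sketch is hypothetical, and the sketch assumes the source selects each parent row at most once (with duplicates the answer would be ambiguous).

```julia
# source_to_parent[i] is the parent row that row i of the source refers to;
# object_to_parent[j] is the parent row that row j of this object refers to.
function sourceindices_sketch(source_to_parent::AbstractVector{Int},
                              object_to_parent::AbstractVector{Int})
    # Invert the source mapping: parent row => position in the source.
    inverse = Dict(p => i for (i, p) in enumerate(source_to_parent))
    # Look up each of this object's parent rows in the inverted mapping.
    return [inverse[p] for p in object_to_parent]
end

source_to_parent = [2, 3, 5]   # the source is a view of parent rows 2, 3 and 5
object_to_parent = [5, 2]      # this object is a view of parent rows 5 and 2
object_to_source = sourceindices_sketch(source_to_parent, object_to_parent)
```

Building the inverse mapping touches every row of the source, which is the cost referred to above when computing this on request.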

But I would not store it, as it would mean views would have to become mutable structs, and we do not want this.

When #2356 is merged this issue will be closed. Let us keep discussing adding viewsource and sourceindices functions in #2371 as this is a separate issue.
