Dataframes.jl: OffsetArray Compatibility

Created on 18 Feb 2020 · 15Comments · Source: JuliaData/DataFrames.jl

with OffsetArrays, constructing a DataFrame has a behavior I did not expect:

julia> DataFrame(timeperiod=OffsetArray([t for t in 0:3], 0:3))
11×1 DataFrame
│ Row │ timeperiod │
│     │ Int64      │
├─────┼────────────┤
│ 1   │ 1          │
│ 2   │ 2          │
│ 3   │ 3          │
│ 4   │ #undef     │

I expected:

│ Row │ timeperiod │
│     │ Int64      │
├─────┼────────────┤
│ 1   │ 0          │
│ 2   │ 1          │
│ 3   │ 2          │
│ 4   │ 3          │

breaking help wanted

Source

alecloudenback

All 15 comments

So, there are two issues here. First, we try to be Tables.jl compliant, and that package specifies in https://juliadata.github.io/Tables.jl/dev/#Tables.columns that:

A retrieved column is an object that is indexable and has a known length, i.e. supports length(col) and col[i] for any i = 1:length(col).

and your vector does not meet this predicate.

Now why the operation you perform does not fail:

upon creation/adding of the column we only check the length of the vector
when trying to display the vector we index into it using numbers 1,2,3 ... so what you see is expected.
DataFrame(timeperiod=OffsetArray([t for t in 0:3], 0:3)).timeperiod returns you the original vector correctly

In summary - I do not think we can change it, but probably we should add a note in the documentation that DataFrames.jl does not support vectors with custom indexing.

CC @quinnj, @nalimilan - maybe you have something more to add.

bkamins on 18 Feb 2020

I think we can fix this in Tables.jl/DataFrames.jl; basically, we're requiring that "column" objects implement the indexing interface; so we just need to make sure we use firstindex/lastindex/eachindex when indexing columns.

quinnj on 18 Feb 2020

👍1

But then df[rows, :col] would produce a different result than df.col[rows] (or equivalently df[!, :col][rows]). That is why I am hesitant to do this. The assumption that both forms are equivalent is made within DataFrames.jl and also probably in user code.

For DataFrames.jl supporting this is doable but it is a very big change (dozens of functions will be potentially affected). If we wanted to go this way anyway first Tables.jl should be updated and released. Then if someone is willing to add the support for this in DataFrames.jl I can help in identifying the places that need changing (a quick list to review: getindex, setindex!, broadcasting, view, SubDataFrame, DataFrameRow, printing functions, reshaping functions, joining, split-apply-combine, sorting, Tables.jl integration)

bkamins on 18 Feb 2020

Yes, handling this correctly will require checking lots of things, and the indexing inconsistency you point will remain.

@alecloudenback Can you describe the use case for using OffsetArrays in a DataFrame?

nalimilan on 18 Feb 2020

I have a model where there's a distinct "time zero" so I was trying to use offset arrays so I could index model results by time 0:n_periods

alecloudenback on 18 Feb 2020

So all columns would have had the same offset? Would you have expected df[i, col] to give the same result as df.col[i]?

nalimilan on 18 Feb 2020

Yes, all columns would be constructed with the same offset. I think I would still expect "row 1" to be the first value in the source array (ie index 0 in the above example). But I just thought that I should be able to construct a dataframe with a set of equal length arrays, even if they didn't start at 1. I think constructing the dataframe and indexing a dataframe from something other than zero are different, though related questions.

alecloudenback on 18 Feb 2020

But I just thought that I should be able to construct a dataframe with a set of equal length arrays, even if they didn't start at 1.

You are able to do it, and it gets constructed correctly.

The only thing that is incorrect is indexing into such a data frame (and in particular show does not handle indexing correctly so it displays the data frame incorrectly, although the data frame itself stores the values correctly)

bkamins on 18 Feb 2020

I guess we could allow what @alecloudenback describes, if somebody is willing to fix the places where 1-based indexing is assumed.

nalimilan on 25 Feb 2020

OK, so the decision is:

We will accept a PR that:

makes sure that if non-standard indexed vectors are in DataFrame they all use the same indexing range
in this case we will map the indices of a DataFrame into indices of the non-standard vectors by their index in eachindex of these vectors

Provided that this change will not lead to a deterioration of the performance of indexing of DataFrames.jl and will produce consistent behavior across the whole package (behavior of view is most important to check here I think).

This change is breaking, but since it is a rarely used functionality we will treat it as a minor breakage and allow it to be incorporated in 1.x releases.

bkamins on 25 Feb 2020

in this case we will map the indices of a DataFrame into indices of the non-standard vectors by their index in eachindex of these vectors

I think what @alecloudenback proposed was that index 1 of the data frame would point to the first value in the vectors, whatever their index. Is that right @alecloudenback?

nalimilan on 25 Feb 2020

👍1

This is exactly what I have written. if you write
df[1, :col]
then it logically gets translated (not directly of course as it would be inefficient) to
df.col[collect(eachindex(df.col))[1]]

The only challenge with this is that eachindex does not guarantee that what it returns is indexable (the contract is that it is iterable only), but we probably can assume we allow only indexable eachindex, in which case:
df.col[eachindex(df.col)[1]]
would be enough.

bkamins on 25 Feb 2020

👍1

I started looking into this for Tables.jl and ran into this awkwardness as well: is it safe to assume you can do eachindex(col)[i]? It almost feels like there should be a toindex(A, i) function that directly translates the requested input index to the actual index of the underlying array. Maybe @timholy or @mbauman could chime in if they're more aware of what can be relied on here for OffsetArray integration?

quinnj on 25 Feb 2020

Hmmm, it seems this approach won't work:

julia> inds = eachindex(OffsetArray([t for t in 0:2], 0:2))
0:2

julia> typeof(inds)
OffsetArrays.IdOffsetRange{Int64,Base.OneTo{Int64}}

julia> inds[1]
1

which makes sense; eachindex returns an iterator of indices for the _OffsetArray_, whereas we're looking for some way to to translate "row number" of a table to the correct ordinal index of the OffsetArray. OffsetArrays has parentindex which does what we want, but it isn't exported or probably a common operation for other types of Arrays that may employ exotic indexing techniques.

quinnj on 26 Feb 2020

Thank you for investigating into this - makes sense. So it seems that some common API would be needed first before we move forward with this PR as in general I assume that parentindex gives an index in parent, which does not have to be 1-based with step 1 either.

bkamins on 26 Feb 2020

Was this page helpful?

0 / 5 - 0 ratings