with OffsetArrays, constructing a DataFrame has a behavior I did not expect:
julia> DataFrame(timeperiod=OffsetArray([t for t in 0:3], 0:3))
11ร1 DataFrame
โ Row โ timeperiod โ
โ โ Int64 โ
โโโโโโโผโโโโโโโโโโโโโค
โ 1 โ 1 โ
โ 2 โ 2 โ
โ 3 โ 3 โ
โ 4 โ #undef โ
I expected:
โ Row โ timeperiod โ
โ โ Int64 โ
โโโโโโโผโโโโโโโโโโโโโค
โ 1 โ 0 โ
โ 2 โ 1 โ
โ 3 โ 2 โ
โ 4 โ 3 โ
So, there are two issues here. First, we try to be Tables.jl compliant, and that package specifies in https://juliadata.github.io/Tables.jl/dev/#Tables.columns that:
A retrieved column is an object that is indexable and has a known length, i.e. supports
length(col)andcol[i] for any i = 1:length(col).
and your vector does not meet this predicate.
Now why the operation you perform does not fail:
DataFrame(timeperiod=OffsetArray([t for t in 0:3], 0:3)).timeperiod returns you the original vector correctlyIn summary - I do not think we can change it, but probably we should add a note in the documentation that DataFrames.jl does not support vectors with custom indexing.
CC @quinnj, @nalimilan - maybe you have something more to add.
I think we can fix this in Tables.jl/DataFrames.jl; basically, we're requiring that "column" objects implement the indexing interface; so we just need to make sure we use firstindex/lastindex/eachindex when indexing columns.
But then df[rows, :col] would produce a different result than df.col[rows] (or equivalently df[!, :col][rows]). That is why I am hesitant to do this. The assumption that both forms are equivalent is made within DataFrames.jl and also probably in user code.
For DataFrames.jl supporting this is doable but it is a very big change (dozens of functions will be potentially affected). If we wanted to go this way anyway first Tables.jl should be updated and released. Then if someone is willing to add the support for this in DataFrames.jl I can help in identifying the places that need changing (a quick list to review: getindex, setindex!, broadcasting, view, SubDataFrame, DataFrameRow, printing functions, reshaping functions, joining, split-apply-combine, sorting, Tables.jl integration)
Yes, handling this correctly will require checking lots of things, and the indexing inconsistency you point will remain.
@alecloudenback Can you describe the use case for using OffsetArrays in a DataFrame?
I have a model where there's a distinct "time zero" so I was trying to use offset arrays so I could index model results by time 0:n_periods
So all columns would have had the same offset? Would you have expected df[i, col] to give the same result as df.col[i]?
Yes, all columns would be constructed with the same offset. I think I would still expect "row 1" to be the first value in the source array (ie index 0 in the above example). But I just thought that I should be able to construct a dataframe with a set of equal length arrays, even if they didn't start at 1. I think constructing the dataframe and indexing a dataframe from something other than zero are different, though related questions.
But I just thought that I should be able to construct a dataframe with a set of equal length arrays, even if they didn't start at
1.
You are able to do it, and it gets constructed correctly.
The only thing that is incorrect is indexing into such a data frame (and in particular show does not handle indexing correctly so it displays the data frame incorrectly, although the data frame itself stores the values correctly)
I guess we could allow what @alecloudenback describes, if somebody is willing to fix the places where 1-based indexing is assumed.
OK, so the decision is:
We will accept a PR that:
DataFrame they all use the same indexing rangeDataFrame into indices of the non-standard vectors by their index in eachindex of these vectorsProvided that this change will not lead to a deterioration of the performance of indexing of DataFrames.jl and will produce consistent behavior across the whole package (behavior of view is most important to check here I think).
This change is breaking, but since it is a rarely used functionality we will treat it as a minor breakage and allow it to be incorporated in 1.x releases.
in this case we will map the indices of a
DataFrameinto indices of the non-standard vectors by their index ineachindexof these vectors
I think what @alecloudenback proposed was that index 1 of the data frame would point to the first value in the vectors, whatever their index. Is that right @alecloudenback?
This is exactly what I have written. if you write
df[1, :col]
then it logically gets translated (not directly of course as it would be inefficient) to
df.col[collect(eachindex(df.col))[1]]
The only challenge with this is that eachindex does not guarantee that what it returns is indexable (the contract is that it is iterable only), but we probably can assume we allow only indexable eachindex, in which case:
df.col[eachindex(df.col)[1]]
would be enough.
I started looking into this for Tables.jl and ran into this awkwardness as well: is it safe to assume you can do eachindex(col)[i]? It almost feels like there should be a toindex(A, i) function that directly translates the requested input index to the actual index of the underlying array. Maybe @timholy or @mbauman could chime in if they're more aware of what can be relied on here for OffsetArray integration?
Hmmm, it seems this approach won't work:
julia> inds = eachindex(OffsetArray([t for t in 0:2], 0:2))
0:2
julia> typeof(inds)
OffsetArrays.IdOffsetRange{Int64,Base.OneTo{Int64}}
julia> inds[1]
1
which makes sense; eachindex returns an iterator of indices for the _OffsetArray_, whereas we're looking for some way to to translate "row number" of a table to the correct ordinal index of the OffsetArray. OffsetArrays has parentindex which does what we want, but it isn't exported or probably a common operation for other types of Arrays that may employ exotic indexing techniques.
Thank you for investigating into this - makes sense. So it seems that some common API would be needed first before we move forward with this PR as in general I assume that parentindex gives an index in parent, which does not have to be 1-based with step 1 either.