Currently if you do df[idx, n] .= x you will be doing a broadcasted assignment to a copy of the selected part of the dataframe. See this post on discourse.
It seems to me that this should probably assign to the original dataframe, as most users would probably expect that behavior (I've been using Julia a couple of years now, and it was not immediately obvious to me what was happening here). Apparently this can be done by extending Base.dotview.
If someone decides to take this on, it might also be worth thinking carefully about the broadcast behavior on DataFrames more generally, as I suspect there are a few other "gotchas" lurking.
Yes - in general whole getindex/setindex! infrastructure in DataFrames.jl should be carefully reviewed to sync it with Julia 1.0. I have started it some time ago (https://github.com/JuliaData/DataFrames.jl/pull/1380) but waited till things stabilize. Probably this is a good time to go back to it if someone wants to make a PR (if not I might get back to it but not immediately).
I think it migth call for some breaking changes. As I was saying in the discourse thread, really df[idx, n] .= x should work and df[idx, n] = x should throw an error unless x matches the shape of df[idx, n]. This would be consistent with the behavior of AbstractArray.
Right. That is why I have stopped #1380 in spring to make sure we are consistent. There are also other cases to clean up (e.g. handling of binary indexing). In general this topic is complex and tricky so I did not want to rush.
Yes, this will definitely be a bit tricky. I think we should try to get to it soon though as we don't want to bake in weird custom behavior for DataFrames long after the release of 1.0.
I have started collecting up the target behavior in #1380. Next week I will try to clean it up and propose a consistent upgrade that is:
Another related thing which is annoying/inconsistent with the current behavior is that df[:x] = 1 mutates the column vector while df[:x] = 1:n replaces it. It would be much more logical to have df[:x] .= 1 mutate, and df[:x] = 1 create a new vector (or throw an error). See also https://github.com/JuliaStats/DataFramesMeta.jl/issues/108#issuecomment-419495875.
Another issue here:
julia> df = DataFrame(a = [1], b = [2])
julia> df[1, :] = [5, 6]
ERROR: MethodError: Cannot `convert` an object of type Array{Int64,1} to an object of type Int64
Closest candidates are:
convert(::Type{T<:Number}, ::T<:Number) where T<:Number at number.jl:6
convert(::Type{T<:Number}, ::Number) where T<:Number at number.jl:7
convert(::Type{T<:Integer}, ::Ptr) where T<:Integer at pointer.jl:23
...
Stacktrace:
[1] setindex!(::Array{Int64,1}, ::Array{Int64,1}, ::Int64) at ./array.jl:769
[2] insert_single_entry!(::DataFrame, ::Array{Int64,1}, ::Int64, ::Int64) at /Users/peterdeffebach/.julia/dev/DataFrames/src/dataframe/dataframe.jl:367
[3] setindex! at /Users/peterdeffebach/.julia/dev/DataFrames/src/dataframe/dataframe.jl:469 [inlined]
[4] setindex!(::DataFrame, ::Array{Int64,1}, ::Int64, ::Colon) at /Users/peterdeffebach/.julia/dev/DataFrames/src/dataframe/dataframe.jl:623
[5] top-level scope at none:0
I guess this was a design decision and the assignment you want should be done like this:
df[1, :] = DataFrame([5 6])
because your assignment has the following meaning now:
julia> df = DataFrame(a = Any[1], b = Any[2])
1ร2 DataFrame
โ Row โ a โ b โ
โ โ Any โ Any โ
โโโโโโโผโโโโโโผโโโโโโค
โ 1 โ 1 โ 2 โ
julia> df[1, :] = [5, 6]
2-element Array{Int64,1}:
5
6
julia> df
1ร2 DataFrame
โ Row โ a โ b โ
โ โ Any โ Any โ
โโโโโโโผโโโโโโโโโผโโโโโโโโโค
โ 1 โ [5, 6] โ [5, 6] โ
of course the question is if we want to keep this approach.
And this change would be conflicting with:
julia> df[[1], :] = [5, 6]
ERROR: DimensionMismatch("array could not be broadcast to match destination")
which is expected to work columnwise.
In general - as we now start treating DataFrame as a collection of rows maybe this should be rethinked, but OTOH it will be breaking so I am not sure of the benefit.
I would vote to re-think this while we are re-thinking other set index methods. The automatic broadcasted getindex method producing
โ Row โ a โ b โ
โ โ Any โ Any โ
โโโโโโโผโโโโโโโโโผโโโโโโโโโค
โ 1 โ [5, 6] โ [5, 6] โ
is strange and I would be surprised if many people use it.
Maybe the rule should be that any section of a DataFrame of size (n, m) should be able to be imputed with an array of size (n, m)?
Although df[1, :] = DataFrame([5 6]) isn't too hard to remember. Maybe there are performance issues with an intermediate copy?
Arguably df[1, :] = [5, 6] should be written as df[1, :] .= [5, 6] if the goal is to set all entries to [5, 6]. So we could deprecate this (at least for vectors) so that we df[1, :] = x can set the full row when x is a vector, a tuple or a named tuple.
Maybe the rule should be that any section of a DataFrame of size
(n, m)should be able to be imputed with an array of size(n, m)?
Yeah, why not.
Although
df[1, :] = DataFrame([5 6])isn't too hard to remember. Maybe there are performance issues with an intermediate copy?
Yes, it should be much slower, so that really shouldn't be recommended.
We should probably also allow
df[1, :] = (b=6, a=5)
Which should set the right columns by name
See https://github.com/JuliaData/DataFrames.jl/pull/1646. This is the intention (though now I see we could be more precise in : case that we iterate names).
In general it is a good time, if you have some to spare, to review https://github.com/JuliaData/DataFrames.jl/pull/1646, as this will be the next big thing after "ownership" that we will implement before DataFrames.jl 1.0.
Will do
Broadcasting assignment is done in 0.19, so closing it. Some of the setindex! functionality is now not done in 0.19 due to deprecation (but this is a separate issue and now is specified so we only need to implement it now).
Most helpful comment
I have started collecting up the target behavior in #1380. Next week I will try to clean it up and propose a consistent upgrade that is: