Now you can do shuffle via df[shuffle(axes(df, 1)), :] but I agree we could add it.
@nalimilan - given we have settled on treating a DataFrame as a collection of rows, I think it is OK to add it. If you agree, then I can make a PR.
Thanks, I didn't know about df[shuffle(axes(df, 1)), :]. I will start using that in the meantime.
A bit less efficient (but more aesthetic) way to do it is DataFrame(shuffle(eachrow(df))).
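For anyone landing here, a minimal sketch of the index-based row shuffle mentioned above. It assumes DataFrames.jl and the Random standard library are loaded; the toy data frame and the seed are illustrative only and not taken from this thread.

```julia
using DataFrames, Random

Random.seed!(1234)  # illustrative seed, only to make repeated runs comparable

df = DataFrame(id = 1:5, x = ["a", "b", "c", "d", "e"])

# Shuffle the row indices, then index the rows with them.
# This returns a new DataFrame and leaves df untouched.
shuffled = df[shuffle(axes(df, 1)), :]

# Equivalent pattern using a random permutation of 1:nrow(df)
shuffled2 = df[randperm(nrow(df)), :]
```

Both results contain exactly the rows of df, only in a different order.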
Maybe also consider offering column shuffling?
shuffle(;cols=false)
shuffle!(;cols=false)
We treat DataFrame as row oriented, so I would not implement column shuffling directly, rather this:
select(df, randperm(ncol(df)))
or this:
df[:, randperm(ncol(df))]
should be used
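A small sketch of the two column-shuffling patterns above, assuming DataFrames.jl and Random are loaded. The toy data frame is illustrative, and the permutation is stored in a variable here only so that both patterns can be compared on the same ordering.

```julia
using DataFrames, Random

df = DataFrame(a = 1:3, b = 4:6, c = 7:9)

perm = randperm(ncol(df))     # random permutation of the column positions

by_select = select(df, perm)  # new DataFrame with the columns reordered
by_index  = df[:, perm]       # same reordering via indexing
```

Either way the rows are unchanged; only the column order differs from df.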
Reminds me of a similar discussion about sample. Maybe better leave this for post-1.0.
Shuffling columns doesn't sound too common, does it?
Also, another pattern that can be used to shuffle rows is df[randperm(nrow(df)), :].
An in-place operation is more challenging and will require a careful design.
OK - leaving this decision for post-1.0 (mostly because it is easy to do this without this function).
I haven't seen many column permutation examples, though I use it in my work. Appreciate the pointer on how to do it. When I'm deep in a language it is obvious. In this case I'm in multiple languages and frameworks and looking for convenience functions.
Sure. I guess the point of @nalimilan is that we want to move towards 1.0 pretty soon.
In general - as we now try to look at a DataFrame as a collection of rows, I would be OK with adding shuffle and sample to it now. But @nalimilan is a kind of "ecosystem curator" (as it has to be consistent), so I prefer to delegate the final word to him.
I'd like to add a use case that is common in my work, for grouped data frames. I want to shuffle the groups, which in my case consist of groups of items with time series of transactions. Then I want to take the first N groups after the shuffle (i.e. randomly select N groups).
Maybe there is a similarly simple way to shuffle the grouped df.
The following process demonstrates the steps I'm currently taking:
df = DataFrame(time = [1, 2, 1, 2, 1, 2]
, amt = [19.00, 11.00, 35.50, 32.50, 5.99, 5.99]
, item = ["B001", "B001", "B020", "B020", "BX00", "BX00"])
6×3 DataFrame
│ Row │ time  │ amt     │ item   │
│     │ Int64 │ Float64 │ String │
├─────┼───────┼─────────┼────────┤
│ 1   │ 1     │ 19.0    │ B001   │
│ 2   │ 2     │ 11.0    │ B001   │
│ 3   │ 1     │ 35.5    │ B020   │
│ 4   │ 2     │ 32.5    │ B020   │
│ 5   │ 1     │ 5.99    │ BX00   │
│ 6   │ 2     │ 5.99    │ BX00   │
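using StatsBase, Pipe  # StatsBase provides denserank
# give every row of a group the same random number, then rank the groups by it
res = @pipe df |> groupby(_, :item) |>
combine(_, :time, :amt, :item, :item => (x -> rand()) => :rando) |>
sort(_, :rando) |>
transform(_, :rando => denserank => :rnk_rnd)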
6×5 DataFrame
│ Row │ item   │ time  │ amt     │ rando    │ rnk_rnd │
│     │ String │ Int64 │ Float64 │ Float64  │ Int64   │
├─────┼────────┼───────┼─────────┼──────────┼─────────┤
│ 1   │ BX00   │ 0     │ 5.99    │ 0.241881 │ 1       │
│ 2   │ BX00   │ 1     │ 5.99    │ 0.241881 │ 1       │
│ 3   │ B001   │ 0     │ 19.0    │ 0.292468 │ 2       │
│ 4   │ B001   │ 1     │ 11.0    │ 0.292468 │ 2       │
│ 5   │ B020   │ 0     │ 35.5    │ 0.70816  │ 3       │
│ 6   │ B020   │ 1     │ 32.5    │ 0.70816  │ 3       │
# I only want the original columns
@pipe filter(:rnk_rnd => <=(2), res) |>
select(_, :item, :time, :amt)
4×3 DataFrame
│ Row │ item   │ time  │ amt     │
│     │ String │ Int64 │ Float64 │
├─────┼────────┼───────┼─────────┤
│ 1   │ BX00   │ 1     │ 5.99    │
│ 2   │ BX00   │ 2     │ 5.99    │
│ 3   │ B020   │ 1     │ 35.5    │
│ 4   │ B020   │ 2     │ 32.5    │
Got it:
# take the first 2 shuffled groups
@pipe df |> groupby(_, :item) |>
_[shuffle(1:end)] |>
combine(_[1:2], :)
4×3 DataFrame
│ Row │ item   │ time  │ amt     │
│     │ String │ Int64 │ Float64 │
├─────┼────────┼───────┼─────────┤
│ 1   │ BX00   │ 0     │ 5.99    │
│ 2   │ BX00   │ 1     │ 5.99    │
│ 3   │ B001   │ 0     │ 19.0    │
│ 4   │ B001   │ 1     │ 11.0    │
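The same idea can be wrapped in a small function. This is only a sketch: sample_groups is a hypothetical helper name, not something DataFrames.jl provides, and it uses randperm on the group positions instead of shuffle(1:end).

```julia
using DataFrames, Random

# Hypothetical helper (not part of DataFrames.jl): keep n randomly chosen groups
function sample_groups(df::AbstractDataFrame, key, n::Integer)
    gd = groupby(df, key)
    k = min(n, length(gd))            # never ask for more groups than exist
    idx = randperm(length(gd))[1:k]   # k distinct random group positions
    return combine(gd[idx], :)        # stack the selected groups back together
end

df = DataFrame(time = [1, 2, 1, 2, 1, 2],
               amt = [19.00, 11.00, 35.50, 32.50, 5.99, 5.99],
               item = ["B001", "B001", "B020", "B020", "BX00", "BX00"])

picked = sample_groups(df, :item, 2)
```

With the example data, picked holds the four rows of two randomly chosen items; which two items are chosen changes from run to run.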
I guess I'll put it up on Stack Overflow.
Adding this and sample is planned, but after the 0.22 release, as it is non-breaking.