DataFrames.jl: Add shuffle, shuffle! functions

Created on 9 Dec 2019 · 12 Comments · Source: JuliaData/DataFrames.jl

Hi,

It would be helpful to have shuffle and shuffle! functions in DataFrames. They are useful for randomizing machine learning mini-batches.

What do you think?

non-breaking

All 12 comments

Now you can do shuffle via df[shuffle(axes(df, 1)), :] but I agree we could add it.

@nalimilan - given we have settled to treat a DataFrame as a collection of rows I think it is OK to add it. If you agree, then I can make a PR.

Thanks, I didn't know about df[shuffle(axes(df, 1)), :]. I will start using that in the meantime.

A bit less efficient (but more aesthetic) way to do it is DataFrame(shuffle(eachrow(df))).
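The indexing idiom suggested above can be tried on a small table. This is only a minimal sketch: the table contents and the seeded MersenneTwister are assumptions, added so that the run is reproducible.

```julia
using DataFrames, Random

# Toy data; the column names are illustrative only.
df = DataFrame(id = 1:5, x = [10, 20, 30, 40, 50])

# Seeded generator so the same permutation is drawn on every run.
rng = MersenneTwister(42)

# Shuffle the row indices, then index with them to build a new, reordered DataFrame.
perm = shuffle(rng, axes(df, 1))
shuffled = df[perm, :]

# df itself is untouched; shuffled holds the same rows in a new order.
@assert sort(shuffled.id) == df.id
@assert shuffled.x == df.x[perm]
```

Because df[perm, :] allocates new columns, this is a shuffle that returns a copy rather than an in-place shuffle!.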

Maybe also consider offering column shuffling?

shuffle(;cols=false)

shuffle!(;cols=false)

We treat DataFrame as row oriented, so I would not implement column shuffling directly, rather this:

select(df, randperm(ncol(df)))

or this:

df[:, randperm(ncol(df))]

should be used.
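Both column-permutation forms mentioned here can be checked side by side. The sketch below uses a made-up three-column table and a seeded generator (both assumptions) and verifies that the two forms agree.

```julia
using DataFrames, Random

# Toy table with three columns; names are illustrative only.
df = DataFrame(a = 1:3, b = 4:6, c = 7:9)

# One random ordering of the column positions 1..ncol(df).
perm = randperm(MersenneTwister(1), ncol(df))

# Reorder the columns with select and with plain indexing.
by_select = select(df, perm)
by_index = df[:, perm]

# Both results carry the same columns in the same permuted order.
@assert names(by_select) == names(df)[perm]
@assert by_select == by_index
```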

Reminds me of a similar discussion about sample. Maybe better leave this for post-1.0.

Shuffling columns doesn't sound too common, does it?

Also, another pattern that can be used to shuffle rows is df[randperm(nrow(df)), :].

An in-place operation is more challenging and will require a careful design.
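To illustrate why an in-place version needs care, here is a rough sketch of an in-place row shuffle. The helper name shuffle_rows! is hypothetical and not part of DataFrames.jl. The sketch assumes every column is an ordinary mutable vector, and it only guards against the simplest form of aliasing (the same vector object stored under two column names); views that share memory are not handled.

```julia
using DataFrames, Random

# Hypothetical helper for illustration only; not a DataFrames.jl function.
function shuffle_rows!(rng::AbstractRNG, df::DataFrame)
    # One permutation shared by all columns keeps the rows intact.
    perm = randperm(rng, nrow(df))
    seen = IdDict{Any,Bool}()
    for col in eachcol(df)
        # Skip a vector that was already permuted under another column name,
        # otherwise it would be permuted twice.
        haskey(seen, col) && continue
        seen[col] = true
        permute!(col, perm)
    end
    return df
end

df = DataFrame(id = 1:4, label = ["a", "b", "c", "d"])
before = copy(df)
shuffle_rows!(MersenneTwister(3), df)

# Same rows as before, and each id is still paired with its original label.
@assert sort(df.id) == before.id
@assert all(before.label[i] == l for (i, l) in zip(df.id, df.label))
```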

OK, leaving this decision until after 1.0 (mostly because it is easy to do this without a dedicated function).

I haven't seen many column permutation examples, though I use it in my work. Appreciate the pointer on how to do it. When I'm deep in a language it is obvious. In this case I'm in multiple languages and frameworks and looking for convenience functions.

Sure. I guess the point of @nalimilan is that we want to move towards 1.0 pretty soon.

In general, since we now treat a DataFrame as a collection of rows, I would be OK with adding shuffle and sample to it. But @nalimilan is a kind of "ecosystem curator" (as it has to be consistent), so I prefer to delegate the final word to him 😄.

I'd like to add a use case that is common in my work, for grouped dataframes. I want to shuffle the groups, which in my case consist of groups of items with time series of transactions. Then I want to take the first N groups after the shuffle (i.e. randomly select N groups).

Maybe there is a similarly simple way to shuffle the grouped df?

The following process demonstrates the steps I'm currently taking:

using DataFrames

df = DataFrame(time = [1, 2, 1, 2, 1, 2],
               amt = [19.00, 11.00, 35.50, 32.50, 5.99, 5.99],
               item = ["B001", "B001", "B020", "B020", "BX00", "BX00"])

6×3 DataFrame
│ Row │ time  │ amt     │ item   │
│     │ Int64 │ Float64 │ String │
├─────┼───────┼─────────┼────────┤
│ 1   │ 1     │ 19.0    │ B001   │
│ 2   │ 2     │ 11.0    │ B001   │
│ 3   │ 1     │ 35.5    │ B020   │
│ 4   │ 2     │ 32.5    │ B020   │
│ 5   │ 1     │ 5.99    │ BX00   │
│ 6   │ 2     │ 5.99    │ BX00   │

using StatsBase, Pipe
# Draw one random number per group, sort by it, then rank the groups in that order
res = @pipe df |> groupby(_, :item) |>
         combine(_, :time, :amt, :item, :item => (x -> rand()) => :rando) |>
         sort(_, :rando) |>
         transform(_, :rando => denserank => :rnk_rnd)

6×5 DataFrame
│ Row │ item   │ time  │ amt     │ rando    │ rnk_rnd │
│     │ String │ Int64 │ Float64 │ Float64  │ Int64   │
├─────┼────────┼───────┼─────────┼──────────┼─────────┤
│ 1   │ BX00   │ 1     │ 5.99    │ 0.241881 │ 1       │
│ 2   │ BX00   │ 2     │ 5.99    │ 0.241881 │ 1       │
│ 3   │ B001   │ 1     │ 19.0    │ 0.292468 │ 2       │
│ 4   │ B001   │ 2     │ 11.0    │ 0.292468 │ 2       │
│ 5   │ B020   │ 1     │ 35.5    │ 0.70816  │ 3       │
│ 6   │ B020   │ 2     │ 32.5    │ 0.70816  │ 3       │

# I only want the original columns
@pipe filter(:rnk_rnd => <=(2), res) |>
         select(_, :item, :time, :amt)

4×3 DataFrame
│ Row │ item   │ time  │ amt     │
│     │ String │ Int64 │ Float64 │
├─────┼────────┼───────┼─────────┤
│ 1   │ BX00   │ 1     │ 5.99    │
│ 2   │ BX00   │ 2     │ 5.99    │
│ 3   │ B020   │ 1     │ 35.5    │
│ 4   │ B020   │ 2     │ 32.5    │

Got it:

using Random  # provides shuffle

# take the first 2 shuffled groups
@pipe df |> groupby(_, :item) |>
    _[shuffle(1:end)] |>
    combine(_[1:2], :)

4×3 DataFrame
│ Row │ item   │ time  │ amt     │
│     │ String │ Int64 │ Float64 │
├─────┼────────┼───────┼─────────┤
│ 1   │ BX00   │ 1     │ 5.99    │
│ 2   │ BX00   │ 2     │ 5.99    │
│ 3   │ B001   │ 1     │ 19.0    │
│ 4   │ B001   │ 2     │ 11.0    │

I guess I'll put it up on Stack Overflow.
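For reference, the same "pick N random groups" idea can be written without Pipe. This is a sketch under assumptions: the seed and the value n = 2 are chosen only for the example, and DataFrame(gd[...]) is used to stack the selected groups back into one table.

```julia
using DataFrames, Random

df = DataFrame(time = [1, 2, 1, 2, 1, 2],
               amt = [19.00, 11.00, 35.50, 32.50, 5.99, 5.99],
               item = ["B001", "B001", "B020", "B020", "BX00", "BX00"])

gd = groupby(df, :item)
n = 2  # number of groups to keep (assumption for this example)

# Random order of the group indices; the first n entries are the picked groups.
rng = MersenneTwister(7)
picked = randperm(rng, length(gd))[1:n]

# Indexing a GroupedDataFrame with integers keeps those groups in that order;
# DataFrame(...) then concatenates the selected groups into a single table.
chosen = DataFrame(gd[picked])

# Two groups of two rows each, all taken from the original items.
@assert nrow(chosen) == 4
@assert length(unique(chosen.item)) == n
@assert issubset(chosen.item, df.item)
```

Rows inside each selected group keep their original order; only the choice and order of groups is random.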

Adding this and sample is planned, but only after the 0.22 release, as it is non-breaking.
