DataFrames.jl: split a df into parts

Created on 8 May 2020 · 6 Comments · Source: JuliaData/DataFrames.jl

It would be great to be able to use filter with a view option, so that I can cheaply extract a smaller df from an existing, larger df using some Boolean test.

filter(boolean_test_per_row, df, view = true)
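As a rough sketch of what such a call could look like: the Boolean-mask `view` below does not depend on any keyword argument, while the second call assumes a DataFrames.jl release in which `filter` accepts `view = true` (later releases do, but this was not available when the issue was opened). The table and column names are made up for illustration.

```julia
using DataFrames

# Hypothetical example table.
df = DataFrame(a = [1, 1, 2, 2, 3, 3], b = collect(1:6))

# Build a Boolean mask and take a view: no rows are copied.
mask = df.b .> 3
sub = view(df, mask, :)          # a SubDataFrame backed by df

# Writing through the view changes the parent table.
sub[1, :b] = 40

# Assumption: a DataFrames.jl version whose filter supports view = true.
sub2 = filter(row -> row.b > 3, df; view = true)
```

Because `sub` shares memory with `df`, the assignment above is visible in `df` as well, which is the main thing to keep in mind when choosing a view over a copy.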

Likewise, one could also split a df by grouping:

d = DataFrame(a = [1,1,2,2,3,3], b = 1:6, c = rand(6))
gd = groupby(d, :a)
group_1 = gd[(a = 1,)]
group_1.b # works!

But what if we select multiple groups? How can we refer to and act on them?

groups = gd[[(a = 1,), (a = 2,)]]
groups.b

Labels: feature, non-breaking

All 6 comments

@yakir12 - can you please also describe here the GroupedDataFrame functionality you requested on Zoom? Thank you!

Sorry, I might have given you the wrong impression: the example with the grouping is what I meant. The main idea is that there are two main mechanisms for extracting a df from a df: filter and groupby. This issue is about making both work nicely.

OK, so what we provide with groupby is enough and you want to improve filter, right?

I think it is a good addition, as filter accepts any function, whereas groupby requires the grouping column to already be present in processed form.

But groupby only works nicely as long as you index into a single group. That makes sense, because the result is one group. When indexing into multiple groups, it doesn't "work" as nicely:

d = DataFrame(a = [1,1,2,2,3,3], b = 1:6, c = rand(6))
gd = groupby(d, :a)
groups = gd[[(a = 1,), (a = 2,)]]
groups.b # error
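One possible workaround, sketched under the assumption of a DataFrames.jl version that provides the `DataFrame(::GroupedDataFrame)` constructor: either copy the selected groups into a single table, or build a row mask on the parent and take a view. The `in.(...)` mask is an illustrative choice, not something proposed in this thread.

```julia
using DataFrames

d = DataFrame(a = [1, 1, 2, 2, 3, 3], b = 1:6, c = rand(6))
gd = groupby(d, :a)

# Indexing with several keys yields another GroupedDataFrame,
# so there is no single column `b` to access directly.
groups = gd[[(a = 1,), (a = 2,)]]

# Option 1 (copies): concatenate the selected groups into one DataFrame.
merged = DataFrame(groups)

# Option 2 (no copy): mark the rows whose key is wanted and view them.
wanted = view(d, in.(d.a, Ref([1, 2])), :)
```

Both `merged.b` and `wanted.b` then hold the `b` values of groups 1 and 2; only `wanted` stays linked to `d`.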

How I see it (as a user) is that filter applies per row, and the result is always a data frame, even when it contains only one row:

julia> filter(row -> row.b == 1, d)
1×3 DataFrame
│ Row │ a     │ b     │ c        │
│     │ Int64 │ Int64 │ Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 1     │ 1     │ 0.256398 │

Indexing into a groupby, by contrast, could result in a SubDataFrame or a GroupedDataFrame, depending on whether the indexing hits more than one group (this reminds me of the whole debate about taking vectors seriously...).
filter copies, and I understand you'd like to let the user choose otherwise (with view). Still, it makes a lot of sense to want to work with a group or a set of groups as a whole (e.g. I want to work only on the apples in my fruit table, because my analysis only makes sense for apples and no other fruit). So rather than:

apples = filter(row -> row.fruit == "apple", df)

I could:

gdf = groupby(df, :fruit)
df2 = gdf[(fruit = "apple",)]

Again, this would work, but not if I happen to want to index into more than just one fruit.

Sorry if I'm just repeating the same nonsense.

I think we need a general mechanism to return views instead of copies. This affects filter but also stack (with the unexported stackview) and probably other functions too. The simplest solution would be to standardize on a view keyword argument as you suggest. Probably other packages have similar needs so it would make sense to standardize.

stack exactly follows the API you describe, it has a view kwarg.
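For reference, a small sketch of the `view` keyword on `stack` mentioned above; the table and its column names are hypothetical.

```julia
using DataFrames

# Hypothetical wide table with two measurement columns.
wide = DataFrame(id = 1:3, x = [1.0, 2.0, 3.0], y = [4.0, 5.0, 6.0])

# Default: the long table owns freshly allocated columns.
long_copy = stack(wide, [:x, :y])

# With view = true the long table is backed by the columns of `wide`.
long_view = stack(wide, [:x, :y]; view = true)
```

Both results have the same shape and contents; they differ only in whether the data is copied.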
