It would be great to be able to use filter with a view setting, so that I can cheaply extract a smaller df from an existing, larger df using some boolean test.
filter(boolean_test_per_row, df, view = true)
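For comparison, a similar effect can be had by building the boolean mask by hand and passing it to view. This is only a minimal sketch of a workaround, not the proposed filter API; the table d matches the one used elsewhere in this thread, and the test b > 2 is an arbitrary illustration.

```julia
using DataFrames

d = DataFrame(a = [1, 1, 2, 2, 3, 3], b = 1:6, c = rand(6))

# Build a Bool mask with one entry per row (illustrative test: b > 2).
mask = d.b .> 2

# view returns a SubDataFrame that refers to the rows of d instead of copying them.
sub = view(d, mask, :)

# Writes through the view are visible in the parent table.
sub.c[1] = -1.0
```

Here sub contains the four rows with b > 2, and the assignment to sub.c[1] changes d.c[3], since both refer to the same underlying column.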
Likewise, one could also split a df by grouping:
d = DataFrame(a = [1,1,2,2,3,3], b = 1:6, c = rand(6))
gd = groupby(d, :a)
group_1 = gd[(a = 1,)]
group_1.b # works!
But what if the grouping results in multiple groups? How can we refer to and act on them?
groups = gd[[(a = 1,), (a = 2,)]]
groups.b # error
@yakir12 - can you please comment on the functionality for GroupedDataFrames you have requested on Zoom here also? Thank you!
Sorry, I might have given you the wrong impression, the example with the grouping is what I meant. The main idea is that there are two main mechanisms to extract a df from a df: filter or groupby. Making it work nicely for both ways is what this issue is about.
OK - so what we give with groupby is enough and you want to improve filter right?
I think it is a good addition, as filter accepts any function and groupby requires grouping column to be already present in a processed form.
But groupby only works nicely as long as you index into a single group. That makes sense, because the result is one group. When indexing into multiple groups, it doesn't "work" as nicely:
d = DataFrame(a = [1,1,2,2,3,3], b = 1:6, c = rand(6))
gd = groupby(d, :a)
groups = gd[[(a = 1,), (a = 2,)]]
groups.b # error
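Two possible ways around this are sketched below, assuming the same d as above. The first iterates over the selected groups, each of which is a SubDataFrame; the second skips groupby and takes a single view covering both key values. The names b_per_group, b_all and both are made up for this illustration.

```julia
using DataFrames

d = DataFrame(a = [1, 1, 2, 2, 3, 3], b = 1:6, c = rand(6))
gd = groupby(d, :a)
groups = gd[[(a = 1,), (a = 2,)]]   # still a GroupedDataFrame, so groups.b fails

# Option 1: iterate over the selected groups and collect column b from each.
b_per_group = [g.b for g in groups]
b_all = reduce(vcat, b_per_group)

# Option 2: one SubDataFrame covering both key values, without groupby.
both = view(d, in.(d.a, Ref([1, 2])), :)
both.b   # column access works, because both is a single SubDataFrame
```

Option 2 gives a single table-like object, which is closer to what filter with a view option would return; option 1 keeps the per-group structure.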
How I see it (as a user) is that filter applies per row, the result of which is always a dataframe. Even when this df contains only one row:
julia> filter(row -> row.b == 1, d)
1×3 DataFrame
│ Row │ a     │ b     │ c        │
│     │ Int64 │ Int64 │ Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 1     │ 1     │ 0.256398 │
Indexing into a groupby could result in a SubDataFrame or a GroupedDataFrame, depending on whether the indexing hit more than one group (this reminds me of the whole debate about taking vectors seriously...).
While filter copies, and that's something I understand you'd like to allow the user to choose (with view), it does make a lot of sense to want to work with a group or a set of groups as a whole (e.g. I want to work only on apples from my fruit table, cause I have an analysis that only makes sense when applied to apples and no other fruit). So rather than:
apples = filter(row -> row.fruit == "apple", df)
I could:
gdf = groupby(df, :fruit)
df2 = gdf[(fruit = "apple",)]
Again, this would work, but not if I happen to want to index into more than just one fruit.
Sorry if I'm just repeating the same nonsense.
I think we need a general mechanism to return views instead of copies. This affects filter but also stack (with the unexported stackview) and probably other functions too. The simplest solution would be to standardize on a view keyword argument as you suggest. Probably other packages have similar needs so it would make sense to standardize.
stack exactly follows the API you describe, it has a view kwarg.
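As a rough illustration of that keyword, here is a small sketch; the table wide and its column names are invented for the example.

```julia
using DataFrames

wide = DataFrame(id = [1, 2], x = [0.1, 0.2], y = [0.3, 0.4])

# Default: the long-format result holds freshly allocated columns.
long_copy = stack(wide, [:x, :y])

# With view = true, the result is backed by the columns of wide rather than by copies.
long_view = stack(wide, [:x, :y]; view = true)
```

Both calls give a table with the same shape; the difference is only in whether the data is copied.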