This has been kicking around in my head for a while, but a SubDataFrame should allow creation of new columns.
Current behavior is:
julia> df = DataFrame(
id = rand(["a", "b", "c", "d"], 100),
a = rand(100),
b = 100 * randn(100),
c = rand([100, 200, 300], 100)
);
julia> sdf = view(df, 1:50, :);
julia> sdf[:, :d] = rand(nrow(sdf))
ERROR: ArgumentError: column name :d not found in the data frame; existing most similar names are: :a, :b, :c and :id
I propose we add a convenience function if, to enable workflow as follows (with magittr pipes for convenience)
df %>%
if(:a => ByRow(!missing)) %>%
transform(:b => (t -> t .- mean(x)) =`> :d)
The transform function only knows about rows where :a is not missing. This replicates the behavior of Stata's if syntax.
I do not know Stata well enough. Can you show a minimal example of expected input and output please? I understand that rows that do not meet the predicate are filled with missing or with some definable sentinel value?
I was thinking about it today. If we added it would you also request this to be added to DataFrameRow and both for setindex! and broadcasting? All this is technically doable but I would be reluctant to add this functionality without a strong reason.
Therefore it would be great if you gave some use-cases justifying the need, and in particular showing that just using broadcasted ifelse or a comprehension to create that column would not be enough.
I was thinking about it today. If we added it would you also request this to be added to
DataFrameRowand both forsetindex!and broadcasting?
I would not advocate it for DataFrameRow but I would like it for setindex!.
In Stata, it is very common to add if clauses to the ends of commands such that you only modify a subset of rows.
gen income_low = income if filled_out_income_form & income < 10
egen income_demeaned_low = income - mean(income) if !missing(form_id)
replace income = 100 if income > 100 // only replace a subset of observations
The first command will copy over values in column income to income_low, but only for observations that satisfy these conditions. missing is filled in elsewhere. The second command's mean function only applies to observations that are not missing form_id and fills in missing values similarly.
This is a great syntax. Stata users love it. It reads very nicely like a sentence. "Generate this variable if it satisfies these conditions"
Not only is this feature useful for end users, it's very easy to implement for programmers. When writing a stata program, the opening syntax works like this
program define myfun
syntax varlist [if], option(string)
marksample touse // "to use"
gen t = 1 if `touse'
end
The command marksample touse creates a variable touse which is equal to 1 if the if conditions in the function call are satisfied and 0 otherwise. Now the writer of the function can just add if qualifiers to everything they do and not worry about problems with other observations. But no rows are lost because an actual subset is never performed.
Why this is good syntax:
transform!, etc. exclusively means it is easier to work with large datasets where you don't know which observations and values you will need later on. Here is a very rough outline for how it might be implemented.
Add a field to SubDataFrame showing if it can be modified or not. The default is no and can only be modified if it is constructed via a function I will call ifdataframe for now. Constructors will look like
ifdataframe(df, :x => (t -> t .> mean(t))
All of the Pair{Symbol, Function} arguments return boolean arrays.
modfy!(df::AbstractDataFrame) and add new columns as needed. modify!(ifdataframe(df, :x => (t -> t .> mean(t))) will only modify columns with above-average values of :x. transform! could be one such function. This is a feature that both @jmboehm and @matthieugomez have both requested, without stating it as formally.
Tbh, I donβt think this is first order (ifelse does the same thing albeit less elegant). Maybe doing a specific type where() similarly to groupby() would be nice, but itβs more of an upgrade than a fundamental change.
Supporting this could make sense, but if we allow it it should work on any SubDataFrame. No need to introduce special cases. Though maybe convenience methods could be added to view to allow using the same syntax as filter to choose which rows should be retained.
Currently SubDataFrame supports all that is requested here except two cases:
Also I think immediately after we add this we will get a request for two additional features:
if statement withing group in GroupedDataFrame processing contextif statement for filtering groups in GroupedDataFrame processing context (and filtered-out groups get all missing)Therefore in the long term I would rather think of appropriate kwargs to select et al. functions, but I would leave this for later as this is mostly convenience functionality.
Still in this thread we can try work out the API.
So Iβve been thinking about that a bit more. Here is an idea for a syntax that builds on the recent changes (to be clear: its not a request ;))
# create a new column equal to mean of x by id for rows for which x> 1
@> df by(:id) where(:x => x -> x > 1) transform!(:x => mean)
# return a DataFrame with the mean of x by id for rows for which x>1
@> df by(:id) where(:x => x -> x > 1) combine(:x => mean)
The idea is that by and where are clauses.
by and where is loose, but it typically means the following: methods accepting claused DataFrames do something within groups (by) or/and for certain rows (where). transform) return DataFrames without clauses.The design is of course inspired by Stata and SQL where where (if in Stata) and by clauses can be applied to any method.
A nice thing about the current design (https://github.com/JuliaData/DataFrames.jl/pull/2214) is that these clauses donβt have to live in DataFrames, so one could try them in an outside package first.
Following up on Matthieu's idea, it seems to me a large class of operations on tabular data df_1,..., df_k can be described as a triple of
Split-apply-combine, reshapes, merges etc are all examples of these. Of course the above is so general that it's probably useless to think of in this generality. Still, the question is where you want to provide an interface to reduce the dimensionality of the above.
Hence: should by and where be the only clauses? Why have these two if the meaning varies across methods? My pet peeve is, for example, that I never remember whether if in Stata applies to the assignment or to the subset of the dataframe/group...
We are converging here towards a full:
SELECT column_name(s)
FROM table_name
WHERE condition
GROUP BY column_name(s)
HAVING condition
ORDER BY column_name(s);
combo :smile: (of course it would be nice to have it all covered). Probably as @matthieugomez it would be good to experiment with it in some side package (or a PR that would not be intended to be merged) so that the syntax could be worked out without rushing.
# create a new column equal to mean of x by id for rows for which x> 1 @> df by(:id) where(:x => x -> x > 1) transform!(:x => mean) # return a DataFrame with the mean of x by id for rows for which x>1 @> df by(:id) where(:x => x -> x > 1) combine(:x => mean)
Yes this is exactly my proposal.
I should add that I don't think what you describe is that different from the proposal for grouped transforms. A grouped data frame is, after all, a data frame with some information about row indexes for each group attached to it.
A sub data frame is just a dataframe with row-indices attached to it.
Composability is nice, though. I will make sure to explore this more thoroughly after Finals and probably after Quals as well.
@pdeffebach yes, without where itβs exactly what https://github.com/JuliaData/DataFrames.jl/pull/2214 already does.
The main difference with your proposal is to use a type different from SubDataFrame (and, in particular, that does not inherit from AbstractDataFrame)
A sub data frame is just a dataframe with row-indices attached to it.
also column indices
yes, i am missing the functionality of adding/modifying a column on a subset of a dataframe only. I think it's really important. I think a good reference here (alternative to the magrittr pipe mentioned above) is the R data.table setup. The simple subsetting with [ is really clean IHMO. Not saying that's the way to go, but it's a reference.
> library(data.table)
data.table 1.12.8 using 4 threads (see ?getDTthreads). Latest news: r-datatable.com
> dt = data.table(a = 1:5, b = rnorm(5))
> dt
a b
1: 1 1.2576032
2: 2 1.6769647
3: 3 1.9672758
4: 4 -0.4590235
5: 5 -2.0694431
> dt[ a > 3, newcol := b + 2 ] # do where (a>3) what (add newcol)
> dt
a b newcol
1: 1 1.2576032 NA
2: 2 1.6769647 NA
3: 3 1.9672758 NA
4: 4 -0.4590235 1.54097647
5: 5 -2.0694431 -0.06944313
what you write is currently supported, but with a different syntax:
julia> dt = DataFrame(a=1:5, b=randn(5))
5Γ2 DataFrame
β Row β a β b β
β β Int64 β Float64 β
βββββββΌββββββββΌββββββββββββ€
β 1 β 1 β -0.142721 β
β 2 β 2 β 0.888727 β
β 3 β 3 β -1.39633 β
β 4 β 4 β 0.617181 β
β 5 β 5 β 0.974018 β
julia> dt[!, :b] = ifelse.(dt.a .> 3, dt.b .+ 2, missing)
5-element Array{Union{Missing, Float64},1}:
missing
missing
missing
2.6171813485634132
2.9740178167229954
julia> dt
5Γ2 DataFrame
β Row β a β b β
β β Int64 β Float64? β
βββββββΌββββββββΌβββββββββββ€
β 1 β 1 β missing β
β 2 β 2 β missing β
β 3 β 3 β missing β
β 4 β 4 β 2.61718 β
β 5 β 5 β 2.97402 β
what we do not allow is for a view to mutate the parent (in particular a view can have column :b omitted, but its parent might have had it)
Oh I see! that's good then. sorry about the noise.
You are welcome. Note also that with ifelse you have a choice of sentinel, as in some cases missing is not wanted, but you might e.g. prefer nothing or 0.
Given #2329 if you have some comments after some time has passed, please add your opinion. Thank you!
I think I may try my hand at an IfDataFrame that functions as a mutable view. We will see how much success this gets. I will copy over the SubDataFrame code and change what's needed.
I think I may try my hand at an
IfDataFrame
Actually, if we see the value in allowing the view to modify parents, I would prefer not to add a new object that is very similar to the one we have.
I understand the rules would be:
SubDataFrame as it works nowSubDataFrame (i.e. you are not allowed to push! and append! to it - as come columns that are not in a view might not allow missing)SubDataFrame IF this column name does not exist in parent, then this column:Union{Missing, RequestedType} and would fill with missing the values in rows that are not selectedsetindex!, and broadcast! implementations that would handle this case (`broadcast! will be hardest) keeping the rule we have now:Note that
sdf[!, col] = v,sdf[!, cols] = vandsdf.col = vare not allowed as sdf can be only modified in-place.
Is this what you think would be good?
If yes - and you go for implementing it - the key challenge is handling of SubIndex, so that after adding a column to a view this view gets the requested column added. As a particular case note that you use : to create a view then it uses Index not SubIndex so the situation is different (simpler).
Actually - maybe we want to allow this operation only if SubDataFrame has ALL columns selected with : (this is probably the use case that we have in practice)? If this is the case then the implementation will be much easier, but we will need to carefully document it.
Having thought more about it I am leaning even more to allow adding columns only in view(df, row_selector, :). We even could then define insertcols! for this case also.
Then we would have to highlight more how different is view(df, row_selector, :) from view(df, row_selector, col_selector) in the documentation, but it is already very different so it is OK to have this distincion.
The general idea is that when you write view(df, row_selector, :) you explicitly say you want to subset only rows (and already currently such view gets resized if df is resized), while view(df, row_selector, col_selector) asks for slicing, so then it is natural to expect that it does not allow adding columns (and it does not get resized if you change the parent).
Thanks for this! I think this is what I want. I hadn't thought about the push! and append! cases, but you are right.
One major reason for this logic is that it means we wouldn't have to do as much work in #2258. We could overload transform for a SubDataFrame and it would "just work".
This would also mean that our "piping" functions are less special than working normally with DataFrame objects. cc @matthieugomez about this.
As mentioned earlier in the thread, I think I would still prefer a WhereDataFrame type that disappears after
a transform/select/combine call.
For instance,
transform!(where(df, :x => >(1), x => mean(x))
transform!(where(df, :x => x -> !ismissing(x)), x => mean(x))
would return the original (parent) DataFrame.
But maybe having such a functionality would be too complex.
On Thu, Aug 6, 2020 at 2:57 PM pdeffebach notifications@github.com wrote:
Thanks for this! I think this is what I want. I hadn't thought about the
push! and append! cases, but you are right.One major reason for this logic is that it means we wouldn't have to do as
much work in #2258
https://github.com/JuliaData/DataFrames.jl/issues/2258. We could
overload transform for a SubDataFrame and it would "just work".This would also mean that our "piping" functions are less special than
working normally with DataFrame objects. cc @matthieugomez
https://github.com/matthieugomez about this.β
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/JuliaData/DataFrames.jl/issues/2211#issuecomment-670211867,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/ABPPPXLFED6AXGP4FY5W4WLR7MRNNANCNFSM4MQ44KJQ
.
Another option, if we only need this functionality in piping and select!/transform! etc. would be to add where kwarg to therm as discussed earlier so you would write:
transform!(df, where = :x => >(1), :x => mean)
transform!(df, where = :x => !ismissing, :x => mean)
the question is if we want to also be able to do such things in imperative style, so:
dfv = @view df[some_rows, :]
dfv[:, newcol] .= 1 # and :newcol has 1 for some_rows and missing in other places
If we need it only for piping style I feel I would prefer to use a kwarg for this in the functions that we want to support it.
I don't think we only need this functionality in piping. I think that this reasoning would be useful using getindex and setindex as well. That's why I propose this, so we can get transform "for free".
This solution also strikes me as very "julian" since it relies on method overloading and is similar to our use of groupby.
So as you have commented in the other thread to support getindex and setindex! we would need another type (actually we could extend GroupedDataFrame to be able to handle this use case, but I think it would be a mental overkill).
So it would have to be FilteredDataFrame (or WhereDataFrame), it would be created with filter with view=true and follow the logic we have described here (in particular being automatically unwrapped after transformations).
I would propose to think about it a bit and decide if users will have an easy time understanding how it works or not. Probably writing some more code snippets what would show how it should be used would be nice.
In particular let me stress what tension we have here:
SubDataFrame for this - we can provide all you want and it is a small change, but it will get a DataFrame with filtered rows in transform/select and after transform!/select! you will still get a SubDataFrame in resultFilteredDataFrame for this - this is much more work, all would behave the same as in point 1., except that it would return a DataFrame (either a new one or a parent) updated with all rows. Also in this case we would have to decide what behavior we want to have if FilteredDataFrame is called on SubDataFrame as it is a corner case we should decide what to do with; in particular note that currently SubDataFrame allows for duplicate rows in the view)I agree that adding new type of data frame view is probably overkill. I'd rather recommend people to use in-place operations like transform!, and instead of piping use the original data frame, like this:
df = DataFrame(...)
transform!(filter(:x => >(1), df, view=true), :x => mean)
df
...
Of course if you prefer piping you can add parent() in the series of operations.
Adding a where/filter keyword argument to select and transform is another option (that can be added at any time), but I'm a bit hesitant because it's hard to know where to stop: we would probably have to add groupby too, then maybe also sort, etc.
it's hard to know where to stop
That was my fear. Having this via view is a clean solution that can be handled in implementation without complicating the code with "special cases" (sorry for caring about it - but I also think about not having the code base so complicated that at some point no one will be able to understand how it works and we already have several places where it is really complicated, especially with GroupedDataFrame).
I really think a WhereDataFrame would be an awesome addition to the package. Beyond this issue, it would solve at least two other open issues:
filter is row-wise, and does not accept a DataFrame as a first argument (https://github.com/JuliaData/DataFrames.jl/issues/2323)filter and ifelse error on missing, which makes it hard to work with missing values (https://github.com/JuliaData/DataFrames.jl/issues/2314)Just to clarify the proposal, the function where (or similarly named) creates a WhereDataFrame from a DataFrame and a couple of conditions:
where(df, :x => .>=1, :y => .>=2))
In contrast with filter, conditions are column-wise by default, and rows that evaluate to missing are not kept (i.e. coalesce to false by default). WhereDataFrame does not inherit from DataFrame.
Then, the functions left to define on a WhereDataFrame are:
filter(wdf::WhereDataFrame) : return a DataFrame with only the selected rows
transform(wdf::WhereDataFrame, args...): return a DataFrame where only the selected rows are modified, and missing otherwise
combine(wdf::WhereDataFrame, args...): return the same thing as combine(filter(wdf), args...)
This does not need to happen now, since this proposal is not breaking, but I just wanted to say I think it's the right way forward to solve a number of issues we've been talking about recently.
I am OK with where as noted elsewhere (i.e. drop missing, col-wise by default, reverse argument order).
The question is if we need another type. An alternative we have been discussing is to allow SubDataFrame with : as column selector to play this role. The difference would be that it would work only for "in-place" functions: select!, transform!, setindex!, insertcols! and broadcasting assignment.
In this case we should just add a view kwarg to where just as proposed in https://github.com/JuliaData/DataFrames.jl/pull/2386 (the benefit of this design would be that all functions listed there + view could be used to "filter" the data frame and then pass it for transformations not just where).
Having a separate WhereDataFrame type has a benefit that you could define for it copying select and transform functions that would produce the whole original data frame but copied. But I am not sure this is that valuable - but maybe I am wrong. Let us discuss it.
Ah - now I see one difference of my approach and your approach when you would assign to an EXISTING column (as opposed to creating a new column).
SubDataFrame (my) approach: if you write into an existing column its contents in rows other than filtered remain unchanged AND no promotion of eltype of this column happens (so it is just like df[rows, col] .= 1 - it does not do promotion)WhereDataFrame (your) approach: if I read your proposal correctly if you write into an existing column it gets discarded, rows other than filtered get filled with missing and the eltype of the column is potentially altered (to be union of Missing andthe type you assign to the filtered part of the column)I am not sure which approach should be preferred in practice (maybe both.)
The latter behavior could be obtained with transform!(sdf, col => fun => col), which should replace column col with a new one (just like it currently does for DataFrame) rather than replace its contents in-place.
I think we should first see how far we can get with DataFramesMeta without introducing new types, and then we'll see whether it's still worth it.
The latter behavior could be obtained with
transform!(sdf, col => fun => col), which should replace column col with a new one (just like it currently does forDataFrame) rather than replace its contents in-place.
It could - there is no restriction how we do it. My major concern is if we want to have both behaviors or having one of them is enough (indeed the former can be obtained by just writing df[rows, col] = something without using views so maybe this is not needed).
But in general - if this is the case - then we have all we want with SubDataFrame.
Yeah it could, but I think it would be more consistent with DataFrame to replace the column.
Then - such SubDataFrame would be exactly WhereDataFrame that @matthieugomez wants.
@matthieugomez - do you see any differences?
Most helpful comment
You are welcome. Note also that with
ifelseyou have a choice of sentinel, as in some casesmissingis not wanted, but you might e.g. prefernothingor0.