Currently, to have the equivalent (or something similar) of the SQL having
clause you need to write a [.data.table
first using by
and then feed the result into the i
parameter of a second [.data.table
, like in:
dt <- data.table(id = rep(1:2, each = 2),
var = c(0.2, 0.5, 1.5, 1.3))
dt[dt[, mean(var) > 1, by = id]$id]
id var
1: 2 1.5
2: 2 1.3
Another option is to use conditional statement inside j
, very powerful, I do all the time, and so far there is nothing that the current syntax did not allow me to do. However having a having
parameter I believe will allow writing much more clear and readable codes. For example the above can be written as:
dt[, if(mean(var) > 1) .SD, by = id]
What I propose is something like:
dt[, .SD, by = id, having = mean(var) > 1]
The idea is to have an expression that always evaluates to a logical of length 1 which would tell whether or not j
has to be evaluated for the current group.
Thanks,
Michele
Great FR. I've been pondering about this use case for quite a while as well. We can do this without the additional argument like so:
dt[, .SD[mean(var)>1], by=id]
(But for speed, this'll need optimisation of .SD[.]
internally - #735.)
It's most likely this case that we resort to .I
instead:
dt[dt[, .I[mean(var) > 1], by=id]$V1]
And it'd be great to get this directly (even better if we can achieve it without having
) - maybe if j
expression evaluates to a 1-column logical vector? Just thinking out loud.
Hi Arun. Thanks for the answer. Once optimisation of .SD
will be available then this would be just a matter of 'taste' in terms of what is clearer to read between:
dt[, .SD[mean(var)>1], by=id]
and
dt[, .SD, by = id, having = mean(var) > 1]
Even though the second may also have a better appeal for people coming from other languages (SQL in particular). But again, this might be just my opinion. Maybe I just using SQL too much in the last period (lol).
As far as the taste part goes - I really dislike adding an extra param, when it can be accomplished with simple and standard syntax (i.e. the first option above).
Curious. I was sure that you were that one most likely to appreciate this :-) (considering how much you wanted to eliminate by-without-by, mainly to improve readability, especially for people coming from other languages, if I remember correctly). Anyway, I know the two are quite different scenarios. I just wanted to share my point of view,:
[.data.table
doesn't really harmj
for a particular groupThe reasons I didn't like the silent by-without-by and "having" are actually the same - I don't like to remember extra stuff, whether it be extra params or extra strange behavior.
I would argue that the first expression you wrote is much easier to read, because you don't have to keep reading the line, then discover that some new param is specified, and have to go back to the beginning of the sentence and reevaluate your mental model of what's going on.
What do you think of not adding having
arg to [
, but turning it into having()
function and making it works in i
argument, the same as order()
works, e.g.:
dt[ having(var > 1), .(var = mean(var)), by = id ]
# would perform below without additional copy:
dt[, .(var = mean(var)), by = id ][ var > 1 ]
having
would be a function to evaluate its argument in frame of dt
and provide filtering to i
.
I think this FR is closely tied with https://github.com/Rdatatable/data.table/issues/1269 "Returning only groups." I often want to get groups with some attribute and store them in a vector, like my_teams
in this SO post. Here's the relevant line:
my_teams <- FantasyTeams[, max(table(Team)) <= 3, by=team_no][(V1)]$team_no
# or
my_teams <- FantasyTeams[, if ( max(table(Team)) <= 3 ) 1, by=team_no]$team_no
With the having
and "Returning only groups" FRs, this could be something like
my_teams <- FantasyTeams[, .(), by = team_no, having = { max(table(Team)) <= 3 }]$team_no
The code is just as long, but I prefer it, so I don't have to read j
carefully to understand the objective.
Another example from SO. The goal is to overwrite the Value
column with 3L
if some by-group level condition holds:
DT = setDT(structure(list(Ind = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), ID = c("A",
"A", "A", "A", "B", "B", "B", "B"), RegionStart = c(1L, 101L,
1L, 101L, 1L, 101L, 1L, 101L), RegionEnd = c(100L, 200L, 100L,
200L, 100L, 200L, 100L, 200L), Value = c(3L, 2L, 3L, 2L, 3L,
2L, 5L, 5L), TN = c("N", "N", "T", "T", "N", "N", "T", "T")), .Names = c("Ind",
"ID", "RegionStart", "RegionEnd", "Value", "TN"), row.names = c(NA,
-8L), class = "data.frame"))
# current syntax
DT[, Value := {
fixit = ( Value[TN=="N"] != 3L ) & ( uniqueN(Value) == 1L )
if (fixit) 3L else Value
}, by=.(ID, RegionStart)]
# with "having"
DT[,
Value := 3L
, by=.(ID, RegionStart)
, having={ ( Value[TN=="N"] != 3L ) & ( n_distinct(Value) == 1L ) }]
Besides arguably nicer syntax, I guess the having=
way might also be more efficient, since only a subset of by-groups need to be modified. The most efficient way without having=
probably looks like...
myeyes = DT[, .I[ ( Value[TN=="N"] != 3L ) & ( uniqueN(Value) == 1L )], by=.(ID, RegionStart)]$V1
DT[ myeyes, Value := 3L]
# or
mygs = DT[, ( Value[TN=="N"] != 3L ) & ( uniqueN(Value) == 1L ), by=.(ID, RegionStart)][(V1)][, V1 := NULL]
DT[ mygs, Value := 3L, on=names(mygs)]
which are rather convoluted.
Edit: And another example to update if/when this feature is available: http://stackoverflow.com/q/36292702
(2016/4/26:) http://stackoverflow.com/q/36869784
(2016/06/16:) http://stackoverflow.com/q/37855013/
Another example from SO. It could be used to select strictly unique rows (related to #1163 ):
DT = setDT(structure(list(id = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1,
2, 3, 4), dsp = c(5, 6, 7, 8, 6, 6, 7, 8, 5, 6, 9, 8, 5, 6, 7,
NA), status = c(FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE,
TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE)), .Names = c("id",
"dsp", "status"), row.names = c(NA, -16L), class = "data.frame"))
# my current way to select "strictly unique" rows
Bigdt[, .N, by=names(Bigdt)][N == 1][, N := NULL][]
# could be...
Bigdt[, .SD, by=names(Bigdt), having ={.N == 1L}]
Note that Bigdt[, if (.N == 1L) .SD, by=names(Bigdt)]
does not work here, since .SD
is empty. Maybe that could be helped by #1269 , though.
And another from SO: http://stackoverflow.com/q/38272608/ They want to select groups based on stuff in the last row, so having =
heath condition[.N] == "not healthy"
should do it.
And another simple case (filtering by size): http://stackoverflow.com/q/39085450/
And another, with an anti join:
ID <- c("A","A","A","B","B","C","D")
Value <- c(0,1,2,0,2,0,0)
df <- data.frame(ID,Value)
library(data.table)
setDT(df)
# use j = max() to get GForce speedup
df[ !df[, max(Value), by=ID][V1 > 0, .(ID, Value = 0)], on=.(ID, Value)]
# do the more standard thing, if j = if (...) x
df[ !df[, if (max(Value) > 0) .(Value = 0), by=ID], on=.(Value, ID) ]
# desired syntax
df[ !df[, .(Value = 0), by=ID, having = max(Value) > 0], on=.(Value, ID) ]
Not such a great example, though.
And another with an answer like dt[, if(uniqueN(time)==1L) .SD, by=name, .SDcols="time"]
And another: http://stackoverflow.com/q/43354165/
And another: http://stackoverflow.com/q/43613087/
Another (though it might get deleted): http://stackoverflow.com/q/43635968/
Another http://stackoverflow.com/a/43765352/
Another http://chat.stackoverflow.com/transcript/message/37148860#37148860
Un autre https://stackoverflow.com/q/45557011/
Haiyou https://stackoverflow.com/questions/45598397/filter-data-frame-matching-all-values-of-a-vector
Um mais https://stackoverflow.com/a/45721286/
lingwai yige https://stackoverflow.com/a/45820567/
and https://stackoverflow.com/q/46251221/
uno mas https://stackoverflow.com/questions/46307315/show-sequences-that-include-a-variable-in-r
tambem https://stackoverflow.com/q/46638058/
And another. I want to subset my data.table (myDT) to entries that aren't found in a reference table (idDT):
library(data.table)
idDT = data.table(id = 1:3, v = c("A","B","C"))
myDT = data.table(id = 3:4, z = c("gah","egad"))
# my attempt
idDT[myDT, on=.(id), .SD[.N == 0L], by=.EACHI]
# Empty data.table (0 rows) of 2 cols: id,v
# workaround
myDT[, .SD[idDT[.SD, on=.(id), .N == 0, by=.EACHI]$V1]]
# desired notation (with having=)
myDT[, .SD, by = id, having = idDT[.BY, on=.(id), .N]==0L]
This would be inefficient, though, since my desired notation entails each by= value making a separate join to idDT. In that sense, maybe it's not the best example.
mais um https://stackoverflow.com/questions/47765283/r-data-table-group-by-where/47765308?noredirect=1#comment82524998_47765308 could do DT[, if (any(status == "A") && !any(status == "B")) .SD, by=id]
or with a having parameter DT[, .SD, by=id, having = any(status == "A") && !any(status == "B")]
and then https://stackoverflow.com/a/48669032/ m[, if(isTRUE(any(passed))) .SD, by=id]
should be m[by = id, having = isTRUE(any(passed))]
mais um exemplo https://stackoverflow.com/q/49072250/
ein anderer https://stackoverflow.com/a/49211292/ stock_profile[, sum(Value), by=Pcode, having=any(Location=="A" & NoSales == "Y")][, sum(V1)]
mais um https://stackoverflow.com/a/49366998/
autre https://stackoverflow.com/a/49919015/
y https://stackoverflow.com/questions/50257643/deleting-rows-in-r-with-value-less-than-x
moar https://stackoverflow.com/q/54582048
e https://stackoverflow.com/q/56283005
keep groups if .N==k (also many at the dupe target) https://stackoverflow.com/questions/56794306/only-get-data-table-groups-with-a-given-number-of-rows
keep groups if any(diff(sorted_col)) <= threshold https://stackoverflow.com/q/57512417
keep if max(x) < threshold https://stackoverflow.com/a/57698641
@eantonya IMHO, adding the having
parameter will actually make it easier to remember. Excessive conciseness can be hard to memorize. Moreover, making data.table
more like SQL isn't a bad idea.
In data.table
FAQ:
2.16 I’ve heard that data.table syntax is analogous to SQL.
Yes : ...
@ywhuofu data.table already accepts order
function to i
argument, which is what base R user would expect. The same way we translate sql _ORDER_ to i = order(...)
we can do with _HAVING_. It fits well, as i
in data.frame is used for subsetting (_having_ is just delayed subsetting after aggregation) or reordering.
Could this be the API?
dt <- data.table(id = rep(1:2, each = 2),
var = c(0.2, 0.5, 1.5, 1.3))
dt[having.i(mean(var) > 1, by = id)]
id var
1 2 1.5
2 2 1.3
I have implemented this version although it sets a restriction of only using gforce
optimized functions as well as some functions that do not depend on groupings (e.g., +
, |
, &
, etc.). In SQL, summary functions are what are used although I would understand if Cdogroups
would want to be supported.
One additional note. It seems like it would be difficult to fit in dt[having(var > 3), .(var = mean(x)), by = .(grp)]
within the current '[.data.table'
code. There would need to be some checks to make sure the syntax were correct.
```
n = 1e6
grps = 1e5
head_n = 2L
dt = data.table::data.table(x = sample(grps, n, TRUE), y = runif(n))
expression min median
1 lw[having.i(.N < 2L | sum(y) > 11 | median(y) < 0.7, by = x)] 114.13ms 124.98ms
2 dt[dt[, .I[.N < 2L | sum(y) > 11 | median(y) < 0.7], by = x]$V1] 4000ms 4000ms
expression min median itr/sec
mem_alloc gc/sec
n_itr
1 lw[having.i(.N < 2L, by = x)] 30.2ms 35.3ms 27.9 8.02MB 3.99 14
2 dt[dt[, .I[.N < 2L], by = x]$V1] 106.1ms 110.4ms 8.81 6.13MB 10.6 5
I would prefer this as an added parameter, named either as having=
or group_filter=
(or something else that doesn't rely on SQL awareness to know what it does on sight).
e.g. I think it would be confusing to combine row filters in i
with group-level filters also in i
Would having =
work on a subset of the data or would you only be able to use the i
arg or the having
arg? I also assume that having
would occur prior to the j
being evaluated. How would .BY
and .GRP
and soon .NGRP
work with ```having = ````?
There are not many syntactic choices:
having
i
, j
, by
.If row filter and group filter are both needed, dt[row_selector & group_selector, ...]
does not look right therefore it seems that for such use case, row filter and group filter should not appear in the same argument. Then i
is ruled out.
Then there won't be many syntactic choices.
Leveraging by
could make it look confusing. For example,
dt[, .SD, by = having(.(id), mean(var > 1))]
dt[, .SD, by = id ~ mean(var) > 1]
Adding special function to j
does not look good.
dt[, having(mean(var) > 1, .SD), by = id]
Now, the code I find look best to me is the most original version
dt[, if (mean(var) > 1) .SD, by = id]
dt[, if (mean(var) > 1) .(x = sum(x), y = sum(y)), by = id]
What I really want is keep the optimization done after the group filtering. Can we detect theif
expression in j
and optimize into it such as keeping GForce to work inside if statement?
@renkun-ken Or overload another infix operator?
dt[, mean(var) > 1 ? .SD, by=id]
One advantage of a special symbol over if
is that there's no chance the user put a matching else
later.
@franknarf1 It seems that while we are trying to detect if
in j
, we could also check if
has else if
and else
. We could optimize if
-only case and leave if-else
un-optimized. Maybe later we could handle if-else
case too. Personally I still prefer optimizing code to overriding or leveraging too much on existing operators.
@franknarf1 this is cool C syntax, although not sure if it wouldn't complicate to much here.
var > 1 ? d : e
could work as well, isn't it?
var > 1 ? d : e
looks concise but only works for inline simple case since d
and e
could be anything like {...}
and the operator precedence may be confusing. Are we only trying to allow .SD
to perform pure group filtering, or any expression in j
here?
Adding syntax has a problem that user needs to be aware that the syntax is specially handled and should not work inside j
. For example, user may expect
dt[, mean(var) > 1 ? 0 : (sd(var) < 1 ? 1 : 0), by = id]
to work, and even
dt[, mean(var) > 1 ? 0 : 1]
dt[, mean(var) > 1 ? 0 : (sd(var) < 1 ? 1 : 0)]
to work in general.
I'm a bit confused here.
Does
dt[, .SD, by = id, having = mean(var) > 1]
have any advantage over
dt[, if(mean(var) > 1) .SD, by = id]
since mean(var) > 1
will always be evaluated for each group. Does it only serve as a syntactic sugar or we are trying to optimize over this somehow to have higher performance?
@jangorecki
@franknarf1 this is cool C syntax, although not sure if it wouldn't complicate to much here.
var > 1 ? d : e
could work as well, isn't it?
Yeah, that would be cool. Operator precedence might get in the way without {}
s as @renkun-ken pointed out (ex = quote(x & y ? a+b : v+w); str(rapply(as.list(ex), as.list, how="replace"))
)
I'm a bit confused here.
Does
dt[, .SD, by = id, having = mean(var) > 1]
have any advantage over
dt[, if(mean(var) > 1) .SD, by = id]
since
mean(var) > 1
will always be evaluated for each group. Does it only serve as a syntactic sugar or we are trying to optimize over this somehow to have higher performance?
I guess until now I had preferred having=
because I find it a little clearer to read and imagine it's easier to maintain as compared to adding further syntactical magic to j
. On the other hand, I think I might instead prefer the j
syntax magic, since
if () ...
already; and like the ?
way too if it's feasible.j
, then no additional questions need to be answered about its behavior (eg, DT[, x := if (cond) y, by=id]
creates NAs if the condition is met in some groups but not others and this behavior shouldn't need to be re-explained for having=
).Regarding optimization, it seems like there are a lot of examples where the having condition itself could benefit from some version of GForce, since it usually is an expression like max(x) > 0
, max(x) == 0
.
For my own use, besides the optimization, I guess it would be mostly useful for the return-only-groups case mentioned above https://github.com/Rdatatable/data.table/issues/1269
> dt[, if (mean(var) > 1) .(), by=id]
> # instead of ...
> dt[, mean(var) > 1, by=id][V1 == TRUE, !"V1"]
id
1: 2
Nice points Frank. in addition to the huge compendium of use cases you've
built (thanks again btw!).
it may in fact be easier to do GForce in the having= version since we can
just apply the gforce logic to having similar to j rather than trying to do
NSE to accomplish the same.
though that may interact w Jan's WIP to move a lot of j code to C -- any
thoughts there Jan?
On Sat, Feb 15, 2020, 1:40 PM Frank notifications@github.com wrote:
@jangorecki https://github.com/jangorecki
@franknarf1 https://github.com/franknarf1 this is cool C syntax,
although not sure if it wouldn't complicate to much here.
var > 1 ? d : e could work as well, isn't it?Yeah, that would be cool. Operator precedence might get in the way without
{}s as @renkun-ken https://github.com/renkun-ken pointed out (ex =
quote(x & y ? a+b : v+w); str(rapply(as.list(ex), as.list, how="replace"))
)I'm a bit confused here.
Does
dt[, .SD, by = id, having = mean(var) > 1]
have any advantage over
dt[, if(mean(var) > 1) .SD, by = id]
since mean(var) > 1 will always be evaluated for each group. Does it only
serve as a syntactic sugar or we are trying to optimize over this somehow
to have higher performance?I guess until now I had preferred having= because I find it a little
clearer to read and imagine it's easier to maintain as compared to adding
further syntactical magic to j. On the other hand, I think I might
instead prefer the j syntax magic, since
- I am used to if () ... already; and like the ? way too if it's
feasible.- If it is integrated in j, then no additional questions need to be
answered about its behavior (eg, DT[, x := if (cond) y, by=id] creates
NAs if the condition is met in some groups but not others and this behavior
shouldn't need to be re-explained for having=).Regarding optimization, it seems like there are a lot of examples where
the having condition itself could benefit from some version of GForce,
since it usually is an expression like max(x) > 0, max(x) == 0.For my own use, besides the optimization, I guess it would be mostly
useful for the return-only-groups case mentioned above #1269
https://github.com/Rdatatable/data.table/issues/1269dt[, if (mean(var) > 1) .(), by=id]
instead of ...
dt[, mean(var) > 1, by=id][V1 == TRUE, !"V1"]
id
1: 2—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/Rdatatable/data.table/issues/788?email_source=notifications&email_token=AB2BA5OCN4IW3N6QQJU6RJ3RC555BA5CNFSM4ATSQPMKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEL3CK7A#issuecomment-586556796,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AB2BA5MD7ZXWSRRHVEJM6C3RC555BANCNFSM4ATSQPMA
.
j code to be moved to C is the code that is responsible for column selection only, so guessing with
argument. Won't interfere here.
Since the FR is for add a 'having' parameter..., the word having
should be somewhere in the solution. Optimization for ternary operators seems like a separate issue.
My preference for having.i()
is because of the mantra of data.table: subset / order in i, select in j, group in by. having
is just a special case of subsetting.
Regardless, if there is a new argument having
, would the API support an i
arg? Most use cases do not seem to need that requirement.
What should be the behavior of ordering? That is, most of the current approaches automatically re-order:
library(data.table)
dt = data.table(grp = c(1L, 2L, 1L, 2L), x = letters[sample(4L)])
dt
#> grp x
#> <int> <char>
#> 1: 1 a
#> 2: 2 b
#> 3: 1 c
#> 4: 2 d
dt[dt[, .I[.N > 0L], by = grp]$V1]
#> grp x
#> <int> <char>
#> 1: 1 a
#> 2: 1 c
#> 3: 2 b
#> 4: 2 d
Should the having
arg return a result that is re-ordered according to by
?
What should be the behavior of ordering?
Should thehaving
arg return a result that is re-ordered according toby
?
@ColeMiller1 Fwiw, I would expect having=
to only appear when by=
also appears, so the results would be grouped as in your example with ...$V1
.
Yes I would expect the ordering to be consistent:
DT[i, j, by, having]
# < == >
DT[i, if (having) j, by]
I think there was no agreement on API, particularly on having new having
argument in [
. @mattdowle wdyt?
Current approach of using DT[, if (.N > 1L) .SD, col1]
is good one, not really complicated, easy to extend, but little bit more difficult to optimise.
My idea was to use having
as function call in i
: DT[having(N > 1L), .N, col1]
, but then it is not possible to provide normal subsetting to i
.
Alternatively new arg could be a sub-argument of by
, haven't much thought about it but something like DT[, .N, by=.(col1, .having = N > 1L)]
, so the extra grouping-related argument is encapsulated into by
argument. This is the proper way to scale up number of arguments.
Most helpful comment
Another example from SO. It could be used to select strictly unique rows (related to #1163 ):
Note that
Bigdt[, if (.N == 1L) .SD, by=names(Bigdt)]
does not work here, since.SD
is empty. Maybe that could be helped by #1269 , though.And another from SO: http://stackoverflow.com/q/38272608/ They want to select groups based on stuff in the last row, so
having =
heath condition[.N] == "not healthy"
should do it.And another simple case (filtering by size): http://stackoverflow.com/q/39085450/
And another, with an anti join:
Not such a great example, though.
And another with an answer like
dt[, if(uniqueN(time)==1L) .SD, by=name, .SDcols="time"]
And another: http://stackoverflow.com/q/43354165/
And another: http://stackoverflow.com/q/43613087/
Another (though it might get deleted): http://stackoverflow.com/q/43635968/
Another http://stackoverflow.com/a/43765352/
Another http://chat.stackoverflow.com/transcript/message/37148860#37148860
Another https://stackoverflow.com/questions/45464333/assign-a-binary-vector-based-on-blocks-of-data-within-another-vector/
Another https://stackoverflow.com/questions/32259620/how-to-remove-unique-entry-and-keep-duplicates-in-r/32259758#32259758
Un autre https://stackoverflow.com/q/45557011/
Haiyou https://stackoverflow.com/questions/45598397/filter-data-frame-matching-all-values-of-a-vector
Um mais https://stackoverflow.com/a/45721286/
lingwai yige https://stackoverflow.com/a/45820567/
and https://stackoverflow.com/q/46251221/
uno mas https://stackoverflow.com/questions/46307315/show-sequences-that-include-a-variable-in-r
tambem https://stackoverflow.com/q/46638058/
And another. I want to subset my data.table (myDT) to entries that aren't found in a reference table (idDT):
This would be inefficient, though, since my desired notation entails each by= value making a separate join to idDT. In that sense, maybe it's not the best example.
mais um https://stackoverflow.com/questions/47765283/r-data-table-group-by-where/47765308?noredirect=1#comment82524998_47765308 could do
DT[, if (any(status == "A") && !any(status == "B")) .SD, by=id]
or with a having parameterDT[, .SD, by=id, having = any(status == "A") && !any(status == "B")]
and then https://stackoverflow.com/a/48669032/
m[, if(isTRUE(any(passed))) .SD, by=id]
should bem[by = id, having = isTRUE(any(passed))]
mais um exemplo https://stackoverflow.com/q/49072250/
ein anderer https://stackoverflow.com/a/49211292/
stock_profile[, sum(Value), by=Pcode, having=any(Location=="A" & NoSales == "Y")][, sum(V1)]
mais um https://stackoverflow.com/a/49366998/
autre https://stackoverflow.com/a/49919015/
y https://stackoverflow.com/questions/50257643/deleting-rows-in-r-with-value-less-than-x
moar https://stackoverflow.com/q/54582048
e https://stackoverflow.com/q/56283005
keep groups if .N==k (also many at the dupe target) https://stackoverflow.com/questions/56794306/only-get-data-table-groups-with-a-given-number-of-rows
keep groups if any(diff(sorted_col)) <= threshold https://stackoverflow.com/q/57512417
keep if max(x) < threshold https://stackoverflow.com/a/57698641