Data.table: add a 'having' parameter to `[.data.table`

Created on 29 Aug 2014 · 28Comments · Source: Rdatatable/data.table

Currently, to have the equivalent (or something similar) of the SQL having clause you need to write a [.data.table first using by and then feed the result into the i parameter of a second [.data.table, like in:

dt <- data.table(id   = rep(1:2, each = 2),
                 var  = c(0.2, 0.5, 1.5, 1.3))

dt[dt[, mean(var) > 1, by = id]$id]
   id var
1:  2 1.5
2:  2 1.3

Another option is to use conditional statement inside j, very powerful, I do all the time, and so far there is nothing that the current syntax did not allow me to do. However having a having parameter I believe will allow writing much more clear and readable codes. For example the above can be written as:

dt[, if(mean(var) > 1) .SD, by = id]

What I propose is something like:

dt[, .SD, by = id, having = mean(var) > 1]

The idea is to have an expression that always evaluates to a logical of length 1 which would tell whether or not j has to be evaluated for the current group.

Thanks,
Michele

feature request

Source

nigmastar

👍8

Most helpful comment

Another example from SO. It could be used to select strictly unique rows (related to #1163 ):

DT = setDT(structure(list(id = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 
2, 3, 4), dsp = c(5, 6, 7, 8, 6, 6, 7, 8, 5, 6, 9, 8, 5, 6, 7, 
NA), status = c(FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, 
TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE)), .Names = c("id", 
"dsp", "status"), row.names = c(NA, -16L), class = "data.frame"))

# my current way to select "strictly unique" rows
Bigdt[, .N, by=names(Bigdt)][N == 1][, N := NULL][]

# could be...
Bigdt[, .SD, by=names(Bigdt), having ={.N == 1L}]

Note that Bigdt[, if (.N == 1L) .SD, by=names(Bigdt)] does not work here, since .SD is empty. Maybe that could be helped by #1269 , though.

And another from SO: http://stackoverflow.com/q/38272608/ They want to select groups based on stuff in the last row, so having =heath condition[.N] == "not healthy" should do it.

And another simple case (filtering by size): http://stackoverflow.com/q/39085450/

And another, with an anti join:

ID <- c("A","A","A","B","B","C","D")
Value <- c(0,1,2,0,2,0,0)
df <- data.frame(ID,Value)

library(data.table)
setDT(df)

# use j = max() to get GForce speedup
df[ !df[, max(Value), by=ID][V1 > 0, .(ID, Value = 0)], on=.(ID, Value)]

# do the more standard thing, if j = if (...) x
df[ !df[, if (max(Value) > 0) .(Value = 0), by=ID], on=.(Value, ID) ]

# desired syntax
df[ !df[, .(Value = 0), by=ID, having = max(Value) > 0], on=.(Value, ID) ]

Not such a great example, though.

And another with an answer like dt[, if(uniqueN(time)==1L) .SD, by=name, .SDcols="time"]

And another: http://stackoverflow.com/q/43354165/

And another: http://stackoverflow.com/q/43613087/

Another (though it might get deleted): http://stackoverflow.com/q/43635968/

Another http://stackoverflow.com/a/43765352/

Another http://chat.stackoverflow.com/transcript/message/37148860#37148860

Another https://stackoverflow.com/questions/45464333/assign-a-binary-vector-based-on-blocks-of-data-within-another-vector/

Another https://stackoverflow.com/questions/32259620/how-to-remove-unique-entry-and-keep-duplicates-in-r/32259758#32259758

Un autre https://stackoverflow.com/q/45557011/

Haiyou https://stackoverflow.com/questions/45598397/filter-data-frame-matching-all-values-of-a-vector

Um mais https://stackoverflow.com/a/45721286/

lingwai yige https://stackoverflow.com/a/45820567/

and https://stackoverflow.com/q/46251221/

uno mas https://stackoverflow.com/questions/46307315/show-sequences-that-include-a-variable-in-r

tambem https://stackoverflow.com/q/46638058/

And another. I want to subset my data.table (myDT) to entries that aren't found in a reference table (idDT):

library(data.table)
idDT = data.table(id = 1:3, v = c("A","B","C"))
myDT = data.table(id = 3:4, z = c("gah","egad"))

# my attempt
idDT[myDT, on=.(id), .SD[.N == 0L], by=.EACHI]
# Empty data.table (0 rows) of 2 cols: id,v

# workaround
myDT[, .SD[idDT[.SD, on=.(id), .N == 0, by=.EACHI]$V1]]

# desired notation (with having=)
myDT[, .SD, by = id, having = idDT[.BY, on=.(id), .N]==0L]

This would be inefficient, though, since my desired notation entails each by= value making a separate join to idDT. In that sense, maybe it's not the best example.

mais um https://stackoverflow.com/questions/47765283/r-data-table-group-by-where/47765308?noredirect=1#comment82524998_47765308 could do DT[, if (any(status == "A") && !any(status == "B")) .SD, by=id] or with a having parameter DT[, .SD, by=id, having = any(status == "A") && !any(status == "B")]

and then https://stackoverflow.com/a/48669032/ m[, if(isTRUE(any(passed))) .SD, by=id] should be m[by = id, having = isTRUE(any(passed))]

mais um exemplo https://stackoverflow.com/q/49072250/

ein anderer https://stackoverflow.com/a/49211292/ stock_profile[, sum(Value), by=Pcode, having=any(Location=="A" & NoSales == "Y")][, sum(V1)]

mais um https://stackoverflow.com/a/49366998/

autre https://stackoverflow.com/a/49919015/

y https://stackoverflow.com/questions/50257643/deleting-rows-in-r-with-value-less-than-x

moar https://stackoverflow.com/q/54582048

e https://stackoverflow.com/q/56283005

keep groups if .N==k (also many at the dupe target) https://stackoverflow.com/questions/56794306/only-get-data-table-groups-with-a-given-number-of-rows

keep groups if any(diff(sorted_col)) <= threshold https://stackoverflow.com/q/57512417

keep if max(x) < threshold https://stackoverflow.com/a/57698641

franknarf1 on 27 May 2016

👍7

All 28 comments

Great FR. I've been pondering about this use case for quite a while as well. We can do this without the additional argument like so:

dt[, .SD[mean(var)>1], by=id]

(But for speed, this'll need optimisation of .SD[.] internally - #735.)

It's most likely this case that we resort to .I instead:

dt[dt[, .I[mean(var) > 1], by=id]$V1]

And it'd be great to get this directly (even better if we can achieve it without having) - maybe if j expression evaluates to a 1-column logical vector? Just thinking out loud.

arunsrinivasan on 29 Aug 2014

👍1

Hi Arun. Thanks for the answer. Once optimisation of .SD will be available then this would be just a matter of 'taste' in terms of what is clearer to read between:

dt[, .SD[mean(var)>1], by=id]

and

dt[, .SD, by = id, having = mean(var) > 1]

Even though the second may also have a better appeal for people coming from other languages (SQL in particular). But again, this might be just my opinion. Maybe I just using SQL too much in the last period (lol).

nigmastar on 1 Sep 2014

👍1

As far as the taste part goes - I really dislike adding an extra param, when it can be accomplished with simple and standard syntax (i.e. the first option above).

eantonya on 2 Sep 2014

👍2

Curious. I was sure that you were that one most likely to appreciate this :-) (considering how much you wanted to eliminate by-without-by, mainly to improve readability, especially for people coming from other languages, if I remember correctly). Anyway, I know the two are quite different scenarios. I just wanted to share my point of view,:

I'm pretty sure it will be more readable for anyone not very familiar with R (or just data.table)
15 vs 14 (current) parameters of [.data.table doesn't really harm
You won't be forced to use it and it won't break any code. It will be an expression that, if provided, will possibly skip the execution of j for a particular group

nigmastar on 2 Sep 2014

The reasons I didn't like the silent by-without-by and "having" are actually the same - I don't like to remember extra stuff, whether it be extra params or extra strange behavior.

I would argue that the first expression you wrote is much easier to read, because you don't have to keep reading the line, then discover that some new param is specified, and have to go back to the beginning of the sentence and reevaluate your mental model of what's going on.

eantonya on 2 Sep 2014

What do you think of not adding having arg to [, but turning it into having() function and making it works in i argument, the same as order() works, e.g.:

dt[ having(var > 1), .(var = mean(var)), by = id ]
# would perform below without additional copy:
dt[, .(var = mean(var)), by = id ][ var > 1 ]

having would be a function to evaluate its argument in frame of dt and provide filtering to i.

jangorecki on 18 Jan 2015

👍1

I think this FR is closely tied with https://github.com/Rdatatable/data.table/issues/1269 "Returning only groups." I often want to get groups with some attribute and store them in a vector, like my_teams in this SO post. Here's the relevant line:

my_teams <- FantasyTeams[, max(table(Team)) <= 3, by=team_no][(V1)]$team_no
# or 
my_teams <- FantasyTeams[, if ( max(table(Team)) <= 3 ) 1, by=team_no]$team_no

With the having and "Returning only groups" FRs, this could be something like

my_teams <- FantasyTeams[, .(), by = team_no, having = { max(table(Team)) <= 3 }]$team_no

The code is just as long, but I prefer it, so I don't have to read j carefully to understand the objective.

franknarf1 on 30 Sep 2015

Another example from SO. The goal is to overwrite the Value column with 3L if some by-group level condition holds:

DT = setDT(structure(list(Ind = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), ID = c("A", 
"A", "A", "A", "B", "B", "B", "B"), RegionStart = c(1L, 101L, 
1L, 101L, 1L, 101L, 1L, 101L), RegionEnd = c(100L, 200L, 100L, 
200L, 100L, 200L, 100L, 200L), Value = c(3L, 2L, 3L, 2L, 3L, 
2L, 5L, 5L), TN = c("N", "N", "T", "T", "N", "N", "T", "T")), .Names = c("Ind", 
"ID", "RegionStart", "RegionEnd", "Value", "TN"), row.names = c(NA, 
-8L), class = "data.frame"))

# current syntax 
DT[, Value := { 
  fixit = ( Value[TN=="N"] != 3L ) & ( uniqueN(Value) == 1L )
  if (fixit) 3L else Value
}, by=.(ID, RegionStart)]

# with "having"
DT[,
  Value := 3L
, by=.(ID, RegionStart)
, having={ ( Value[TN=="N"] != 3L ) & ( n_distinct(Value) == 1L ) }]

Besides arguably nicer syntax, I guess the having= way might also be more efficient, since only a subset of by-groups need to be modified. The most efficient way without having= probably looks like...

myeyes = DT[, .I[ ( Value[TN=="N"] != 3L ) & ( uniqueN(Value) == 1L )], by=.(ID, RegionStart)]$V1
DT[ myeyes, Value := 3L]

# or 

mygs = DT[, ( Value[TN=="N"] != 3L ) & ( uniqueN(Value) == 1L ), by=.(ID, RegionStart)][(V1)][, V1 := NULL]
DT[ mygs, Value := 3L, on=names(mygs)]

which are rather convoluted.

Edit: And another example to update if/when this feature is available: http://stackoverflow.com/q/36292702
(2016/4/26:) http://stackoverflow.com/q/36869784
(2016/06/16:) http://stackoverflow.com/q/37855013/

franknarf1 on 29 Mar 2016

Another example from SO. It could be used to select strictly unique rows (related to #1163 ):

DT = setDT(structure(list(id = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 
2, 3, 4), dsp = c(5, 6, 7, 8, 6, 6, 7, 8, 5, 6, 9, 8, 5, 6, 7, 
NA), status = c(FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, 
TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE)), .Names = c("id", 
"dsp", "status"), row.names = c(NA, -16L), class = "data.frame"))

# my current way to select "strictly unique" rows
Bigdt[, .N, by=names(Bigdt)][N == 1][, N := NULL][]

# could be...
Bigdt[, .SD, by=names(Bigdt), having ={.N == 1L}]

Note that Bigdt[, if (.N == 1L) .SD, by=names(Bigdt)] does not work here, since .SD is empty. Maybe that could be helped by #1269 , though.

And another from SO: http://stackoverflow.com/q/38272608/ They want to select groups based on stuff in the last row, so having =heath condition[.N] == "not healthy" should do it.

And another simple case (filtering by size): http://stackoverflow.com/q/39085450/

And another, with an anti join:

ID <- c("A","A","A","B","B","C","D")
Value <- c(0,1,2,0,2,0,0)
df <- data.frame(ID,Value)

library(data.table)
setDT(df)

# use j = max() to get GForce speedup
df[ !df[, max(Value), by=ID][V1 > 0, .(ID, Value = 0)], on=.(ID, Value)]

# do the more standard thing, if j = if (...) x
df[ !df[, if (max(Value) > 0) .(Value = 0), by=ID], on=.(Value, ID) ]

# desired syntax
df[ !df[, .(Value = 0), by=ID, having = max(Value) > 0], on=.(Value, ID) ]

Not such a great example, though.

And another with an answer like dt[, if(uniqueN(time)==1L) .SD, by=name, .SDcols="time"]

And another: http://stackoverflow.com/q/43354165/

And another: http://stackoverflow.com/q/43613087/

Another (though it might get deleted): http://stackoverflow.com/q/43635968/

Another http://stackoverflow.com/a/43765352/

Another http://chat.stackoverflow.com/transcript/message/37148860#37148860

Another https://stackoverflow.com/questions/45464333/assign-a-binary-vector-based-on-blocks-of-data-within-another-vector/

Another https://stackoverflow.com/questions/32259620/how-to-remove-unique-entry-and-keep-duplicates-in-r/32259758#32259758

Un autre https://stackoverflow.com/q/45557011/

Haiyou https://stackoverflow.com/questions/45598397/filter-data-frame-matching-all-values-of-a-vector

Um mais https://stackoverflow.com/a/45721286/

lingwai yige https://stackoverflow.com/a/45820567/

and https://stackoverflow.com/q/46251221/

uno mas https://stackoverflow.com/questions/46307315/show-sequences-that-include-a-variable-in-r

tambem https://stackoverflow.com/q/46638058/

And another. I want to subset my data.table (myDT) to entries that aren't found in a reference table (idDT):

library(data.table)
idDT = data.table(id = 1:3, v = c("A","B","C"))
myDT = data.table(id = 3:4, z = c("gah","egad"))

# my attempt
idDT[myDT, on=.(id), .SD[.N == 0L], by=.EACHI]
# Empty data.table (0 rows) of 2 cols: id,v

# workaround
myDT[, .SD[idDT[.SD, on=.(id), .N == 0, by=.EACHI]$V1]]

# desired notation (with having=)
myDT[, .SD, by = id, having = idDT[.BY, on=.(id), .N]==0L]

This would be inefficient, though, since my desired notation entails each by= value making a separate join to idDT. In that sense, maybe it's not the best example.

and then https://stackoverflow.com/a/48669032/ m[, if(isTRUE(any(passed))) .SD, by=id] should be m[by = id, having = isTRUE(any(passed))]

mais um exemplo https://stackoverflow.com/q/49072250/

ein anderer https://stackoverflow.com/a/49211292/ stock_profile[, sum(Value), by=Pcode, having=any(Location=="A" & NoSales == "Y")][, sum(V1)]

mais um https://stackoverflow.com/a/49366998/

autre https://stackoverflow.com/a/49919015/

y https://stackoverflow.com/questions/50257643/deleting-rows-in-r-with-value-less-than-x

moar https://stackoverflow.com/q/54582048

e https://stackoverflow.com/q/56283005

keep groups if .N==k (also many at the dupe target) https://stackoverflow.com/questions/56794306/only-get-data-table-groups-with-a-given-number-of-rows

keep groups if any(diff(sorted_col)) <= threshold https://stackoverflow.com/q/57512417

keep if max(x) < threshold https://stackoverflow.com/a/57698641

franknarf1 on 27 May 2016

👍7

@eantonya IMHO, adding the having parameter will actually make it easier to remember. Excessive conciseness can be hard to memorize. Moreover, making data.table more like SQL isn't a bad idea.

In data.table FAQ:

2.16 I’ve heard that data.table syntax is analogous to SQL.
Yes : ...

ywhcuhk on 28 May 2016

👍1

@ywhuofu data.table already accepts order function to i argument, which is what base R user would expect. The same way we translate sql _ORDER_ to i = order(...) we can do with _HAVING_. It fits well, as i in data.frame is used for subsetting (_having_ is just delayed subsetting after aggregation) or reordering.

jangorecki on 9 Jul 2016

Could this be the API?

dt <- data.table(id   = rep(1:2, each = 2),
                 var  = c(0.2, 0.5, 1.5, 1.3))

dt[having.i(mean(var) > 1, by = id)]
  id var
1  2 1.5
2  2 1.3

I have implemented this version although it sets a restriction of only using gforce optimized functions as well as some functions that do not depend on groupings (e.g., +, |, &, etc.). In SQL, summary functions are what are used although I would understand if Cdogroups would want to be supported.

One additional note. It seems like it would be difficult to fit in dt[having(var > 3), .(var = mean(x)), by = .(grp)] within the current '[.data.table' code. There would need to be some checks to make sure the syntax were correct.
```
n = 1e6
grps = 1e5
head_n = 2L
dt = data.table::data.table(x = sample(grps, n, TRUE), y = runif(n))

A tibble: 2 x 13

expression min median

1 lw[having.i(.N < 2L | sum(y) > 11 | median(y) < 0.7, by = x)] 114.13ms 124.98ms
2 dt[dt[, .I[.N < 2L | sum(y) > 11 | median(y) < 0.7], by = x]$V1] 4000ms 4000ms

A tibble: 2 x 13

expression min median itr/sec mem_alloc gc/sec n_itr

1 lw[having.i(.N < 2L, by = x)] 30.2ms 35.3ms 27.9 8.02MB 3.99 14
2 dt[dt[, .I[.N < 2L], by = x]$V1] 106.1ms 110.4ms 8.81 6.13MB 10.6 5

ColeMiller1 on 15 Feb 2020

I would prefer this as an added parameter, named either as having= or group_filter= (or something else that doesn't rely on SQL awareness to know what it does on sight).

e.g. I think it would be confusing to combine row filters in i with group-level filters also in i

MichaelChirico on 15 Feb 2020

Would having = work on a subset of the data or would you only be able to use the i arg or the having arg? I also assume that having would occur prior to the j being evaluated. How would .BY and .GRP and soon .NGRP work with ```having = ````?

ColeMiller1 on 15 Feb 2020

There are not many syntactic choices:

Adding a new argument such as having
Leveraging existing arguments: i, j, by.

If row filter and group filter are both needed, dt[row_selector & group_selector, ...] does not look right therefore it seems that for such use case, row filter and group filter should not appear in the same argument. Then i is ruled out.

Then there won't be many syntactic choices.

Leveraging by could make it look confusing. For example,

dt[, .SD, by = having(.(id), mean(var > 1))]
dt[, .SD, by = id ~ mean(var) > 1]

Adding special function to j does not look good.

dt[, having(mean(var) > 1, .SD), by = id]

Now, the code I find look best to me is the most original version

dt[, if (mean(var) > 1) .SD, by = id]
dt[, if (mean(var) > 1) .(x = sum(x), y = sum(y)), by = id]

What I really want is keep the optimization done after the group filtering. Can we detect theif expression in j and optimize into it such as keeping GForce to work inside if statement?

renkun-ken on 15 Feb 2020

@renkun-ken Or overload another infix operator?

dt[, mean(var) > 1 ? .SD, by=id]

One advantage of a special symbol over if is that there's no chance the user put a matching else later.

franknarf1 on 15 Feb 2020

👍1

@franknarf1 It seems that while we are trying to detect if in j, we could also check if has else if and else. We could optimize if-only case and leave if-else un-optimized. Maybe later we could handle if-else case too. Personally I still prefer optimizing code to overriding or leveraging too much on existing operators.

renkun-ken on 15 Feb 2020

@franknarf1 this is cool C syntax, although not sure if it wouldn't complicate to much here.
var > 1 ? d : e could work as well, isn't it?

jangorecki on 15 Feb 2020

👍1

var > 1 ? d : e looks concise but only works for inline simple case since d and e could be anything like {...} and the operator precedence may be confusing. Are we only trying to allow .SD to perform pure group filtering, or any expression in j here?

Adding syntax has a problem that user needs to be aware that the syntax is specially handled and should not work inside j. For example, user may expect

dt[, mean(var) > 1 ? 0 : (sd(var) < 1 ? 1 : 0), by = id]

to work, and even

dt[, mean(var) > 1 ? 0 : 1]
dt[, mean(var) > 1 ? 0 : (sd(var) < 1 ? 1 : 0)]

to work in general.

renkun-ken on 15 Feb 2020

👍1

I'm a bit confused here.

Does

dt[, .SD, by = id, having = mean(var) > 1]

have any advantage over

dt[, if(mean(var) > 1) .SD, by = id]

since mean(var) > 1 will always be evaluated for each group. Does it only serve as a syntactic sugar or we are trying to optimize over this somehow to have higher performance?

renkun-ken on 15 Feb 2020

👍2

@jangorecki

@franknarf1 this is cool C syntax, although not sure if it wouldn't complicate to much here.
var > 1 ? d : e could work as well, isn't it?

Yeah, that would be cool. Operator precedence might get in the way without {}s as @renkun-ken pointed out (ex = quote(x & y ? a+b : v+w); str(rapply(as.list(ex), as.list, how="replace")))

I'm a bit confused here.

Does
dt[, .SD, by = id, having = mean(var) > 1]
have any advantage over
dt[, if(mean(var) > 1) .SD, by = id]
since mean(var) > 1 will always be evaluated for each group. Does it only serve as a syntactic sugar or we are trying to optimize over this somehow to have higher performance?

I guess until now I had preferred having= because I find it a little clearer to read and imagine it's easier to maintain as compared to adding further syntactical magic to j. On the other hand, I think I might instead prefer the j syntax magic, since

I am used to if () ... already; and like the ? way too if it's feasible.
If it is integrated in j, then no additional questions need to be answered about its behavior (eg, DT[, x := if (cond) y, by=id] creates NAs if the condition is met in some groups but not others and this behavior shouldn't need to be re-explained for having=).

Regarding optimization, it seems like there are a lot of examples where the having condition itself could benefit from some version of GForce, since it usually is an expression like max(x) > 0, max(x) == 0.

For my own use, besides the optimization, I guess it would be mostly useful for the return-only-groups case mentioned above https://github.com/Rdatatable/data.table/issues/1269

> dt[, if (mean(var) > 1) .(), by=id] 
> # instead of ...
> dt[, mean(var) > 1, by=id][V1 == TRUE, !"V1"]
   id
1:  2

franknarf1 on 15 Feb 2020

👍2

Nice points Frank. in addition to the huge compendium of use cases you've
built (thanks again btw!).

it may in fact be easier to do GForce in the having= version since we can
just apply the gforce logic to having similar to j rather than trying to do
NSE to accomplish the same.

though that may interact w Jan's WIP to move a lot of j code to C -- any
thoughts there Jan?

On Sat, Feb 15, 2020, 1:40 PM Frank notifications@github.com wrote:

@jangorecki https://github.com/jangorecki

@franknarf1 https://github.com/franknarf1 this is cool C syntax,
although not sure if it wouldn't complicate to much here.
var > 1 ? d : e could work as well, isn't it?

Yeah, that would be cool. Operator precedence might get in the way without
{}s as @renkun-ken https://github.com/renkun-ken pointed out (ex =
quote(x & y ? a+b : v+w); str(rapply(as.list(ex), as.list, how="replace"))
)

I'm a bit confused here.

Does

dt[, .SD, by = id, having = mean(var) > 1]

have any advantage over

dt[, if(mean(var) > 1) .SD, by = id]

since mean(var) > 1 will always be evaluated for each group. Does it only
serve as a syntactic sugar or we are trying to optimize over this somehow
to have higher performance?

I guess until now I had preferred having= because I find it a little
clearer to read and imagine it's easier to maintain as compared to adding
further syntactical magic to j. On the other hand, I think I might
instead prefer the j syntax magic, since

I am used to if () ... already; and like the ? way too if it's
feasible.

If it is integrated in j, then no additional questions need to be
answered about its behavior (eg, DT[, x := if (cond) y, by=id] creates
NAs if the condition is met in some groups but not others and this behavior
shouldn't need to be re-explained for having=).

Regarding optimization, it seems like there are a lot of examples where
the having condition itself could benefit from some version of GForce,
since it usually is an expression like max(x) > 0, max(x) == 0.

For my own use, besides the optimization, I guess it would be mostly
useful for the return-only-groups case mentioned above #1269
https://github.com/Rdatatable/data.table/issues/1269

dt[, if (mean(var) > 1) .(), by=id]

instead of ...

dt[, mean(var) > 1, by=id][V1 == TRUE, !"V1"]
id
1: 2

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/Rdatatable/data.table/issues/788?email_source=notifications&email_token=AB2BA5OCN4IW3N6QQJU6RJ3RC555BA5CNFSM4ATSQPMKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEL3CK7A#issuecomment-586556796,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AB2BA5MD7ZXWSRRHVEJM6C3RC555BANCNFSM4ATSQPMA
.

MichaelChirico on 15 Feb 2020

👍1

j code to be moved to C is the code that is responsible for column selection only, so guessing with argument. Won't interfere here.

jangorecki on 15 Feb 2020

Since the FR is for add a 'having' parameter..., the word having should be somewhere in the solution. Optimization for ternary operators seems like a separate issue.

My preference for having.i() is because of the mantra of data.table: subset / order in i, select in j, group in by. having is just a special case of subsetting.

Regardless, if there is a new argument having, would the API support an i arg? Most use cases do not seem to need that requirement.

ColeMiller1 on 15 Feb 2020

What should be the behavior of ordering? That is, most of the current approaches automatically re-order:

library(data.table)

dt = data.table(grp = c(1L, 2L, 1L, 2L), x = letters[sample(4L)])
dt
#>      grp      x
#>    <int> <char>
#> 1:     1      a
#> 2:     2      b
#> 3:     1      c
#> 4:     2      d
dt[dt[, .I[.N > 0L], by = grp]$V1]
#>      grp      x
#>    <int> <char>
#> 1:     1      a
#> 2:     1      c
#> 3:     2      b
#> 4:     2      d

Should the having arg return a result that is re-ordered according to by?

ColeMiller1 on 2 May 2020

What should be the behavior of ordering?
Should the having arg return a result that is re-ordered according to by?

@ColeMiller1 Fwiw, I would expect having= to only appear when by= also appears, so the results would be grouped as in your example with ...$V1.

franknarf1 on 3 May 2020

👍1

Yes I would expect the ordering to be consistent:

DT[i, j, by, having]
# < == >
DT[i, if (having) j, by]

MichaelChirico on 3 May 2020

👍1

I think there was no agreement on API, particularly on having new having argument in [. @mattdowle wdyt?
Current approach of using DT[, if (.N > 1L) .SD, col1] is good one, not really complicated, easy to extend, but little bit more difficult to optimise.
My idea was to use having as function call in i: DT[having(N > 1L), .N, col1], but then it is not possible to provide normal subsetting to i.
Alternatively new arg could be a sub-argument of by, haven't much thought about it but something like DT[, .N, by=.(col1, .having = N > 1L)], so the extra grouping-related argument is encapsulated into by argument. This is the proper way to scale up number of arguments.

jangorecki on 3 May 2020

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Integration with magrittr

my-R-help · 39Comments

fread with large csv (44 GB) takes a lot of RAM in latest data.table dev version

geponce · 30Comments

rolling functions, rolling aggregates, sliding window, moving average

jangorecki · 39Comments

memory leak

pdbailey0 · 36Comments

Possible bug / misfeature in POSIXct creation

eddelbuettel · 30Comments