From https://github.com/hadley/dplyr/issues/1073#issuecomment-131119903
require(data.table)
DT = data.table(x=1:2, y=1:6)
#    x y
# 1: 1 1
# 2: 2 2
# 3: 1 3
# 4: 2 4
# 5: 1 5
# 6: 2 6
DT[, .SD, by=x] # rows for each group are collected together
#    x y
# 1: 1 1
# 2: 1 3
# 3: 1 5
# 4: 2 2
# 5: 2 4
# 6: 2 6
Currently, there's no way of returning just the groups:
# expected result
#    x
# 1: 1
# 2: 2
Would it be sensible to have DT[, list(), by = x] perform this kind of grouping; that is, as a special case where j is an empty list? Or is it expected that this would provide results identical to DT[, c(), by = x] in the development version of data.table?
Or, alternatively, could the syntax DT[, by = x] implicitly collapse over the grouping variable x? Right now it seems to be a no-op.
@kevinushey, IIRC we'd like DT[, , by = .] and DT[by=.] to mimic DT[, .SD, by=.], simply because that makes more sense for data.table syntax than returning just the groups.
DT[, .(), by=.] and DT[, list(), by=.] don't feel quite right to me yet, either. It seems like a very rare operation, but we should be able to do this. Will keep thinking. More suggestions are of course welcome!
Related issue - #1242
It is possible using DT[, unique(.SD), .SDcols=.]. DT[, list(), by=.] would be fine, but a j of length 0 is inconsistent with a returned value that has a non-zero number of rows. Maybe DT[, list(.), by=.] would be better? Currently that just duplicates the column(s), which could be handled.
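A minimal sketch of the unique(.SD) workaround, assuming the DT from the top of the thread and taking "x" as the grouping column (the `.` in the comment above stands for whichever columns you group on):

```r
library(data.table)

DT <- data.table(x = 1:2, y = 1:6)

# Restrict .SD to the grouping column(s), then keep only the distinct rows.
groups <- DT[, unique(.SD), .SDcols = "x"]

# groups has a single column x with one row per group (x = 1 and x = 2).
groups
```

Unlike by = x, this form does not group at all; it deduplicates the selected columns.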
Some other thoughts:
DT[, NULL, by = x]: Treat NULL specially when a group has been specified?
DT[, .NULL, by = x]: Use a special dot symbol as a proxy for 'no column(s) to generate', similar in spirit to the other .I variables?
DT[, groups = x]: Add a new parameter that performs grouping and returns the grouped data.table; operates the same as by but does not accept a j expression?
Of the options so far, I like DT[, .(), by = ...] the most - and the more I stare at it, the more it seems to be perfectly suited for this job.
I like DT[, .(), by = ...]
+1, it seems like the most natural option to me as well.
Personally, I think DT[keyby=...] is most natural and easiest to read (by and keyby should be interchangeable here).
Since you don't want j, leaving it out seems natural. Also, it leaves open the possibility of adding an i argument, which could be handy. (For example, you could want "all names / email combos that didn't get a response, _except_ known spam addresses".)
Linking to discussion here: http://stackoverflow.com/questions/36270046/is-it-possible-to-return-a-data-table-without-j also #1245
I like the @geneorama proposal. There is another issue on handling that: #1105
Could eventually be defined as .SDcols=NULL, .SDcols=character().
Actually, there is a simple syntax for doing this:
DT[, .(x = unique(x))]
@aradev In the current implementation, this is much slower than DT[ , TRUE, by = x].
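For reference, a small sketch of the two forms being compared, assuming the DT from the top of the thread. The names g1 and g2 are only illustrative, and dropping the dummy column is an extra step added here so that both results have the same shape:

```r
library(data.table)

DT <- data.table(x = 1:2, y = 1:6)

# Form 1: no grouping, just the distinct values of x.
g1 <- DT[, .(x = unique(x))]

# Form 2: group by x with a dummy j, then drop the dummy column (named V1).
g2 <- DT[, TRUE, by = x][, .(x)]

# Both contain one row per group, in order of first appearance.
stopifnot(identical(g1$x, g2$x))
```

Relative speed will depend on the data and the data.table version, so it is worth timing both on your own table (for example with system.time()) rather than relying on this toy example.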