Data.table: Returning only groups

Created on 14 Aug 2015 · 14 Comments · Source: Rdatatable/data.table

From https://github.com/hadley/dplyr/issues/1073#issuecomment-131119903

require(data.table)
DT = data.table(x=1:2, y=1:6)
#    x y
#1: 1 1
#2: 2 2
#3: 1 3
#4: 2 4
#5: 1 5
#6: 2 6
DT[, .SD, by=x] # rows for each group are collected together
#    x y
#1: 1 1
#2: 1 3
#3: 1 5
#4: 2 2
#5: 2 4
#6: 2 6

Currently, there's no way of returning just the groups:

# expected result
#    x
#1: 1
#2: 2

feature request

arunsrinivasan

Most helpful comment

Personally, I think DT[keyby=...] is most natural and easiest to read (by and keyby should be interchangeable here).

Since you don't want j, leaving it out seems natural. It also leaves open the possibility of adding an i argument, which could be handy. (For example, you could want "all names / email combos that didn't get a response, _except_ known spam addresses".)

geneorama on 28 Mar 2016
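
The "filter in i, then return only the groups" idea from the comment above can be approximated with functions that already exist. The following is a minimal sketch, not the proposed DT[i, keyby=...] syntax: the table `contacts`, its columns, and the `known_spam` vector are all hypothetical and made up for illustration.

```r
library(data.table)

# Hypothetical contact log; names and values are invented for this sketch.
contacts <- data.table(
  name      = c("Ann", "Ann", "Bob", "Cy", "Spam Bot"),
  email     = c("ann@example.com", "ann@example.com", "bob@example.com",
                "cy@example.com", "win@spam.example"),
  responded = c(FALSE, FALSE, TRUE, FALSE, FALSE)
)
known_spam <- c("win@spam.example")  # hypothetical block list

# Filter rows in i, select the grouping columns in j, then keep the
# distinct name/email combinations.
no_reply <- unique(
  contacts[!responded & !(email %in% known_spam), .(name, email)]
)
print(no_reply)
```

Here the duplicate "Ann" rows collapse into one, and the spam address and the contact who did respond are dropped before de-duplication.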

All 14 comments

Would it be sensible to have DT[, list(), by = x] perform this kind of grouping; that is, as a special case where j is an empty list? Or is it expected that this would provide results identical to DT[, c(), by = x] in the development version of data.table?

kevinushey on 14 Aug 2015

Or, alternatively, the syntax DT[, by = x] could implicitly collapse over the grouping variable x? Right now it seems like that is a no-op.

kevinushey on 14 Aug 2015

@kevinushey, IIRC we'd like DT[, , by = .] and DT[by=.] to mimic DT[, .SD, by=.], simply because that makes more sense for data.table syntax than returning just the groups.

DT[, .(), by=.] and DT[, list(), by=.] don't feel quite right to me yet, either. It seems like a very rare operation, but we should be able to do this. Will keep thinking. More suggestions are of course welcome!

arunsrinivasan on 14 Aug 2015

Related issue - #1242

eantonya on 14 Aug 2015

This is already possible using DT[, unique(.SD), .SDcols=.]. DT[, list(), by=.] would be fine, but a j of length 0 is inconsistent with a returned value that has a non-zero number of rows. Maybe DT[, list(.), by=.] would be better? Currently that just duplicates the column(s), but that could be handled.

jangorecki on 14 Aug 2015
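
For reference, the unique(.SD) workaround mentioned above can be tried on the table from the original report. This is a minimal sketch using only existing data.table functions; the two alternatives shown alongside it are additions for comparison, not something proposed in the thread.

```r
library(data.table)

# Same toy table as in the original report; x is recycled to length 6.
DT <- data.table(x = 1:2, y = 1:6)

# Workaround from the comment above: restrict .SD to the grouping
# column(s), then drop duplicate rows.
g1 <- DT[, unique(.SD), .SDcols = "x"]

# Two other spellings with existing functions (added for comparison).
g2 <- unique(DT[, .(x)])            # select the column(s), then de-duplicate
g3 <- unique(DT, by = "x")[, .(x)]  # de-duplicate by x, then select

print(g1)
#    x
# 1: 1
# 2: 2
```

All three return one row per distinct value of x, in order of first appearance.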

Some other thoughts:

DT[, NULL, by = x]: Treat NULL specially when a group has been specified?
DT[, .NULL, by = x]: Use a special dot symbol as a proxy for 'no column(s) to generate', similar in spirit to the other .I variables?
DT[, groups = x]: Add a new parameter that performs grouping and returns the grouped data.table; operates the same as by but does not accept a j expression?
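
None of the three spellings above is existing data.table syntax as far as this thread establishes. Until one is adopted, the intended behaviour can be emulated with a small wrapper. The sketch below is an assumption-laden illustration: `groups_only` and its `sorted` argument are hypothetical names, not part of data.table.

```r
library(data.table)

# Hypothetical helper, NOT part of data.table: returns one row per group.
# sorted = FALSE keeps groups in order of first appearance (like by);
# sorted = TRUE sorts and keys the result (like keyby).
groups_only <- function(DT, cols, sorted = FALSE) {
  out <- unique(DT[, cols, with = FALSE])  # grouping columns, de-duplicated
  if (sorted) setkeyv(out, cols)           # sort and set the key by reference
  out[]
}

DT <- data.table(x = c(2L, 1L, 2L, 1L), y = 1:4)
res_by    <- groups_only(DT, "x")                 # x = 2, 1
res_keyby <- groups_only(DT, "x", sorted = TRUE)  # x = 1, 2, keyed by x
```

Taking the column names as a character vector keeps the helper simple; supporting unquoted names the way by does would need non-standard evaluation.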