Hello
I'm using data.table 1.12.8 in R 3.6.2.
I was trying the examples in the help on rowid and rleid and found something strange.
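library(data.table)
set.seed(100)
df = data.table(x = sample.int(4, 20, replace = TRUE), y = sample.int(4, 20, replace = TRUE))
# run-length id over all columns
df$unique_id0 <- rleidv(df)
# run-length id over x and y only
df[, unique_id2 := rleidv(df, cols = c("x", "y"))]
# group number for each (x, y) combination
df[, unique_id1 := .GRP, by = c("x", "y")]
df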
    x y unique_id0 unique_id2 unique_id1
 1: 2 3          1          1          1
 2: 3 4          2          2          2
 3: 2 2          3          3          3
 4: 4 1          4          4          4
 5: 3 4          5          5          2
 6: 1 3          6          6          5
 7: 2 3          7          7          1
 8: 2 4          8          8          6
 9: 4 2          9          9          7
10: 3 1         10         10          8
11: 4 2         11         11          7
12: 2 3         12         12          1
13: 2 4         13         13          6
14: 4 4         14         14          9
15: 3 2         15         15         10
16: 2 4         16         16          6
17: 2 4         16         16          6
18: 3 4         17         17          2
19: 3 1         18         18          8
20: 3 3         19         19         11
Shouldn't the columns unique_id0 and unique_id2 be the same as unique_id1?
Shouldn't rleid and rleidv generate a run-length type id column considering both the "x" and "y" columns together?
sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252
[3] LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C
[5] LC_TIME=Spanish_Spain.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.12.8 RevoUtils_11.0.3

loaded via a namespace (and not attached):
[1] colorspace_1.4-1 scales_1.1.1 compiler_3.6.2 R6_2.4.1 tools_3.6.2
[6] lifecycle_0.2.0 munsell_0.5.0 rlang_0.4.6.9000
I don't think they should be the same. Consider this simpler example: rleid(c(1,1,2,3,2)). We clearly have only 3 groups (1, 2 and 3), but rleid() increments every time there is a change, so the second run of 2 gets a new id of its own.
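To make the difference concrete, here is a minimal sketch using the vector from this comment. The names `v`, `run_id` and `grp_id` are only for illustration and do not appear in the original report.

```r
library(data.table)

v <- c(1, 1, 2, 3, 2)

# rleid() starts a new id at every change between consecutive values,
# so the second run of 2 gets its own id.
run_id <- rleid(v)
run_id
#> [1] 1 1 2 3 4

# .GRP numbers the distinct values in order of first appearance,
# so both 2s share the same id.
dt <- data.table(v = v)
dt[, grp_id := .GRP, by = v]
dt$grp_id
#> [1] 1 1 2 3 2
```

The two ids only agree when equal values are never separated by a different value, which is not the case for randomly sampled data.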
As @sindribaldur points out, rleid increments each time there is a change, whereas dt[, ID := .GRP, by = grp] numbers the groups in order of first appearance of grp.
library(data.table)
set.seed(123L)
dt = data.table(x = sample.int(4, 20, replace = TRUE), y = sample.int(4, 20, replace = TRUE))
dt[, unique_id0 := rleidv(.SD)] # originally: df$unique_id0 <- rleidv(df)
dt[, unique_id2 := rleidv(.SD, cols = c("x", "y"))] # originally: rleidv(df, ...)
dt[, unique_id1 := .GRP, by = c("x", "y")]
dt
#>         x     y unique_id0 unique_id2 unique_id1
#>     <int> <int>      <int>      <int>      <int>
#>  1:     3     1          1          1          1
#>  2:     3     4          2          2          2
#>  3:     3     1          3          3          1
#>  4:     2     1          4          4          3
#>  5:     3     1          5          5          1
#>  6:     2     3          6          6          4
#>  7:     2     4          7          7          5
#>  8:     2     2          8          8          6
#>  9:     3     3          9          9          7
#> 10:     1     2         10         10          8
#> 11:     4     1         11         11          9
#> 12:     2     2         12         12          6
#> 13:     2     3         13         13          4
#> 14:     1     4         14         14         10
#> 15:     2     2         15         15          6
#> 16:     3     1         16         16          1
#> 17:     4     3         17         17         11
#> 18:     1     3         18         18         12
#> 19:     3     1         19         19          1
#> 20:     3     4         20         20          2
#>         x     y unique_id0 unique_id2 unique_id1
unique(dt, by = c('x', 'y'))
#>         x     y unique_id0 unique_id2 unique_id1
#>     <int> <int>      <int>      <int>      <int>
#>  1:     3     1          1          1          1
#>  2:     3     4          2          2          2
#>  3:     2     1          4          4          3
#>  4:     2     3          6          6          4
#>  5:     2     4          7          7          5
#>  6:     2     2          8          8          6
#>  7:     3     3          9          9          7
#>  8:     1     2         10         10          8
#>  9:     4     1         11         11          9
#> 10:     1     4         14         14         10
#> 11:     4     3         17         17         11
#> 12:     1     3         18         18         12
Please re-open if there's documentation that is unclear.