Data.table: rleid and rleidv don't produce the expected output with multiple columns.

Created on 3 Nov 2020 · 2 Comments · Source: Rdatatable/data.table

Hello

I'm using data.table 1.12.8 in R 3.6.2.

I was trying the examples in the help on rowid and rleid and found something strange.

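library(data.table)
set.seed(100)
df = data.table(x = sample.int(4, 20, replace = TRUE), y = sample.int(4, 20, replace = TRUE))
df$unique_id0 <- rleidv(df)                          # run-length id over all columns
df[, unique_id2 := rleidv(df, cols = c("x", "y"))]   # run-length id over x and y only
df[, unique_id1 := .GRP, by = c("x", "y")]           # group id per distinct (x, y) pair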

df

    x y unique_id0 unique_id2 unique_id1
 1: 2 3          1          1          1
 2: 3 4          2          2          2
 3: 2 2          3          3          3
 4: 4 1          4          4          4
 5: 3 4          5          5          2
 6: 1 3          6          6          5
 7: 2 3          7          7          1
 8: 2 4          8          8          6
 9: 4 2          9          9          7
10: 3 1         10         10          8
11: 4 2         11         11          7
12: 2 3         12         12          1
13: 2 4         13         13          6
14: 4 4         14         14          9
15: 3 2         15         15         10
16: 2 4         16         16          6
17: 2 4         16         16          6
18: 3 4         17         17          2
19: 3 1         18         18          8
20: 3 3         19         19         11

Shouldn't the columns unique_id0 and unique_id2 be the same as unique_id1?
Shouldn't rleid and rleidv generate a run-length type id column considering both the "x" and "y" columns together?

sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252
[3] LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C
[5] LC_TIME=Spanish_Spain.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.12.8 RevoUtils_11.0.3

loaded via a namespace (and not attached):
[1] colorspace_1.4-1 scales_1.1.1 compiler_3.6.2 R6_2.4.1 tools_3.6.2
[6] lifecycle_0.2.0 munsell_0.5.0 rlang_0.4.6.9000

question

All 2 comments

I don't think they should be the same. Consider this simpler example: rleid(c(1,1,2,3,2)). We clearly have only 3 groups (1, 2 and 3), but rleid() increments every time there is a change.
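To make the difference concrete, here is a minimal sketch of that example. The names run_id and group_id are only illustrative, and match(x, unique(x)) is a base R stand-in for a group number by first appearance, not something data.table does internally.

```r
library(data.table)

x <- c(1, 1, 2, 3, 2)

# Run-length id: a new id starts whenever the value differs from the previous one.
run_id <- rleid(x)

# Group id by first appearance: position of each value among the unique values.
group_id <- match(x, unique(x))

print(run_id)    # 1 1 2 3 4
print(group_id)  # 1 1 2 3 2
```

The final 2 gets a new run id (4) because it does not directly follow the earlier 2, even though it belongs to the same group.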

As @sindribaldur points out, rleid increments each time there is a change, whereas dt[, ID := .GRP, by = grp] gives the group number, ordered by first appearance of grp.

library(data.table)
set.seed(123L)
dt = data.table(x = sample.int(4, 20, replace = TRUE), y = sample.int(4, 20, replace = TRUE))

dt[, unique_id0 := rleidv(.SD)]                       # originally df$unique_id0 <- rleidv(df)
dt[, unique_id2 := rleidv(.SD, cols = c("x", "y"))]   # originally rleidv(df, ...)
dt[, unique_id1 := .GRP, by = c("x", "y")]            # group number per distinct (x, y) pair

dt
#>         x     y unique_id0 unique_id2 unique_id1
#>     <int> <int>      <int>      <int>      <int>
#>  1:     3     1          1          1          1
#>  2:     3     4          2          2          2
#>  3:     3     1          3          3          1
#>  4:     2     1          4          4          3
#>  5:     3     1          5          5          1
#>  6:     2     3          6          6          4
#>  7:     2     4          7          7          5
#>  8:     2     2          8          8          6
#>  9:     3     3          9          9          7
#> 10:     1     2         10         10          8
#> 11:     4     1         11         11          9
#> 12:     2     2         12         12          6
#> 13:     2     3         13         13          4
#> 14:     1     4         14         14         10
#> 15:     2     2         15         15          6
#> 16:     3     1         16         16          1
#> 17:     4     3         17         17         11
#> 18:     1     3         18         18         12
#> 19:     3     1         19         19          1
#> 20:     3     4         20         20          2
#>         x     y unique_id0 unique_id2 unique_id1

unique(dt, by = c('x', 'y'))
#>         x     y unique_id0 unique_id2 unique_id1
#>     <int> <int>      <int>      <int>      <int>
#>  1:     3     1          1          1          1
#>  2:     3     4          2          2          2
#>  3:     2     1          4          4          3
#>  4:     2     3          6          6          4
#>  5:     2     4          7          7          5
#>  6:     2     2          8          8          6
#>  7:     3     3          9          9          7
#>  8:     1     2         10         10          8
#>  9:     4     1         11         11          9
#> 10:     1     4         14         14         10
#> 11:     4     3         17         17         11
#> 12:     1     3         18         18         12

Please re-open if there's documentation that is unclear.
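
For anyone who wants to check the rule by hand, here is a small sketch on a hand-made table (not the random data above). The column names run_id, run_id_manual and group_id are made up for illustration, and the cumsum() version is just one way to spell out "increment on every change" in base R, assuming that matches how you read the rleid help.

```r
library(data.table)

# Hand-made data: rows 1-2 repeat consecutively, and the pair (1, 5) reappears in row 4.
dt <- data.table(x = c(1L, 1L, 2L, 1L, 2L),
                 y = c(5L, 5L, 5L, 5L, 6L))

# Run-length id over both columns: increments whenever x or y differs from the previous row.
dt[, run_id := rleid(x, y)]

# The same rule written out in base R, to make it explicit.
n <- nrow(dt)
changed <- c(TRUE, dt$x[-1L] != dt$x[-n] | dt$y[-1L] != dt$y[-n])
dt[, run_id_manual := cumsum(changed)]

# Group id: one id per distinct (x, y) pair, numbered by first appearance.
dt[, group_id := .GRP, by = c("x", "y")]

print(dt)
```

Row 4 gets a new run_id because it does not directly follow another (1, 5) row, but it keeps group_id 1. If the table were sorted by x and y first, each distinct pair would form a single run, so rleid would then number the groups in sorted order.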
