Data.table: Pattern matching for character columns compared to Factor columns

Created on 10 Oct 2020  ·  5 comments  ·  Source: Rdatatable/data.table

Hello, I would like to know the appropriate way of matching patterns in Character columns.
In the vignettes it is clearly explained that Character columns should be preferred over Factor columns.
I know of the like function in the package, but its performance seems fairly low in the problem below, where I try to detect a pattern in a Character column and a Factor column.
Is there a recommended way to do pattern matching on Character columns in data.table?

library(data.table)                                                                                            
library(stringr)                                                                                               
library(microbenchmark)

set.seed(1L)                                                                                                   
n_distinct <- 100L
n_letters <- 5L
x <- replicate(n_distinct, paste(sample(LETTERS, n_letters, replace = TRUE), collapse = ""))
n_length <- 1e6L                                                                                               
x <- sample(x = x, size = n_length, replace = TRUE)                                                            
f <- factor(x)                                                                                                 
DT <- data.table(f = f, x = x)                                                                                 

fct_detect <- function(x, ...) {
    # match the pattern against the (few) unique levels only, then map the
    # logical result back onto every row via the factor's integer codes
    stringr::str_detect(levels(x), ...)[x]
}

microbenchmark(                                                                                                
    DT1 <- DT[x %like% "SSA"],                 # the like function on Character                                                                
    DT2 <- DT[fct_detect(f, "SSA")],            # naive detect function on Factor using str_detect on the levels                                                               
    DT3 <- DT[f %like% "SSA"],                  # the like function on Factor                                                                
    DT4 <- DT[x %chin% "SSATD"],                                                                               
    DT5 <- DT[x == "SSATD"],                                                                                   
    DT6 <- DT[f == "SSATD"],                                                                                   
    DT7 <- DT[(levels(f) == "SSATD")[f]],                                                                      
    times = 50L                                                                                                
    ) -> res                                                                                                   
res                                                                                                            
# Unit: milliseconds                                                                                           
#                                  expr        min         lq       mean     median         uq        max neval
#             DT1 <- DT[x %like% "SSA"] 137.224022 139.326541 153.066482 141.602557 148.522236 237.255871    50
#       DT2 <- DT[fct_detect(f, "SSA")]   4.593743   4.896224   5.717059   5.080999   5.346516   9.457377    50
#             DT3 <- DT[f %like% "SSA"]  19.556630  21.070560  24.242968  21.944684  29.048203  32.234688    50
#           DT4 <- DT[x %chin% "SSATD"]   1.477981   1.629233   2.060127   1.709921   2.850643   3.092975    50
#               DT5 <- DT[x == "SSATD"]   1.480074   1.654964   2.023678   1.710184   2.813284   3.068752    50
#               DT6 <- DT[f == "SSATD"]   4.523973   4.854614   5.994524   5.023527   6.237378   9.392769    50
#  DT7 <- DT[(levels(f) == "SSATD")[f]]   4.462589   4.716833   5.786381   5.031112   5.205085   9.351510    50

sapply(list(DT2, DT3), FUN = identical, DT1)       
# [1] TRUE TRUE                                    
sapply(list(DT5, DT6, DT7), FUN = identical, DT4)  
# [1] TRUE TRUE TRUE                               

enhancement

All 5 comments

For problems with lots of duplicated strings, matching the regular expression on the unique levels and then mapping back is much faster, since the number of unique levels is fairly small. Under such circumstances, I suggest using the factor column.

(In fact, converting characters to factors takes time; DT2 <- DT[fct_detect(factor(x), "SSA")] may cost significantly more time to run.)
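
A rough way to illustrate that conversion cost, reusing the objects defined above (timings are machine-dependent):

microbenchmark(
    on_existing_factor = DT[fct_detect(f, "SSA")],
    converting_first   = DT[fct_detect(factor(x), "SSA")],
    times = 20L
)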

In addition, the reason DT2 is much faster than DT3 is that your implementation subsets the level-wise logical result directly, saving the time of coercing the factor values to integer and matching them again.

https://github.com/Rdatatable/data.table/blob/7bf5748d725d3a7d74da7c6b5137bd0ea87a17bd/R/like.R#L5-L7
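
To make the difference concrete, the two strategies are roughly the following (a paraphrased sketch for illustration, not verbatim code from the linked lines):

# like()-style handling of a factor: grep the levels, then coerce the factor
# to its integer codes and match those codes again with %in%
like_factor_sketch <- function(f, pattern) {
    as.integer(f) %in% grep(pattern, levels(f))
}

# fct_detect()-style handling: compute one logical per level, then index that
# result directly by the factor (i.e. by its integer codes); no second pass
fct_detect_sketch <- function(f, pattern) {
    grepl(pattern, levels(f))[f]
}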

From ?[ :

The index object i can be numeric, logical, character or empty. Indexing by factors is allowed and is equivalent to indexing by the numeric codes (see factor) and not by the character values which are printed (for which use [as.character(i)]).
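
A small base R illustration of that rule (the values here are arbitrary):

v  <- c(10, 20, 30)
f2 <- factor(c("c", "a"), levels = c("a", "b", "c"))
v[f2]                # 30 10: indexed by the integer codes 3L, 1L
v[as.character(f2)]  # NA NA: character indexing looks up names, and v has none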

I think we should improve this case based on your findings.

Hello, it seems that you beat me to the PR :\

Thanks for your answer. Can I digress a bit on this, please?

Let's run the same test as above, but without the data.table subsetting:

set.seed(1L)                                                                                               
n_distinct <- 100L
n_letters <- 5L
x <- replicate(n_distinct, paste(sample(LETTERS, n_letters, replace = TRUE), collapse = ""))
n_length <- 1e6L                                                                                           
x <- sample(x = x, size = n_length, replace = TRUE)                                                        
f <- factor(x)                                                                                             
microbenchmark(                                                                                            
    DT1 <- x %like% "SSA",                                                                                 
    DT2 <- fct_detect(f, "SSA"),                                                                           
    DT3 <- f %like% "SSA",                                                                                 
    DT4 <- x %chin% "SSATD",                                                                               
    DT5 <- x == "SSATD",                                                                                   
    DT6 <- f == "SSATD",                                                                                   
    DT7 <- (levels(f) == "SSATD")[f],                                                                      
    times = 50L                                                                                            
    ) -> res                                                                                               
res                                                                                                        
# Unit: milliseconds                                                                                       
#                              expr        min         lq       mean     median         uq        max neval
#             DT1 <- x %like% "SSA" 134.184019 134.799904 138.733458 135.376793 140.121385 164.333579    50
#       DT2 <- fct_detect(f, "SSA")   2.530200   2.609741   2.779877   2.679097   2.853188   4.179983    50
#             DT3 <- f %like% "SSA"  17.277338  17.505502  18.561721  17.905504  18.664402  25.701312    50
#           DT4 <- x %chin% "SSATD"   8.996969   9.081639   9.731109   9.418635   9.816365  15.183587    50
#               DT5 <- x == "SSATD"   4.909027   4.945915   5.174743   5.065182   5.291422   6.290251    50
#               DT6 <- f == "SSATD"   2.469650   2.542999   2.640223   2.569298   2.665493   3.129144    50
#  DT7 <- (levels(f) == "SSATD")[f]   2.443671   2.502199   2.569179   2.533719   2.600884   3.097417    50

You can see that

  • matching (equality) on a Factor is about twice as fast as on a Character column; however, once you add the data.table "subsetting", Character becomes faster (and, even better, the whole subset is actually faster than the standalone Character matching). Looking at htop, there seems to be multithreading at play.
  • Am I right?
  • If yes, why isn't the same thing implemented for Factors? (See the verbose-mode sketch below.)
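
One way to inspect what data.table does in each case is its verbose mode (the exact messages vary by version), which reports whether the optimized subsetting path is taken for a given i expression:

options(datatable.verbose = TRUE)
invisible(DT[x == "SSATD"])   # character column
invisible(DT[f == "SSATD"])   # factor column
options(datatable.verbose = FALSE)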

In my work we typically have tables of a few tens of millions of rows, and the character columns are usually "symbols" (i.e. strings) with low cardinality (hundreds of unique values). The industry standard is to use some factor-like object. I read in different places (typically this well-known SO question) that I should prefer Character.
I feel that, for pattern matching, data.table is at a loss using strings rather than factors, yet factors seem less optimized elsewhere, so I do not feel I can drop Character for Factors.

In the vignettes it is clearly explained that Character columns should be preferred over Factor columns.

Could you point to this? I would want to reword it, as I don't think this is the case, especially with all the internal parallelism in data.table that doesn't work on character columns. I don't think we can give blanket advice, but in general low-cardinality strings should be kept as factors, IMO.
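
For what it's worth, one way to act on that advice with the objects from this thread (x_fct is just an illustrative column name, and the fread() line is a hypothetical read-time variant):

DT[, x_fct := factor(x)]                             # convert a low-cardinality character column in place
# DT <- fread("file.csv", stringsAsFactors = TRUE)   # or keep strings as factors at read time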

That's not in a vignette; it's in ?setkey:

Despite its name, base::sort.list(x,method="radix") actually invokes a counting sort in R, not a radix sort. See do_radixsort in src/main/sort.c. A counting sort, however, is particularly suitable for sorting integers and factors, and we like it. In fact we like it so much that data.table contains a counting sort algorithm for character vectors using R's internal global string cache. This is particularly fast for character vectors containing many duplicates, such as grouped data in a key column. This means that character is often preferred to factor. Factors are still fully supported, in particular ordered factors (where the levels are not in alphabetic order).

I see you actually amended this SO post

I'd like to have a clearer picture of the performance of factor vs character for cases where cardinality is small (which would be typical in, say, finance). Is there such a document somewhere? If not, I can create an Rmd testing J(), ==, pattern matching and merging, if you think it could be useful (in that case I can commit the document somewhere).
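
A minimal skeleton of what such a comparison could cover (an assumed setup, just to fix ideas; J() lookups and merges could be added in the same spirit):

library(data.table)
library(microbenchmark)

set.seed(1L)
vals <- replicate(100L, paste(sample(LETTERS, 5L, replace = TRUE), collapse = ""))
DTc  <- data.table(k = sample(vals, 1e6L, replace = TRUE))   # character column
DTf  <- copy(DTc)[, k := factor(k)]                          # same data as a factor column

microbenchmark(
    chr_equal  = DTc[k == "SSATD"],
    fct_equal  = DTf[k == "SSATD"],
    chr_like   = DTc[k %like% "SSA"],
    fct_like   = DTf[k %like% "SSA"],
    chr_setkey = setkey(copy(DTc), k),
    fct_setkey = setkey(copy(DTf), k),
    times = 10L
)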

I do think that would be quite useful. I'm not sure if a vignette is the right medium for it (it would be our first such vignette) -- would a blog post be more appropriate?
