Hello, I would like to know the appropriate way of matching patterns in character columns.
In the vignettes it is clearly explained that Character columns should be preferred over Factor columns.
I know of the %like% function in the package, but its performance seems fairly low in the problem below, where I try to detect a pattern in a character column and in a factor column.
Is there a recommended way to do pattern matching on character columns in data.table?
library(data.table)
library(stringr)
library(microbenchmark)
set.seed(1L)
n_distinct <- 100L
n_letters <- 5L
x <- replicate(n_distinct, paste(sample(LETTERS, n_letters, replace = TRUE), collapse = ""))
n_length <- 1e6L
x <- sample(x = x, size = n_length, replace = TRUE)
f <- factor(x)
DT <- data.table(f = f, x = x)
fct_detect <- function(x, ...) {
  # run the regex on the (few) unique levels only, then map the logical
  # result back to each element via the factor's integer codes
  stringr::str_detect(levels(x), ...)[x]
}
microbenchmark(
DT1 <- DT[x %like% "SSA"], # the like function on Character
DT2 <- DT[fct_detect(f, "SSA")], # naive detect function on Factor using str_detect on the levels
DT3 <- DT[f %like% "SSA"], # the like function on Factor
DT4 <- DT[x %chin% "SSATD"],
DT5 <- DT[x == "SSATD"],
DT6 <- DT[f == "SSATD"],
DT7 <- DT[(levels(f) == "SSATD")[f]],
times = 50L
) -> res
res
# Unit: milliseconds
# expr min lq mean median uq max neval
# DT1 <- DT[x %like% "SSA"] 137.224022 139.326541 153.066482 141.602557 148.522236 237.255871 50
# DT2 <- DT[fct_detect(f, "SSA")] 4.593743 4.896224 5.717059 5.080999 5.346516 9.457377 50
# DT3 <- DT[f %like% "SSA"] 19.556630 21.070560 24.242968 21.944684 29.048203 32.234688 50
# DT4 <- DT[x %chin% "SSATD"] 1.477981 1.629233 2.060127 1.709921 2.850643 3.092975 50
# DT5 <- DT[x == "SSATD"] 1.480074 1.654964 2.023678 1.710184 2.813284 3.068752 50
# DT6 <- DT[f == "SSATD"] 4.523973 4.854614 5.994524 5.023527 6.237378 9.392769 50
# DT7 <- DT[(levels(f) == "SSATD")[f]] 4.462589 4.716833 5.786381 5.031112 5.205085 9.351510 50
sapply(list(DT2, DT3), FUN = identical, DT1)
# [1] TRUE TRUE
sapply(list(DT5, DT6, DT7), FUN = identical, DT4)
# [1] TRUE TRUE TRUE
For problems with lots of duplicated strings, matching the regular expression on the unique levels and then mapping back is much faster, since the number of unique levels is fairly small. Under such circumstances, I suggest using the factor column.
(Note that converting characters to factors takes time itself: DT2 <- DT[fct_detect(factor(x), "SSA")] may cost significantly more to run.)
In addition, DT2 is much faster than DT3 because your implementation subsets the levels' match result directly, saving the cost of coercing the factor values into integer values and matching again.
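A quick way to see that conversion cost (a timing sketch only; absolute numbers will vary by machine):

```r
set.seed(1L)
x <- sample(
  replicate(100L, paste(sample(LETTERS, 5L, replace = TRUE), collapse = "")),
  1e6L, replace = TRUE
)
# the one-off factor() conversion does the heavy lifting...
system.time(f <- factor(x))
# ...after which each levels-based query is cheap
system.time(res <- grepl("SSA", levels(f))[f])
```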
https://github.com/Rdatatable/data.table/blob/7bf5748d725d3a7d74da7c6b5137bd0ea87a17bd/R/like.R#L5-L7
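If %like% at that commit indeed takes the as.integer(f) %in% grep(...) route for factors, the difference can be sketched as follows (helper names are illustrative, not from data.table):

```r
like_via_match <- function(f, pattern) {
  # match the levels, then re-match each of the n integer codes
  # against the indices of the matching levels
  as.integer(f) %in% grep(pattern, levels(f))
}

like_via_index <- function(f, pattern) {
  # match the levels, then subset the logical result directly
  # by the factor (i.e. by its integer codes) -- no second matching pass
  grepl(pattern, levels(f))[f]
}

f <- factor(c("SSATD", "ABCDE", "XSSAY"))
identical(like_via_match(f, "SSA"), like_via_index(f, "SSA"))
# [1] TRUE
```

Both run the regex only on the levels; the second variant simply skips the extra %in% pass over all n codes.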
From ?[ :
The index object i can be numeric, logical, character or empty. Indexing by factors is allowed and is equivalent to indexing by the numeric codes (see factor) and not by the character values which are printed (for which use [as.character(i)]).
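That rule is easy to demonstrate (a minimal sketch, not from the thread):

```r
f <- factor(c("b", "a", "b"))  # integer codes 2, 1, 2 with levels "a" < "b"
v <- c(10, 20)                 # one slot per level
v[f]                           # indexes by the codes, not the labels
# [1] 20 10 20
identical(v[f], v[as.integer(f)])
# [1] TRUE
```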
I think we should improve this case based on your findings.
Hello, it seems that you beat me to the PR :\
Thanks for your answer. May I digress a bit on this, please?
Let's run the same test as above, but without the data.table subsetting.
set.seed(1L)
n_distinct <- 100L
n_letters <- 5L
x <- replicate(n_distinct, paste(sample(LETTERS, n_letters, replace = TRUE), collapse = ""))
n_length <- 1e6L
x <- sample(x = x, size = n_length, replace = TRUE)
f <- factor(x)
microbenchmark(
DT1 <- x %like% "SSA",
DT2 <- fct_detect(f, "SSA"),
DT3 <- f %like% "SSA",
DT4 <- x %chin% "SSATD",
DT5 <- x == "SSATD",
DT6 <- f == "SSATD",
DT7 <- (levels(f) == "SSATD")[f],
times = 50L
) -> res
res
# Unit: milliseconds
# expr min lq mean median uq max neval
# DT1 <- x %like% "SSA" 134.184019 134.799904 138.733458 135.376793 140.121385 164.333579 50
# DT2 <- fct_detect(f, "SSA") 2.530200 2.609741 2.779877 2.679097 2.853188 4.179983 50
# DT3 <- f %like% "SSA" 17.277338 17.505502 18.561721 17.905504 18.664402 25.701312 50
# DT4 <- x %chin% "SSATD" 8.996969 9.081639 9.731109 9.418635 9.816365 15.183587 50
# DT5 <- x == "SSATD" 4.909027 4.945915 5.174743 5.065182 5.291422 6.290251 50
# DT6 <- f == "SSATD" 2.469650 2.542999 2.640223 2.569298 2.665493 3.129144 50
# DT7 <- (levels(f) == "SSATD")[f] 2.443671 2.502199 2.569179 2.533719 2.600884 3.097417 50
You can see that, outside of the data.table subsetting, the factor-based approaches are the fastest across the board: even plain equality and %chin% on the character column are slower than their factor equivalents.
In my work we typically have tables of a few tens of millions of rows, and the character columns are usually "symbols" (i.e. strings) with low cardinality (hundreds of unique values). The industry standard is to use some factor-like object. I have read in different places (typically this well-known SO question) that I should prefer character.
I feel that, for pattern matching, data.table is at a loss using strings rather than factors, yet factors somehow seem less optimized overall, so I do not feel I can drop character columns for factors.
In the vignettes it is clearly explained that Character columns should be preferred over Factor columns.
Could you point to this? I would like to reword it, as I don't think it's the case, especially with all the internal parallelism in data.table that doesn't work on character columns. I don't think we can give any blanket advice, but in general low-cardinality strings should be kept as factors, IMO.
That's not in a vignette; it's in ?setkey:
Despite its name, base::sort.list(x,method="radix") actually invokes a counting sort in R, not a radix sort. See do_radixsort in src/main/sort.c. A counting sort, however, is particularly suitable for sorting integers and factors, and we like it. In fact we like it so much that data.table contains a counting sort algorithm for character vectors using R's internal global string cache. This is particularly fast for character vectors containing many duplicates, such as grouped data in a key column. This means that character is often preferred to factor. Factors are still fully supported, in particular ordered factors (where the levels are not in alphabetic order).
I see you actually amended this SO post.
I'd like to have a clearer picture of factor vs character performance for cases where cardinality is small (which would be typical in, say, finance). Is there such a document somewhere? If not, I can create an Rmd testing J(), ==, pattern matching, and merging, if you think it could be useful (in that case I can commit the document somewhere).
I do think that would be quite useful. I'm not sure a vignette is the right medium for it (it would be our first such vignette) -- would a blog post be more appropriate?