When I have a large dataset with a date column read in as character, I can convert it to Date type like so:
library(data.table)

set.seed(2018)
dt <- data.table(
  DateChar = as.character(as.Date("2000-01-01") + sample(10, replace = TRUE, size = 10^7))
)
# Simple conversion using as.Date
system.time(dt[, Date1 := as.Date(DateChar)]) # ~59 seconds
# Simple conversion using as.IDate
system.time(dt[, Date2 := as.IDate(DateChar)]) # ~59 seconds
But this is painfully slow, so I usually build a table of the unique date characters, convert those to Date type, and then do a join + insertion back into my original table:
system.time({
  uniqueDates <- unique(dt[, list(DateChar)])
  uniqueDates[, Date3 := as.Date(DateChar)]
  dt[uniqueDates, Date3 := i.Date3, on = "DateChar"]
}) # ~0.5 seconds (about 120 times speedup!)
This results in an enormous speedup. I was wondering if similar logic could be embedded in data.table to do this internally where it's appropriate. I haven't given it a ton of thought, but here are some basic points:
Candidate functions might include as.Date(), day(), weekday(), ... (I don't think you could generalize this behavior to arbitrary functions. Imagine trying to apply this technique when the user does something like dt[, foo := myfunc(foo)] where myfunc <- function(x) ifelse(x < 10, x, rnorm(n = 1)). With that said, having a curated set of functions where this technique can be applied would still be greatly helpful; see the sketch below.)
Obviously I haven't fleshed out all the details of how this would work, but hopefully my idea is clear. The performance boost of this technique would be incredibly helpful to my workflow (after all, the reason I use data.table over dplyr and pandas is its superior performance). Searching through the issues, @MichaelChirico touched on this a bit in #2503
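For concreteness, a tiny sketch of why the rewrite can't be applied blindly (cumsum() is used here purely as an illustration of a function whose result depends on more than each element's own value):
x <- c(3, 1, 3)
cumsum(x)                  # 3 4 7: depends on each element's position in the vector
ux <- unique(x)
cumsum(ux)[match(x, ux)]   # 3 4 3: the lookup rewrite silently gives a different answer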
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.2
data.table_1.10.5
Matt mentioned doing something like that for fwrite:
(...copying here since comments on SO can get deleted...)
Since you have 100M rows, I suppose you have some repeating dates, so it might be faster to do something like A[, theDate := as.character(theDate[1L]), by=theDate]. fwrite is fairly new, so I'm guessing this treatment of dates is an oversight and they'd welcome a feature request on improving it. – Frank May 20 '16 at 14:05
@Frank Nice idea. Dates are on the long to-do list: #1664. Your idea triggered a thought ... fwrite could do a lookup internally rather than converting each and every date separately. – Matt Dowle Jun 21 '16 at 19:01
And looking at fwrite, I see a lookup for month and day, I think, and maybe that can be reused...?
Thanks @franknarf1, but I'm thinking more generally (not just dates). Another area where this comes up a lot for me is with regex. For example:
set.seed(2017)
str_pool <- stringi::stri_rand_strings(n = 1000, length = 20)
dt <- data.table(Foo = sample(str_pool, size = 10^7, replace = TRUE))
# Simple per-row application
system.time(dt[, NumVowels1 := stringr::str_count(Foo, "[A-Za-z]")]) # ~10.8 seconds
# Lookup on unique values + join back
system.time({
  uniqueFoos <- unique(dt[, list(Foo)])
  uniqueFoos[, NumVowels2 := stringr::str_count(Foo, "[A-Za-z]")]
  dt[uniqueFoos, NumVowels2 := i.NumVowels2, on = "Foo"]
}) # ~0.74 seconds
If I had the ability to tell data.table to apply this technique with something like dt[, NumVowels1 := stringr::str_count(Foo, "[A-Za-z]"), useLookup = TRUE], that would be ideal.
EDIT
I just got what you did, @franknarf1. In fact, system.time(dt[, NumVowels3 := stringr::str_count(Foo, "[A-Za-z]")[1L], by = Foo]) shows it runs even faster than what I had (~0.25 seconds). Still, it would be nice if this technique were more integrated with data.table. For example,
dt[, `:=`(
  NumVowels1 = stringr::str_count(Foo, "[A-Za-z]"),
  NumConsonants1 = stringr::str_count(Foo, "[^A-Za-z]"),
  ...,
  useLookup = TRUE
)]
would be convenient.
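(For the record, a sketch of one way to get that effect today, combining the by-trick above so that each regex runs once per unique Foo:)
dt[, `:=`(
  NumVowels1     = stringr::str_count(Foo[1L], "[A-Za-z]"),
  NumConsonants1 = stringr::str_count(Foo[1L], "[^A-Za-z]")
), by = Foo]  # Foo[1L] is the group's single value; the length-1 result is recycled within the group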
For this technique to be effective, the ratio nUniqueValues / nRows should be significantly less than 1. If it is close to one, then we'd just be wasting time on grouping or maintaining a dictionary. Unfortunately, figuring out the number of uniques is expensive in itself (unless it is somehow known beforehand)...
I also use this technique when I know beforehand that the number of unique values is small compared to the number of rows, which is not an uncommon scenario. I think the core message of @ben519 is that data.table could provide a switch (like useLookup) that is FALSE by default but can be turned on when required.
The question here is about the API... I think adding useLookup as an option to [.data.table is probably infeasible given the need to know the uniqueN/N ratio for it to be worthwhile, something that's generally easier for the user than for data.table to know. (Perhaps, like auto-indexing, we could keep uniqueN available as a column attribute whenever it's calculated, to facilitate this? Also related: #2458.)
You're right that doing the lookup & join by hand every time can be tedious... perhaps useLookup could instead be a _function_ that wraps around other functions to apply the technique when prompted, a la useLookup(as.Date(x)), which would know to apply as.Date using a lookup-and-join approach rather than applying as.Date to the full vector (lazy evaluation is our friend here)...
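Roughly what I have in mind, as a minimal sketch (the name useLookup is hypothetical, and it assumes the wrapped call takes the target vector as its first argument):
useLookup <- function(expr) {
  cl <- substitute(expr)                 # capture the call, e.g. as.Date(DateChar), unevaluated
  x  <- eval(cl[[2L]], parent.frame())   # evaluate just the first argument: the full vector
  ux <- unique(x)                        # work on the unique values only
  cl[[2L]] <- ux                         # splice the uniques into the captured call
  res <- eval(cl, parent.frame())        # run the wrapped function on the uniques
  res[match(x, ux)]                      # map the results back to the original rows
}
# usage would then look like: dt[, Date1 := useLookup(as.Date(DateChar))]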
We have tried lots of ways to get dates in faster. It is a huge speed issue for large datasets. There are some functions that parse dates and datetimes faster, but they have issues. lubridate has a fast_strptime() function that speeds parsing up a bit, and there are also the fasttime and iotools packages. So one option is to roll your own parser, or borrow one from somewhere else; this has to be a solved problem somewhere. I would do it myself if I had the programming ability, but that is well beyond my skill set.
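For reference, rough usage of two of the parsers mentioned (just a sketch; both return POSIX classes rather than Date/IDate, fast_strptime needs an explicit format, and fasttime assumes GMT and only handles dates from 1970 onwards):
d1 <- lubridate::fast_strptime(dt$DateChar, format = "%Y-%m-%d", lt = FALSE)  # POSIXct, no format guessing
d2 <- fasttime::fastPOSIXct(dt$DateChar)                                      # C parser for ISO-8601-style strings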
As for the ratio of uniqueN to N, you could just use a rule of thumb. There are only ~36,500 unique dates in the last 100 years, so any date column with more than, say, 400,000 values must have a ratio of less than 10% (and it is probably much lower).
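The arithmetic behind that, spelled out:
max_unique_dates <- 100 * 365   # about 36,500 distinct calendar days in 100 years
n_rows <- 400000
max_unique_dates / n_rows       # about 0.09, i.e. under the 10% threshold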
You could also omit the lookup step via:
dt[, Date3 := as.IDate(.BY[[1L]]), by = DateChar]
Maybe this could be somehow done via an additional argument?
Raising this again. Ran into it yesterday working with a data set spanning 13 days with 25 million rows: as.IDate on the whole column took forever, but as.IDate on the unique values + a merge back was far faster (especially since the data was already sorted).
Thinking of filing a PR. The idea is to add an argument to as.IDate that would trigger the merge-based method (on by default is probably the best way, given the results below).
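In spirit, the merge-based path amounts to something like this (a sketch only, with a made-up helper name; the actual code in the PR may differ, and the benchmark below uses a keyed join rather than match):
idate_via_lookup <- function(x, ...) {
  ux <- unique(x)                    # convert each distinct string once
  as.IDate(ux, ...)[match(x, ux)]    # then map the converted values back to every row
}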
Ran a quick benchmark to evaluate the benefit. Benefits are biggest when there are few unique dates relative to the number of rows (makes sense, and I think this case is common -- there are only so many days after all...)
Larger values of unq_pct are impossible for n = 1e7 since dates so large become invalid.
library(data.table)
library(microbenchmark) # for get_nanotime()
set.seed(203949550)
# grid of cases; the NA result columns let the loop be re-run after an
# interruption, skipping rows that already have timings
times = CJ(n_rows = 10^(1:7), unq_pct = 1:9/10)
times[ , c('raw_IDate', 'use_merge', 'ratio') := NA_real_]
for (ii in times[is.na(ratio), which = TRUE]) {
  # a character date column with roughly unq_pct * n_rows unique values
  input = times[ii, as.character(.Date(sample(unq_pct*n_rows, n_rows, TRUE)))]
  # (1) convert the full column directly
  t0 = get_nanotime()
  as.IDate(input)
  elapsed = get_nanotime() - t0
  times[ii, raw_IDate := elapsed]
  # (2) convert the unique values only, then join back
  t0 = get_nanotime()
  DT = data.table(input)
  DT[DT[ , TRUE, keyby = input][ , IDate := as.IDate(input)],
     on = 'input', i.IDate]
  elapsed = get_nanotime() - t0
  times[ii, use_merge := elapsed]
  print(times[ii])
}
times[ , ratio := use_merge/raw_IDate][]
# n_rows unq_pct raw_IDate use_merge ratio
# 1: 1e+01 0.1 199959 3021756 15.1118779
# 2: 1e+01 0.2 165319 3474975 21.0198162
# 3: 1e+01 0.3 176818 3166302 17.9071248
# 4: 1e+01 0.4 158889 3153484 19.8470882
# 5: 1e+01 0.5 171664 3242940 18.8912061
# 6: 1e+01 0.6 178870 3320678 18.5647565
# 7: 1e+01 0.7 186232 19897569 106.8429110
# 8: 1e+01 0.8 195861 15482463 79.0482179
# 9: 1e+01 0.9 177210 9200518 51.9187292
# 10: 1e+02 0.1 632861 20505854 32.4018292
# 11: 1e+02 0.2 885172 20330939 22.9683485
# 12: 1e+02 0.3 615945 22799364 37.0152595
# 13: 1e+02 0.4 711640 6825605 9.5913734
# 14: 1e+02 0.5 605054 3534530 5.8416769
# 15: 1e+02 0.6 637470 3570637 5.6012628
# 16: 1e+02 0.7 643126 4058829 6.3110946
# 17: 1e+02 0.8 613068 3210657 5.2370324
# 18: 1e+02 0.9 629629 3663639 5.8187266
# 19: 1e+03 0.1 4769450 3736763 0.7834788
# 20: 1e+03 0.2 5114147 4461260 0.8723371
# 21: 1e+03 0.3 5005257 5240364 1.0469720
# 22: 1e+03 0.4 5157089 5530070 1.0723239
# 23: 1e+03 0.5 5438516 6854953 1.2604455
# 24: 1e+03 0.6 5372007 7471935 1.3909019
# 25: 1e+03 0.7 5185766 8062086 1.5546567
# 26: 1e+03 0.8 5239942 7201256 1.3743007
# 27: 1e+03 0.9 5318541 7377168 1.3870661
# 28: 1e+04 0.1 52512012 11358165 0.2162965
# 29: 1e+04 0.2 53388903 17137704 0.3209975
# 30: 1e+04 0.3 54443715 23328555 0.4284894
# 31: 1e+04 0.4 55652452 29480601 0.5297269
# 32: 1e+04 0.5 55530846 35698163 0.6428529
# 33: 1e+04 0.6 55306087 36110368 0.6529185
# 34: 1e+04 0.7 55263600 40638049 0.7353493
# 35: 1e+04 0.8 55160108 41273226 0.7482441
# 36: 1e+04 0.9 55478597 43758029 0.7887371
# 37: 1e+05 0.1 351426054 80883563 0.2301581
# 38: 1e+05 0.2 326886378 165915966 0.5075646
# 39: 1e+05 0.3 354084604 212982177 0.6015008
# 40: 1e+05 0.4 336944421 283950983 0.8427235
# 41: 1e+05 0.5 335977548 330784138 0.9845424
# 42: 1e+05 0.6 336208725 338942515 1.0081312
# 43: 1e+05 0.7 346194055 369929968 1.0685625
# 44: 1e+05 0.8 340199877 402226454 1.1823239
# 45: 1e+05 0.9 342883541 418062408 1.2192548
# 46: 1e+06 0.1 3609760940 967885293 0.2681300
# 47: 1e+06 0.2 4047479480 1266096206 0.3128110
# 48: 1e+06 0.3 4380826654 1911159551 0.4362555
# 49: 1e+06 0.4 4570411455 2321844447 0.5080165
# 50: 1e+06 0.5 4298216359 3036054850 0.7063523
# 51: 1e+06 0.6 4351227178 3228498691 0.7419743
# 52: 1e+06 0.7 4704161234 3881075957 0.8250304
# 53: 1e+06 0.8 4762439856 4293166396 0.9014636
# 54: 1e+06 0.9 5161067131 3978574040 0.7708821
# 55: 1e+07 0.1 48453784641 10943946312 0.2258636
# 56: 1e+07 0.2 49662117478 15133634109 0.3047320
# 57: 1e+07 0.3 50916879803 NA NA
# 58: 1e+07 0.4 NA NA NA
# 59: 1e+07 0.5 NA NA NA
# 60: 1e+07 0.6 NA NA NA
# 61: 1e+07 0.7 NA NA NA
# 62: 1e+07 0.8 NA NA NA
# 63: 1e+07 0.9 NA NA NA
# n_rows unq_pct raw_IDate use_merge ratio
@jangorecki I don't think I follow what you have in mind for a more general version of this. Any as.* calls within [.data.table?
As of now, what I implemented in #3279 lives within as.IDate.default; let's continue the discussion there, or perhaps open a new, broader issue?
I should have read the full thread before commenting. Your PR looks like a good way to go with this.