When I have a large dataset with a date column read in as character, I can convert it to Date type like so:
library(data.table)

set.seed(2018)
dt <- data.table(
  DateChar = as.character(as.Date("2000-01-01") + sample(10, replace = TRUE, size = 10^7))
)
# Simple conversion using as.Date
system.time(dt[, Date1 := as.Date(DateChar)]) # ~59 seconds
# Simple conversion using as.IDate
system.time(dt[, Date2 := as.IDate(DateChar)]) # ~59 seconds
But this is painfully slow, so I usually build a table of the unique date characters, convert those to Date type, and then do a join + insertion back into my original table:
system.time({
  uniqueDates <- unique(dt[, list(DateChar)])
  uniqueDates[, Date3 := as.Date(DateChar)]
  dt[uniqueDates, Date3 := i.Date3, on = "DateChar"]
}) # ~0.5 seconds (about 120 times speedup!)
This results in an enormous speedup. I was wondering if similar logic could be embedded in data.table to do this internally where it's appropriate. I haven't given it a ton of thought, but here are some basic points:
Candidate functions might include as.Date(), day(), weekday(), ... (I don't think you could generalize this behavior to arbitrary functions. Imagine trying to apply this technique when the user does something like dt[, foo := myfunc(foo)] where myfunc <- function(x) ifelse(x < 10, x, rnorm(n = 1)). With that said, having a curated set of functions where this technique can be applied would still be greatly helpful; see the sketch below.)
Obviously I haven't fleshed out all the details of how this would work, but hopefully my idea is clear. The performance boost of this technique would be incredibly helpful to my workflow (after all, the reason I use data.table over dplyr and pandas is its superior performance). Searching through the issues, @MichaelChirico touched on this a bit in #2503
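For concreteness, a tiny sketch of why the rewrite can't be applied blindly (cumsum() is used here purely as an illustration of a function whose result depends on more than each element's own value):
x <- c(3, 1, 3)
cumsum(x)                  # 3 4 7: depends on each element's position in the vector
ux <- unique(x)
cumsum(ux)[match(x, ux)]   # 3 4 3: the lookup rewrite silently gives a different answer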
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.2
data.table_1.10.5
Matt mentioned doing something like that for fwrite:
(...copying here since comments on SO can get deleted...)
Since you have 100M rows, I suppose you have some repeating dates, so it might be faster to do something like A[, theDate := as.character(theDate[1L]), by=theDate]. fwrite is fairly new, so I'm guessing this treatment of dates is an oversight and they'd welcome a feature request on improving it. – Frank May 20 '16 at 14:05
@Frank Nice idea. Dates are on the long to-do list: #1664. Your idea triggered a thought ... fwrite could do a lookup internally rather than converting each and every date separately. – Matt Dowle Jun 21 '16 at 19:01
And looking at fwrite, I see a lookup for month and day, I think, and maybe that can be reused...?
Thanks @franknarf1, but I'm thinking more generally (not just dates). Another area where this comes up a lot for me is with regex. For example:
set.seed(2017)
str_pool <- stringi::stri_rand_strings(n = 1000, length = 20)
dt <- data.table(Foo = sample(str_pool, size = 10^7, replace = TRUE))
# Simple per-row application
system.time(dt[, NumVowels1 := stringr::str_count(Foo, "[A-Za-z]")]) # ~10.8 seconds
# Lookup on unique values + join back
system.time({
  uniqueFoos <- unique(dt[, list(Foo)])
  uniqueFoos[, NumVowels2 := stringr::str_count(Foo, "[A-Za-z]")]
  dt[uniqueFoos, NumVowels2 := i.NumVowels2, on = "Foo"]
}) # ~0.74 seconds
If I had the ability to tell data.table to apply this technique with something like dt[, NumVowels1 := stringr::str_count(Foo, "[A-Za-z]"), useLookup = TRUE], that would be ideal.
EDIT
I just got what you did, @franknarf1. In fact, system.time(dt[, NumVowels3 := stringr::str_count(Foo, "[A-Za-z]")[1L], by = Foo]) shows it runs even faster than what I had (~0.25 seconds). Still, it would be nice if this technique were more integrated with data.table. For example,
dt[, `:=`(
  NumVowels1 = stringr::str_count(Foo, "[A-Za-z]"),
  NumConsonants1 = stringr::str_count(Foo, "[^A-Za-z]"),
  ...,
  useLookup = TRUE
)]
would be convenient.
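(For the record, a sketch of one way to get that effect today, combining the by-trick above so that each regex runs once per unique Foo:)
dt[, `:=`(
  NumVowels1     = stringr::str_count(Foo[1L], "[A-Za-z]"),
  NumConsonants1 = stringr::str_count(Foo[1L], "[^A-Za-z]")
), by = Foo]  # Foo[1L] is the group's single value; the length-1 result is recycled within the group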
For this technique to be effective, the ratio nUniqueValues / nRows should be significantly less than 1. If it is close to one, then we'd just be wasting time on grouping or maintaining a dictionary. Unfortunately, figuring out the number of uniques is expensive in itself (unless it is somehow known beforehand)...
I also use this technique when I know beforehand that the number of unique values is small compared to the number of rows, which is not an uncommon scenario. I think the core message of @ben519 is that data.table could provide a switch (like useLookup) that is FALSE by default but can be turned on when required.
The question here is about the API... I think adding useLookup as an option to [.data.table is probably infeasible given the need to know the uniqueN/N ratio for it to be worthwhile, something that's generally easier for the user than for data.table to know. (Perhaps, like auto-indexing, we could keep uniqueN available as a column attribute whenever it's calculated, to facilitate this? Also related: #2458.)
You're right that doing the lookup & join by hand every time can be tedious... perhaps useLookup could instead be a _function_ that wraps around other functions to apply the technique when prompted, a la useLookup(as.Date(x)), which would know to apply as.Date using a lookup-and-join approach rather than applying as.Date to the full vector (lazy evaluation is our friend here)...
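Roughly what I have in mind, as a minimal sketch (the name useLookup is hypothetical, and it assumes the wrapped call takes the target vector as its first argument):
useLookup <- function(expr) {
  cl <- substitute(expr)                 # capture the call, e.g. as.Date(DateChar), unevaluated
  x  <- eval(cl[[2L]], parent.frame())   # evaluate just the first argument: the full vector
  ux <- unique(x)                        # work on the unique values only
  cl[[2L]] <- ux                         # splice the uniques into the captured call
  res <- eval(cl, parent.frame())        # run the wrapped function on the uniques
  res[match(x, ux)]                      # map the results back to the original rows
}
# usage would then look like: dt[, Date1 := useLookup(as.Date(DateChar))]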
We have tried lots of ways to get dates in faster. It is a huge speed issue for large datasets. There are some functions that parse dates and datetimes faster, but they have issues. lubridate has a fast_strptime() function that speeds parsing up a bit, and there are also the fasttime and iotools packages. So one option is to roll your own parser, or borrow one from somewhere else; this has to be a solved problem somewhere. I would do it myself if I had the programming ability, but that is well beyond my skill set.
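For reference, rough usage of two of the parsers mentioned (just a sketch; both return POSIX classes rather than Date/IDate, fast_strptime needs an explicit format, and fasttime assumes GMT and only handles dates from 1970 onwards):
d1 <- lubridate::fast_strptime(dt$DateChar, format = "%Y-%m-%d", lt = FALSE)  # POSIXct, no format guessing
d2 <- fasttime::fastPOSIXct(dt$DateChar)                                      # C parser for ISO-8601-style strings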
As for the ratio of uniqueN to N, you could just use a rule of thumb. There are only ~36,500 unique dates in the last 100 years, so any date column with more than, say, 400,000 values must have a ratio of less than 10% (and it is probably much lower).
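The arithmetic behind that, spelled out:
max_unique_dates <- 100 * 365   # about 36,500 distinct calendar days in 100 years
n_rows <- 400000
max_unique_dates / n_rows       # about 0.09, i.e. under the 10% threshold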
You could also omit the lookup step via:
dt[, Date3 := as.IDate(.BY[[1L]]), by = DateChar]
Maybe this could be somehow done via an additional argument?
Raising this again. Ran into it yesterday working with a data set spanning 13 days with 25 million rows: as.IDate on the whole column took forever, but as.IDate on the unique values + a merge back was far faster (especially since the data was already sorted).
Thinking of filing a PR. The idea is to add an argument to as.IDate that would trigger the merge-based method (on by default is probably the best way, given the results below).
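In spirit, the merge-based path amounts to something like this (a sketch only, with a made-up helper name; the actual code in the PR may differ, and the benchmark below uses a keyed join rather than match):
idate_via_lookup <- function(x, ...) {
  ux <- unique(x)                    # convert each distinct string once
  as.IDate(ux, ...)[match(x, ux)]    # then map the converted values back to every row
}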
Ran a quick benchmark to evaluate the benefit. Benefits are biggest when there are few unique dates relative to the number of rows (makes sense, and I think this case is common -- there are only so many days after all...)
Larger values of unq_pct are impossible for n = 1e7 since dates so large become invalid.
library(data.table)
library(microbenchmark) # for get_nanotime()
set.seed(203949550)
# grid of cases; the NA result columns let the loop be re-run after an
# interruption, skipping rows that already have timings
times = CJ(n_rows = 10^(1:7), unq_pct = 1:9/10)
times[ , c('raw_IDate', 'use_merge', 'ratio') := NA_real_]
for (ii in times[is.na(ratio), which = TRUE]) {
  # a character date column with roughly unq_pct * n_rows unique values
  input = times[ii, as.character(.Date(sample(unq_pct*n_rows, n_rows, TRUE)))]
  # (1) convert the full column directly
  t0 = get_nanotime()
  as.IDate(input)
  elapsed = get_nanotime() - t0
  times[ii, raw_IDate := elapsed]
  # (2) convert the unique values only, then join back
  t0 = get_nanotime()
  DT = data.table(input)
  DT[DT[ , TRUE, keyby = input][ , IDate := as.IDate(input)],
     on = 'input', i.IDate]
  elapsed = get_nanotime() - t0
  times[ii, use_merge := elapsed]
  print(times[ii])
}
times[ , ratio := use_merge/raw_IDate][]
# n_rows unq_pct raw_IDate use_merge ratio
# 1: 1e+01 0.1 199959 3021756 15.1118779
# 2: 1e+01 0.2 165319 3474975 21.0198162
# 3: 1e+01 0.3 176818 3166302 17.9071248
# 4: 1e+01 0.4 158889 3153484 19.8470882
# 5: 1e+01 0.5 171664 3242940 18.8912061
# 6: 1e+01 0.6 178870 3320678 18.5647565
# 7: 1e+01 0.7 186232 19897569 106.8429110
# 8: 1e+01 0.8 195861 15482463 79.0482179
# 9: 1e+01 0.9 177210 9200518 51.9187292
# 10: 1e+02 0.1 632861 20505854 32.4018292
# 11: 1e+02 0.2 885172 20330939 22.9683485
# 12: 1e+02 0.3 615945 22799364 37.0152595
# 13: 1e+02 0.4 711640 6825605 9.5913734
# 14: 1e+02 0.5 605054 3534530 5.8416769
# 15: 1e+02 0.6 637470 3570637 5.6012628
# 16: 1e+02 0.7 643126 4058829 6.3110946
# 17: 1e+02 0.8 613068 3210657 5.2370324
# 18: 1e+02 0.9 629629 3663639 5.8187266
# 19: 1e+03 0.1 4769450 3736763 0.7834788
# 20: 1e+03 0.2 5114147 4461260 0.8723371
# 21: 1e+03 0.3 5005257 5240364 1.0469720
# 22: 1e+03 0.4 5157089 5530070 1.0723239
# 23: 1e+03 0.5 5438516 6854953 1.2604455
# 24: 1e+03 0.6 5372007 7471935 1.3909019
# 25: 1e+03 0.7 5185766 8062086 1.5546567
# 26: 1e+03 0.8 5239942 7201256 1.3743007
# 27: 1e+03 0.9 5318541 7377168 1.3870661
# 28: 1e+04 0.1 52512012 11358165 0.2162965
# 29: 1e+04 0.2 53388903 17137704 0.3209975
# 30: 1e+04 0.3 54443715 23328555 0.4284894
# 31: 1e+04 0.4 55652452 29480601 0.5297269
# 32: 1e+04 0.5 55530846 35698163 0.6428529
# 33: 1e+04 0.6 55306087 36110368 0.6529185
# 34: 1e+04 0.7 55263600 40638049 0.7353493
# 35: 1e+04 0.8 55160108 41273226 0.7482441
# 36: 1e+04 0.9 55478597 43758029 0.7887371
# 37: 1e+05 0.1 351426054 80883563 0.2301581
# 38: 1e+05 0.2 326886378 165915966 0.5075646
# 39: 1e+05 0.3 354084604 212982177 0.6015008
# 40: 1e+05 0.4 336944421 283950983 0.8427235
# 41: 1e+05 0.5 335977548 330784138 0.9845424
# 42: 1e+05 0.6 336208725 338942515 1.0081312
# 43: 1e+05 0.7 346194055 369929968 1.0685625
# 44: 1e+05 0.8 340199877 402226454 1.1823239
# 45: 1e+05 0.9 342883541 418062408 1.2192548
# 46: 1e+06 0.1 3609760940 967885293 0.2681300
# 47: 1e+06 0.2 4047479480 1266096206 0.3128110
# 48: 1e+06 0.3 4380826654 1911159551 0.4362555
# 49: 1e+06 0.4 4570411455 2321844447 0.5080165
# 50: 1e+06 0.5 4298216359 3036054850 0.7063523
# 51: 1e+06 0.6 4351227178 3228498691 0.7419743
# 52: 1e+06 0.7 4704161234 3881075957 0.8250304
# 53: 1e+06 0.8 4762439856 4293166396 0.9014636
# 54: 1e+06 0.9 5161067131 3978574040 0.7708821
# 55: 1e+07 0.1 48453784641 10943946312 0.2258636
# 56: 1e+07 0.2 49662117478 15133634109 0.3047320
# 57: 1e+07 0.3 50916879803 NA NA
# 58: 1e+07 0.4 NA NA NA
# 59: 1e+07 0.5 NA NA NA
# 60: 1e+07 0.6 NA NA NA
# 61: 1e+07 0.7 NA NA NA
# 62: 1e+07 0.8 NA NA NA
# 63: 1e+07 0.9 NA NA NA
# n_rows unq_pct raw_IDate use_merge ratio
@jangorecki I don't think I follow what you have in mind for a more general version of this. Any as.* calls within [.data.table?
As of now, what I implemented in #3279 lives within as.IDate.default; let's continue the discussion there, or perhaps open a new, broader issue?
I should have read the full thread before commenting. Your PR looks like a good way to go with this.