Data.table: Grouped mean of difftime fails due to Cfastmean

Created on 20 Nov 2017 · 7 Comments · Source: Rdatatable/data.table

When trying to round an aggregated set of difftime values whilst creating a new data.table from an already existing one, unexpected warnings are thrown and the resulting values are NA. However, creating the data.table first with aggregation, _then_ rounding works correctly.

A related Stack Overflow question includes the example below.

# Minimal reproducible example

library(data.table)
dt<-data.table(
  Date1 = 
    sample(seq(as.Date('2017/10/01'), 
               as.Date('2017/10/31'), 
               by="days"), 24, replace = FALSE) +
    abs(rnorm(24))/10,
  Date2 = 
    sample(seq(as.Date('2017/10/01'), 
               as.Date('2017/10/31'), 
               by="days"), 24, replace = FALSE) +
    abs(rnorm(24))/10,
  Num1 =
    abs(rnorm(24))*10,
  Group = 
    rep(LETTERS[1:4],each=6)
)
dt[,DiffTime:=abs(difftime(Date1,Date2,units = 'days'))]

# Warnings/NA:
class(dt$DiffTime) # "difftime"
dt2<-dt[,.(AvgTime = round(mean(DiffTime),2)), by = .(Group)]

# Works when numeric/not difftime:
class(dt$Num1) # "numeric"
dt2<-dt[,.(AvgNum = round(mean(Num1),2)), by = .(Group)]

# Works, but takes an additional step:
dt2<-dt[,.(AvgTime = mean(DiffTime)), by = .(Group)]
dt2[,AvgTime := round(AvgTime,2)]

# Output of sessionInfo()

> sessionInfo()
R version 3.2.5 (2016-04-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.4

loaded via a namespace (and not attached):
[1] tools_3.2.5

# Output of sessionInfo()

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2008 R2 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.4-2

loaded via a namespace (and not attached):
[1] tools_3.4.2

All 7 comments

FYI please set.seed whenever you're including an MRE with random values
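As a minimal sketch of why this matters (the seed value 42 and the variable names are arbitrary, chosen only for illustration): resetting the same seed before a random draw replays the same values, so anyone running the example sees exactly the same data.

```r
# Dates to sample from, as in the report
oct17 <- seq(as.Date("2017-10-01"), as.Date("2017-10-31"), by = "days")

set.seed(42)  # arbitrary seed, for illustration only
first_draw <- sample(oct17, 24, replace = FALSE)

set.seed(42)  # resetting the same seed replays the same random stream
second_draw <- sample(oct17, 24, replace = FALSE)

# Both draws contain the same dates in the same order
identical(first_draw, second_draw)
```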

Confirming the bug with this adaptation of the example:

library(data.table)
set.seed(203480)
oct17 = 
  seq(as.Date('2017/10/01'), 
      as.Date('2017/10/31'), by="days")
DT = data.table(
  Date1 = 
    sample(oct17, 24, replace = FALSE) +
    abs(rnorm(24))/10,
  Date2 = 
    sample(oct17, 24, replace = FALSE) +
    abs(rnorm(24))/10,
  Group = 
    rep(LETTERS[1:4], each=6)
)

DT[ , DiffTime := abs(difftime(Date1, Date2, units = 'days'))]

#problem: non-numeric warning
options(datatable.optimize = 1)
DT[ , round(mean(DiffTime)), by = Group, verbose = TRUE]
# Detected that j uses these columns: DiffTime 
# Finding groups using forderv ... 0 sec
# Finding group sizes from the positions (can be avoided to save RAM) ... 0 sec
# lapply optimization is on, j unchanged as 'round(mean(DiffTime))'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ... 
# memcpy contiguous groups took 0.000s for 4 groups
# eval(j) took 0.000s for 4 calls
# 0.001 secs
#    Group V1
# 1:     A NA
# 2:     B NA
# 3:     C NA
# 4:     D NA
# Warning messages:
# 1: In mean(DiffTime) : argument is not numeric or logical: returning NA
# 2: In mean(DiffTime) : argument is not numeric or logical: returning NA
# 3: In mean(DiffTime) : argument is not numeric or logical: returning NA
# 4: In mean(DiffTime) : argument is not numeric or logical: returning NA

# no problem
options(datatable.optimize = 0)
DT[ , round(mean(DiffTime)), by = Group, verbose = TRUE]
# Detected that j uses these columns: DiffTime 
# Finding groups using forderv ... 0 sec
# Finding group sizes from the positions (can be avoided to save RAM) ... 0 sec
# All optimizations are turned off
# Making each group and running j (GForce FALSE) ... 
# memcpy contiguous groups took 0.000s for 4 groups
# eval(j) took 0.000s for 4 calls
# 0 secs
#    Group      V1
# 1:     A 17 days
# 2:     B  8 days
# 3:     C 17 days
# 4:     D  9 days

#works in base R
tapply(DT$DiffTime, DT$Group, function(x) round(mean(x))) 
#  A  B  C  D 
# 17  8 17  9 

Strangely, although the verbose output reports j as unchanged in both cases, the results differ. @arunsrinivasan is it possible the verbose output is wrong here, and that j has in fact been altered? It has nothing to do with round, by the way: replace it with any other function and the problem persists (clearly, since the warning comes from mean).
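One way an unchanged j could still give different results, sketched here in base R only: the expression stays the same, but the name mean resolves to a different function in the environment where it is evaluated. The rebinding below is an assumption made for illustration; nothing in this thread confirms that data.table does this internally. It does reproduce the same warning text, because mean.default skips S3 dispatch and is.numeric() is FALSE for difftime objects.

```r
# A small difftime vector and an unevaluated expression standing in for j
x <- as.difftime(c(10, 20, 30), units = "days")
j <- quote(round(mean(x)))

# Environment 1: `mean` resolves to the usual generic, so mean.difftime is used
env_generic <- new.env()
assign("x", x, envir = env_generic)
res_generic <- eval(j, env_generic)

# Environment 2 (hypothetical): `mean` is rebound to mean.default,
# which bypasses method dispatch for the difftime class
env_default <- new.env()
assign("x", x, envir = env_default)
assign("mean", base::mean.default, envir = env_default)
res_default <- suppressWarnings(eval(j, env_default))

res_generic  # a difftime of 20 days
res_default  # NA; without suppressWarnings(), mean.default warns
```

The expression j is identical in both calls; only the lookup of mean differs.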

FYI please set.seed whenever you're including an MRE with random values

Sorry, I'll try to remember for next time. Thanks for your follow up!

happy to help. in this case it didn't matter -- the bug was easy to reproduce -- but not all bugs are so glaringly obvious, in which case being able to compare exact output becomes more crucial.

thanks for identifying this bug!

I recently ran into the same problem using data.table_1.11.8. One quick work around is to use base::mean instead of mean.
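A sketch of that workaround, assuming data.table is installed. The toy table and the seed are made up for illustration and are not the data from the original report.

```r
library(data.table)

set.seed(1)  # arbitrary seed so the toy data is reproducible
DT <- data.table(
  Group    = rep(LETTERS[1:4], each = 6),
  DiffTime = as.difftime(runif(24, 0, 30), units = "days")
)

# Qualifying the call as base::mean means j contains no bare `mean`,
# so the generic is looked up in base and dispatches to mean.difftime
res <- DT[, .(AvgTime = round(base::mean(DiffTime), 2)), by = Group]
res
```

The two-step approach from the original report (aggregate first, then round in a second call) remains an alternative if you prefer to keep the plain mean call.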

Removing GForce label... error applies when datatable.optimize=1 which tells me it's the use of Cfastmean that's the issue.

@MichaelChirico I've recently revisited this with 1.12.8 and I'm no longer having problems. Was there some unrelated change that inadvertently addressed this?
