Data.table: Inaccurate sum of integer64 returned using by clause

Created on 13 Apr 2016  路  7Comments  路  Source: Rdatatable/data.table

Attached is a data.table with an integer64 column. Data.table returns the correct sum when the by clause is not used. However, when the by clause is used, the result is incorrect. The sum returned is interestingly the sum of the results except the final row.

Related scripts:

> setnames(DT,c("key","value"))
> DT
               key   value
1: 7355 1100047023 -168067
2: 7355 1100047023  168067
3: 7355 1100047023  141091
4: 7355 1100047023 -141091
> DT[,sum(value)]
integer64
[1] 0
> DT[,sum(value),by=key]
               key      V1
1: 7355 1100047023 -141091
> save(DT,file="integer64_sum_with_group_example.Rdata")

integer64_sum_with_group_example.zip

GForce bit64 bug

All 7 comments

Yes confirmed. GForce needs to be integer64 aware. Thanks for reporting!

Would it be possible to disable GForce if an integer64 column is detected until there is a proper fix? It is dangerous that data.table returns an incorrect result without failing or warning.

From the SO question (asked by myself), I will add some detail to this post. I also found the summing to be incorrect when negative integer64 class values are in the data.table, but I think it plays out differently than OP has described:

The sum returned is interestingly the sum of the results except the final row.

[this does not appear to be what is happening in OPs example]

By playing around with the data.table, I found that dt[, sum(col, na.rm=FALSE), by=group] is returning the negative row nearest to 0 in the data.table group as the sum.

With na.rm=TRUE, it returns the sum by group excluding the negative values.

Example:

library(data.table); library(bit64)
z <- data.table(
  group = c("A","A","A","A","B","B","B","B","B","B"),
  int64 = as.integer64(c(10,-2,20,-7,5,-8,-2,-1,-3,0))
)

z[, sum(int64, na.rm=FALSE), by=group]
#group  V1
#    A  -2
#    B  -1

z[, sum(int64, na.rm=TRUE), by=group]
#group V1
#    A 30
#    B  5

@moman822 Fyi, you see those results because gsum uses the core data (that you can see with unclass(z$int64)), which contains some NaNs.

> z[, v := replace(int64, is.nan(unclass(int64)), NA)]
> z
    group int64  v
 1:     A    10 10
 2:     A    -2 NA
 3:     A    20 20
 4:     A    -7 NA
 5:     B     5  5
 6:     B    -8 NA
 7:     B    -2 NA
 8:     B    -1 NA
 9:     B    -3 NA
10:     B     0  0

To avoid using gsum till this is supported, besides what akrun pointed out, you can wrap in parentheses:

z[, (sum(int64, na.rm = TRUE)), by=group]
z[, (sum(int64, na.rm = FALSE)), by=group]

@franknarf1

That makes sense to me for when na.rm=TRUE, but not for when na.rm=FALSE.

@moman822 The same idea applies:

library(integer64) # no need for data.table to see it
x = as.integer64(c(10,-2,20,-7))
structure(sum(unclass(x)), class="integer64")
# integer64
# [1] -2

We could take it to StackOverflow if it's still unclear (eg this chat room) since the package maintainers here on github seem to understand the problem already (from the second comment).

Was this page helpful?
0 / 5 - 0 ratings

Related issues

andschar picture andschar  路  3Comments

lux5 picture lux5  路  3Comments

DavidArenburg picture DavidArenburg  路  3Comments

tcederquist picture tcederquist  路  3Comments

jangorecki picture jangorecki  路  3Comments