Dplyr: No tolerance in dplyr's `all.equal.tbl_df()`

Created on 24 Nov 2016 · 8 Comments · Source: tidyverse/dplyr

I've run into an issue where calling all.equal() on a tibble with dplyr attached (which dispatches to dplyr's all_equal()) has no tolerance for slightly different values, unlike all.equal() on a data frame. With only tibble attached (which dispatches to tibble's all_equal()), minor differences are ignored. In both cases, converting to a data frame gives the expected match with tolerance. This was discussed before in #332, and I can understand if the dplyr version does an exact comparison for speed's sake, but it took me some time to diagnose exactly what was going on.

library(tibble)
tbl1 <- tibble(foo=1, bar=2)
tbl2 <- tibble(foo=(1)-1e-16, bar=2)

all.equal(data.frame(tbl1), data.frame(tbl2)) # TRUE
all.equal(tbl1, tbl2) # TRUE

library(dplyr)
all.equal(data.frame(tbl1), data.frame(tbl2)) # TRUE
all.equal(tbl1, tbl2) # not TRUE: no tolerance is applied

I'm not sure whether the best course of action would be adding tolerance to the dplyr version of all_equal() or just clarifying in the documentation that tibble's all_equal() and dplyr's all_equal() behave differently. (This is with R 3.3.2, tibble 1.2, and dplyr 0.5.0.)
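For readers less familiar with the distinction, here is a minimal base R sketch of exact versus tolerant comparison, using the same two values as the example above. It involves neither dplyr nor tibble; it only illustrates why a comparison without tolerance treats the values as different. The variable names are illustrative.

```r
x <- 1
y <- 1 - 1e-16  # differs from 1 only in the last bit of a double

# Exact comparison: the two doubles are not bit-identical.
exact <- identical(x, y)

# Tolerant comparison: base all.equal() allows a small relative difference
# by default (about 1.5e-8).
tolerant <- isTRUE(all.equal(x, y))

# With tolerance = 0, all.equal() reports even this tiny difference.
strict <- isTRUE(all.equal(x, y, tolerance = 0))

print(c(exact = exact, tolerant = tolerant, strict = strict))
```

Wrapping the call in isTRUE() matters because all.equal() returns a character description of the differences, not FALSE, when the objects do not match.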

feature

All 8 comments

Thanks. Adding all_equal() to tibble was probably a mistake; I'll have to dig up the reasoning behind this. I'd rather support only one implementation, and for technical reasons it has to be in dplyr at this point.

+1 for more tolerance in table comparisons

Hello krlmlr,
I tried to follow the issue here, but I am not sure I completely understand what the resolution will be. I tried the following:
expect_equal(unlist(tbl1), unlist(tbl2))
so that all.equal.list is used, but that makes my QA testing quite slow.
Is there a plan to resolve the tolerance issue?

This issue has been closed by mistake. The reproducible example in the first comment is correct.

@midoshammaa: If you just want expect_equal() to use base::all.equal() instead of dplyr::all_equal(), you could do

expect_equal(as.data.frame(tbl1), as.data.frame(tbl2))

but this comparison will now respect row and column order.
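To make that trade-off concrete, here is a small base R sketch. Plain data frames stand in for the result of as.data.frame() on a tibble, and the column names and values are made up for illustration.

```r
df1 <- data.frame(foo = c(1, 3), bar = c(2, 4))
df2 <- data.frame(foo = c(1 - 1e-16, 3), bar = c(2, 4))

# Same rows and columns in the same order: the tiny numeric difference
# falls within the default tolerance of base all.equal().
same_order <- isTRUE(all.equal(df1, df2))

# Same data with the columns swapped: base all.equal() compares columns
# by position, so this no longer matches.
swapped_cols <- isTRUE(all.equal(df1, df2[, c("bar", "foo")]))

# Same data with the rows reversed: row order matters as well.
reversed_rows <- isTRUE(all.equal(df1, df2[2:1, ]))

print(c(same_order = same_order,
        swapped_cols = swapped_cols,
        reversed_rows = reversed_rows))
```

If row or column order is not meaningful in the test, both inputs would need to be put into a common order before the comparison.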

@hadley: all_equal() uses a join internally, which requires hashing. We could use a modified hash function to make all_equal() ignore minor differences for numeric columns. Please advise.

Hello, the way we ended up solving it on our end was to round numbers to a precision that is acceptable to management and risk management. As background, we use dplyr to analyze data, and the results are saved in tbls.
In one of the tbls the results of a regression are stored (intercept and coefficients).
We have several servers running R, and the coefficients differ on the order of 1e-14. This fails our testing framework, which performs an all.equal on the tbls (rds files from one system and a dynamic run on the other).
The solution for now is to round the numbers and not rely on the tolerance argument of all.equal for tbls.
Thanks
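A rough sketch of that rounding workaround in base R follows. The helper round_numeric() and the sample coefficient table are hypothetical and not part of dplyr or tibble; the number of significant digits is an assumption to be chosen per use case.

```r
# Hypothetical helper: round every numeric column to a fixed number of
# significant digits so that tiny cross-machine differences disappear.
round_numeric <- function(df, digits = 10) {
  is_num <- vapply(df, is.numeric, logical(1))
  df[is_num] <- lapply(df[is_num], signif, digits = digits)
  df
}

# Made-up regression output from two machines, differing by about 1e-14.
coef_a <- data.frame(term = c("(Intercept)", "x"),
                     estimate = c(0.5, 1.25),
                     stringsAsFactors = FALSE)
coef_b <- coef_a
coef_b$estimate <- coef_b$estimate + 1e-14

# An exact comparison fails on the raw values ...
before <- identical(coef_a, coef_b)

# ... and passes once both sides are rounded the same way.
after <- identical(round_numeric(coef_a), round_numeric(coef_b))

print(c(before = before, after = after))
```

Rounding reduces spurious mismatches but does not eliminate them: two values that straddle a rounding boundary can still round to different results.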

Superseded by #2751
