Dplyr: No tolerance in dplyr's `all.equal.tbl_df()`

Created on 24 Nov 2016 · 8 Comments · Source: tidyverse/dplyr

I've run into an issue where calling all.equal() on a tibble (which is then dispatched to dplyr's or tibble's all_equal()) with dplyr attached has no tolerance for minutely different values, unlike all.equal() on a data frame. With only tibble attached, however, minor differences are ignored. In both cases, converting to a data frame gives the expected match with tolerance. This was discussed before in #332, and I can understand if the dplyr version does an exact comparison for speed's sake, but it took me some time to diagnose exactly what was going on.

library(tibble)
tbl1 <- tibble(foo=1, bar=2)
tbl2 <- tibble(foo=(1)-1e-16, bar=2)

all.equal(data.frame(tbl1), data.frame(tbl2)) # TRUE
all.equal(tbl1, tbl2) # TRUE

library(dplyr)
all.equal(data.frame(tbl1), data.frame(tbl2)) # TRUE
all.equal(tbl1, tbl2) # does not work

I'm not sure whether the best course of action would be to add the tolerance to the dplyr version of all_equal() or just to clarify in the documentation that tibble's all_equal() and dplyr's all_equal() behave differently. (This is with R 3.3.2, tibble 1.2, and dplyr 0.5.0.)
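
For reference, the tolerance in question can be seen with base R alone. The following is a minimal sketch, not part of the original report; the variable names `x` and `y` are illustrative. It assumes only base R, where `all.equal()` on numeric vectors uses a default tolerance of `sqrt(.Machine$double.eps)` (roughly 1.5e-8).

```r
# Two doubles that differ only in the last bit.
x <- 1
y <- 1 - 1e-16

# Exact comparison sees the difference.
exact_match <- identical(x, y)

# all.equal() with its default tolerance ignores it.
tolerant_match <- isTRUE(all.equal(x, y))

# Setting tolerance = 0 turns all.equal() into a strict comparison.
strict_match <- isTRUE(all.equal(x, y, tolerance = 0))

print(c(exact = exact_match, tolerant = tolerant_match, strict = strict_match))
```

The behavior reported above for tibbles with dplyr attached corresponds to the strict case rather than the tolerant one.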

feature

jonkeane

👍1

All 8 comments

Thanks. Adding all_equal() to tibble was probably a mistake; I'll have to dig up the reasoning behind this. I'd rather support only one implementation, and for technical reasons it has to be in dplyr at this point.

krlmlr on 27 Nov 2016

👍2

+1 for more tolerance in table comparisons

hannesmuehleisen on 3 Jan 2017

👍6

Hello krlmlr,
I tried to follow the issue here, but I am not sure I completely understand what the resolution will be... I tried to do the following:
expect_equal(unlist(tbl1), unlist(tbl2))
so as to get to use all.equal.list but that makes my QA testing quite slow...
Is there a plan to resolve the tolerance issue?

midoshammaa on 22 Apr 2017

This issue has been closed by mistake. The reproducible example in the first comment is correct.

krlmlr on 25 Apr 2017

@midoshammaa: If you just want expect_equal() to use base::all.equal() instead of dplyr::all_equal(), you could do

expect_equal(as.data.frame(tbl1), as.data.frame(tbl2))

but this comparison will now respect row and column order.
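
A small sketch of what that caveat means in practice, using base R only. The helper `sort_rows()` and the example data are hypothetical, introduced here for illustration; sorting both tables by a key column before comparing is one possible way to get back order-insensitivity, assuming such a key exists and has no ties.

```r
# Hypothetical helper: order rows by a key column and reset row names,
# so that two tables with the same rows in different order line up.
sort_rows <- function(df, key) {
  out <- df[order(df[[key]]), , drop = FALSE]
  rownames(out) <- NULL
  out
}

df1 <- data.frame(foo = c(1, 2), bar = c(3, 4))
df2 <- data.frame(foo = c(1, 2) - 1e-16, bar = c(3, 4))
df2_reordered <- df2[c(2, 1), ]

# Same row order: the tiny difference is within tolerance.
same_order <- isTRUE(all.equal(df1, df2))

# Different row order: base all.equal() reports a mismatch.
shuffled <- isTRUE(all.equal(df1, df2_reordered))

# Sorting both sides by the key first makes the comparison pass again.
after_sort <- isTRUE(all.equal(sort_rows(df1, "foo"),
                               sort_rows(df2_reordered, "foo")))

print(c(same_order = same_order, shuffled = shuffled, after_sort = after_sort))
```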

krlmlr on 25 Apr 2017

@hadley: all_equal() uses a join internally, which requires hashing. We could use a modified hash function to make all_equal() ignore minor differences for numeric columns. Please advise.

krlmlr on 25 Apr 2017

Hello, the way we ended up solving it on our end was to round numbers to a precision that is acceptable to management and risk management. As background, we use dplyr to analyze data, and the results are saved in tbls.
One of the tbls stores the results of a regression (intercept and coefficients).
We have several servers running R, and the coefficients differ on the order of 1e-14. This fails our testing framework, which performs an all.equal() on the tbls (RDS files from one system and a dynamic run on the other).
The solution for now is to round the numbers and not rely on the tolerance argument of all.equal() for tbls.
Thanks
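
The rounding approach described above might look roughly like the sketch below. This is not the commenter's actual code: the helper `round_numeric()`, the choice of 10 digits, and the example coefficients are all assumptions made for illustration. It uses base R only and works on plain data frames.

```r
# Hypothetical helper: round every numeric column to a fixed number of
# digits so that an exact comparison no longer sees noise around 1e-14.
round_numeric <- function(df, digits = 10) {
  is_num <- vapply(df, is.numeric, logical(1))
  df[is_num] <- lapply(df[is_num], round, digits = digits)
  df
}

# Illustrative regression output from two machines.
server_a <- data.frame(term = c("(Intercept)", "x"),
                       estimate = c(1.5, -0.25))
server_b <- server_a
server_b$estimate <- server_b$estimate + 1e-14

raw_match <- identical(server_a, server_b)
rounded_match <- identical(round_numeric(server_a), round_numeric(server_b))

print(c(raw = raw_match, rounded = rounded_match))
```

One caveat with this design: rounding is not the same as a tolerance. Two values that sit on opposite sides of a rounding boundary can still round to different results even though they differ by far less than the chosen precision.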

midoshammaa on 22 May 2017

Superseded by #2751