import pandas as pd
import pandas.testing
df1 = pd.DataFrame([0.00016, -0.154526, -0.20580199999999998])
df2 = pd.DataFrame([0.00015981824253685772, -0.15452557802200317, -0.20580188930034637])
pd.testing.assert_frame_equal(df1, df2, check_exact=False, check_less_precise=3)
This raises an AssertionError, even though all values are identical in the first 3 digits after the decimal point.
AssertionError: DataFrame.iloc[:, 0] are different
DataFrame.iloc[:, 0] values are different (33.33333 %)
[left]: [0.00016, -0.154526, -0.20580199999999998]
[right]: [0.00015981824253685772, -0.15452557802200317, -0.20580188930034637]
It doesn't raise if check_less_precise=2 is used instead, so something is not right here. Is there some kind of rounding issue?
Doc:
check_less_precise : bool or int, default False
Specify comparison precision. Only used when check_exact is False.
5 digits (False) or 3 digits (True) after decimal points are compared.
If int, then specify the digits to compare
I understand the doc says check_less_precise
defines how many digits after the decimal point are compared.
Unrelated: the doc should probably say "decimal point" (singular), as there is only one, no? And "specify the digits to compare" is vague; perhaps "If int, then specify how many digits after the decimal point to compare"?
Here is a proposed updated doc entry:
Specify comparison precision. Only used when check_exact is False. int: How many digits after the decimal point to compare, False: 5 digits, True: 3 digits.
Expected behavior: no assertion error for up to check_less_precise=4, since in this example the numbers only start to diverge at the fifth digit after the decimal point. It is also still unclear whether any rounding is performed.
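For reference, a small probe (a sketch, assuming the same frames and the pandas 0.24 API used above) that reports where the assertion starts firing:

import pandas as pd
import pandas.testing

df1 = pd.DataFrame([0.00016, -0.154526, -0.20580199999999998])
df2 = pd.DataFrame([0.00015981824253685772, -0.15452557802200317, -0.20580188930034637])

for n in range(1, 9):
    try:
        pd.testing.assert_frame_equal(df1, df2, check_exact=False, check_less_precise=n)
        print(f"check_less_precise={n}: passes")
    except AssertionError:
        print(f"check_less_precise={n}: raises AssertionError")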
Output of pd.show_versions():
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-43-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_CA.UTF-8
LANG: en_CA.UTF-8
LOCALE: en_CA.UTF-8
pandas: 0.24.0
pytest: 4.0.2
pip: 19.0.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.15.4
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: 2.6.9
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.2.5
bs4: 4.7.1
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
Thanks for the report! This does look strange - investigation and PRs would certainly be welcome
What I have a hard time grasping is the way this function is designed. Unless I don't understand the documentation, how can it help me to compare these two numbers:
0.6000000
0.5999999
The approach of comparing only the first n decimals is strange. These two numbers are almost identical, and no matter how many digits you set, this function will still fail if the 9's go on for a few more digits.
For example, math.isclose has a relative and an absolute tolerance, which makes total sense. So in the example above, I can ask for, say, 0.1% tolerance and those two numbers will be considered close.
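For example (a sketch using only the standard library), a 0.1% relative tolerance accepts the pair above no matter how many trailing 9's there are:

import math

# rel_tol=1e-3 means "within 0.1% of the larger magnitude"
print(math.isclose(0.6000000, 0.5999999, rel_tol=1e-3))        # True
print(math.isclose(0.6000000, 0.5999999999999, rel_tol=1e-3))  # still True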
pd.testing.assert_frame_equal's approach is just totally unclear to me.
I think the comparison is done in this function
That function, however, uses the (stricter) 0.5 constant, while the equivalent NumPy function uses 1.5. There is also a comment there now suggesting the use of NumPy's assert_allclose.
And assert_allclose calls a function that supports parameters for absolute and relative tolerance, @stas00. I tried adjusting the constant in the Pandas function to use 1.5 too, but then it becomes too lenient and several tests fail (I was preparing a pull request because I thought it would be simpler...).
Instead, perhaps it would be easier to replace the function with something like the newer function in NumPy, or with some other function?
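For illustration, a rough sketch of how the same comparison looks with NumPy's assert_allclose and explicit tolerances (the values are from the original report; the rtol/atol chosen here are only examples, not a proposal for new defaults):

import numpy as np

left = [0.00016, -0.154526, -0.20580199999999998]
right = [0.00015981824253685772, -0.15452557802200317, -0.20580188930034637]

# passes: every element is within 0.2% (relative) of the expected value
np.testing.assert_allclose(left, right, rtol=2e-3, atol=0)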
Cheers
Bruno
thank you for digging up the code, @kinow! So the description of the functionality needs to be improved - numpy's version is indeed much better explained.
What it actually does is check how small the difference between the 2 numbers is, measured in decimal places, rather than comparing a given number of decimals of each number. And then there is the factor of 1/2...
Let's rewrite:
abs(desired - actual) < (0.5 * 10.0 ** -decimal)
to:
(abs(desired - actual) * 10.0**decimal) < 0.5
so it's easier to understand.
So 2 digits gives us:
(0.6-0.599)*10**2 = 0.1 < 0.5 [True]
(0.6-0.595)*10**2 = 0.5 = 0.5 [False]
(0.6-0.590)*10**2 = 1 > 0.5 [False]
so 2 digits gives us an absolute tolerance in the range [0, 0.005), i.e. [0, 0.5*1e-2),
and 3:
(0.6-0.5999)*10**3 = 0.1 < 0.5 [True]
(0.6-0.5995)*10**3 = 0.5 = 0.5 [False]
(0.6-0.5990)*10**3 = 1 > 0.5 [False]
so 3 digits gives us an absolute tolerance in the range [0, 0.0005), i.e. [0, 0.5*1e-3),
and so n digits gives us an absolute tolerance in the range [0, 0.5*1e-n).
So the description should probably use code instead of words:
assert (abs(df2 - df1) * 10**n < 0.5).all().all(), f"frame difference is >= {0.5 * 10**-n}"
I hope I didn't miss a zero somewhere.
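A quick sketch to sanity-check the arithmetic above:

# the rewritten rule: abs(desired - actual) * 10**decimal < 0.5
def within_half(a, b, decimal):
    return abs(a - b) * 10.0 ** decimal < 0.5

print(within_half(0.6, 0.599, 2))   # True  (0.1 < 0.5)
print(within_half(0.6, 0.595, 2))   # False (~0.5 is not < 0.5)
print(within_half(0.6, 0.5999, 3))  # True  (0.1 < 0.5)
print(within_half(0.6, 0.5990, 3))  # False (1.0 > 0.5)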
Except it doesn't seem to be the right function, because if I now apply this same logic to the original failing test to emulate check_less_precise=3:
import pandas as pd
import pandas.testing
df1 = pd.DataFrame([0.00016, -0.154526, -0.20580199999999998])
df2 = pd.DataFrame([0.00015981824253685772, -0.15452557802200317, -0.20580188930034637])
df3 = abs(df1.subtract(df2))*10**3
df3
#pd.testing.assert_frame_equal(df1, df2, check_exact=False, check_less_precise=3)
I get:
0 0.000182
1 0.000422
2 0.000111
none of which is >= 0.5, i.e. it shouldn't raise.
By this logic it should only raise with check_less_precise=7 or higher, yet it starts raising at n=3 instead of n=7, so somewhere 4 decimal places are lost.
df3 = abs(df1.subtract(df2))*10**6 < 0.5
0 True
1 True
2 True
df3 = abs(df1.subtract(df2))*10**7 < 0.5
0 False
1 False
2 False
import numpy as np
import numpy.testing
np.testing.assert_array_almost_equal([.00016, -0.154526, -0.20580199999999998],
[0.00015981824253685772, -0.15452557802200317, -0.20580188930034637],
decimal=6)
This call with decimal=6 doesn't fail; with decimal=7 it does, as expected.
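That matches the rule documented for assert_array_almost_equal, abs(desired - actual) < 1.5 * 10**(-decimal) (note the 1.5 rather than pandas' 0.5). A sketch applying it to the differences above:

import numpy as np

left = np.array([0.00016, -0.154526, -0.20580199999999998])
right = np.array([0.00015981824253685772, -0.15452557802200317, -0.20580188930034637])
diff = np.abs(left - right)  # roughly [1.8e-07, 4.2e-07, 1.1e-07]

print(np.all(diff < 1.5 * 10.0 ** -6))  # True  -> decimal=6 passes
print(np.all(diff < 1.5 * 10.0 ** -7))  # False -> decimal=7 fails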
I just hit this problem. Unsure if this is still on anyone's radar, but it was pretty surprising for me. I also used numpy functions (np.isclose instead of np.testing.assert_array_almost_equal, which I'll move to in the future) to get around it.
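For anyone else hitting this, a sketch of that kind of workaround (comparing the underlying values with np.isclose and explicit tolerances; the tolerances below are just placeholders):

import numpy as np
import pandas as pd

df1 = pd.DataFrame([0.00016, -0.154526, -0.20580199999999998])
df2 = pd.DataFrame([0.00015981824253685772, -0.15452557802200317, -0.20580188930034637])

# passes: every element is within the given relative/absolute tolerance
assert np.isclose(df1.values, df2.values, rtol=1e-2, atol=1e-6).all()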
If there is interest in updating this parameter, it seems to me like @kinow's suggestion of using these numpy functions is a good path forward, though I'm far from an expert on this.
Dear everybody, any update on this? I'm trying to compare only 2 decimals but it seems it still checks 3...
I just ran into this error with some code I am writing. I have the check_less_precise and check_exact arguments set but still get an assertion error. The message prints the same numbers out to the maximum print precision of 15 decimal places.
Hi @stas00
thank you for digging up the code, @kinow! So the description of the functionality needs to be improved - numpy's version is indeed much better explained.
+1
Except it doesn't seem to be the right function, since
I'm also starting to think that that function may not be the best for what is documented in assert_frame_equal. Here are other ways to trigger the error.
import pandas as pd
df1 = pd.DataFrame([0.15])
df2 = pd.DataFrame([0.16])
pd.testing.assert_frame_equal(df1, df2, check_exact=False, check_less_precise=1)
Or
import pandas as pd
df1 = pd.DataFrame([0.099999])
df2 = pd.DataFrame([0.09]) # 0.099 will pass
pd.testing.assert_frame_equal(df1, df2, check_exact=False, check_less_precise=1)
The function I mentioned before is not actually called with these values.
So for a=0.15, b=0.16, and decimal=1, the check abs(0.15) < 1e-5 doesn't pass, and we end up in the else block, which then does:
if not decimal_almost_equal(1, fb / fa, decimal):
# or
if not decimal_almost_equal(1, 0.15 / 0.16, 1):
# or
if not decimal_almost_equal(1, 0.9375, 1):
# which will be
abs(desired - actual) < (0.5 * 10.0 ** -decimal)
# solving it
abs(1 - 0.9375) < (0.05)
0.0625 < 0.05
In this case the ratio is not close enough, so the function fails. However, the comparison is documented as being based on the digits after the decimal point. So with decimal=1, from what I understand, it should take 0.15 and 0.16 and compare only 0.1 == 0.1, i.e. using only 1 decimal.
If, instead of the ratio, we call the function directly as decimal_almost_equal(0.15, 0.16, 1), then it works OK.
However, if we use the other example pair, 0.099999 and 0.01, with decimal=1:
abs(a - b) < (0.5 * 10.0 ** -decimal)
abs(0.099999 - 0.01) < 0.05
0.08999900000000001 < 0.05
Still fails. So it looks like decimal_almost_equal is not the right function for this comparison? I have a working function in my notebook, but it uses the simplest approach: it truncates the values instead of comparing differences, ratios, etc. Will prepare a PR soon for discussion :+1:
Not super confident that that is the proper solution though, so happy if others chime in with their suggestions.
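For discussion, a minimal sketch of the truncation idea (my own illustration, not the actual function from the notebook/PR):

import math

def truncate(value, decimal):
    # drop everything after `decimal` digits past the decimal point (no rounding)
    factor = 10.0 ** decimal
    return math.trunc(value * factor) / factor

def almost_equal_truncated(a, b, decimal):
    return truncate(a, decimal) == truncate(b, decimal)

print(almost_equal_truncated(0.15, 0.16, 1))      # True  (0.1 == 0.1)
print(almost_equal_truncated(0.099999, 0.09, 1))  # True  (0.0 == 0.0)
print(almost_equal_truncated(0.099999, 0.09, 3))  # False (0.099 != 0.090)

One caveat of truncation is that nearly identical values straddling a digit boundary (e.g. 0.6000000 vs 0.5999999 with one decimal) still compare unequal, which is exactly the objection raised earlier in the thread.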
Hmm, maybe I spoke too soon.
This commit has a unit test with the examples discussed here: https://github.com/kinow/pandas/commit/f45be0e87743a35b3c0e7ca4323e8f7e11149ec7
The test passes, but several other tests fail. For example,
# test_timeseries.test_pct_change_shift_over_nas
def test_pct_change_shift_over_nas(self):
    s = Series([1.0, 1.5, np.nan, 2.5, 3.0])
    chg = s.pct_change()
    expected = Series([np.nan, 0.5, 0.0, 2.5 / 1.5 - 1, 0.2])
    tm.assert_series_equal(chg, expected)
Fails with
E AssertionError: Series are different
E
E Series values are different (20.0 %)
E [left]: [nan, 0.5, 0.0, 0.6666666666666667, 0.19999999999999996]
E [right]: [nan, 0.5, 0.0, 0.6666666666666667, 0.2]
The values that fail are 0.19999999999999996 and 0.2 (with check_less_precise=False, so decimal=5). Not sure if it is following what's in the docs - maybe we just need to update the docs after all?
check_less_precise : bool or int, default False
Specify comparison precision. Only used when check_exact is False.
5 digits (False) or 3 digits (True) after decimal points are compared.
If int, then specify the digits to compare.
This part is the most confusing for me: "digits (...) after decimal points are compared". If we have 5 digits, and 0.19999999999999996 and 0.2, the parts after the decimal points are 19999999999999996 and 2. Assuming we are to use only the 5 digits, then 19999 and 2 would be compared?
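A quick sketch contrasting the two readings on exactly that pair (with decimal=5): the ratio-based check the current code performs for values that are not near zero (see the trace earlier in the thread) accepts it, while a literal "truncate to 5 digits" reading rejects it, which would explain the failing test:

import math

a, b, decimal = 0.19999999999999996, 0.2, 5

# ratio-based check, as currently applied for values that are not near zero
print(abs(1 - a / b) < 0.5 * 10.0 ** -decimal)           # True: the ratio is ~1 to 16 digits

# literal "compare the first 5 digits after the decimal point" reading
factor = 10.0 ** decimal
print(math.trunc(a * factor) == math.trunc(b * factor))  # False: 19999 vs 20000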
Any updates? It's been several releases, but the problem seems to persist.
Any updates?