import pandas as pd
import pandas.testing
df1 = pd.DataFrame([0.00016, -0.154526, -0.20580199999999998])
df2 = pd.DataFrame([0.00015981824253685772, -0.15452557802200317, -0.20580188930034637])
pd.testing.assert_frame_equal(df1, df2, check_exact=False, check_less_precise=3)
This raises an AssertionError, even though all values are identical in the first 3 digits after the decimal point.
AssertionError: DataFrame.iloc[:, 0] are different
DataFrame.iloc[:, 0] values are different (33.33333 %)
[left]: [0.00016, -0.154526, -0.20580199999999998]
[right]: [0.00015981824253685772, -0.15452557802200317, -0.20580188930034637]
It doesn't raise if check_less_precise=2 is used instead, so something is not right here. Is there some kind of rounding issue?
Doc:
check_less_precise : bool or int, default False
Specify comparison precision. Only used when check_exact is False.
5 digits (False) or 3 digits (True) after decimal points are compared.
If int, then specify the digits to compare
I understand the doc says check_less_precise
defines how many digits after the decimal point are compared.
Unrelated: the doc should probably say "decimal point" (singular), as there is only one, no? And "specify the digits to compare" is vague; perhaps "If int, then specify how many digits after the decimal point to compare"?
Here is a proposed updated doc entry:
Specify comparison precision. Only used when check_exact is False. int: How many digits after the decimal point to compare, False: 5 digits, True: 3 digits.
Expected behavior: no assertion error for up to check_less_precise=4, since in this example the numbers only start to diverge at the fifth digit after the decimal point. It is also still unclear whether any rounding is performed.
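For reference, a small probe (a sketch, assuming the same frames and the pandas 0.24 API used above) that reports where the assertion starts firing:

import pandas as pd
import pandas.testing

df1 = pd.DataFrame([0.00016, -0.154526, -0.20580199999999998])
df2 = pd.DataFrame([0.00015981824253685772, -0.15452557802200317, -0.20580188930034637])

for n in range(1, 9):
    try:
        pd.testing.assert_frame_equal(df1, df2, check_exact=False, check_less_precise=n)
        print(f"check_less_precise={n}: passes")
    except AssertionError:
        print(f"check_less_precise={n}: raises AssertionError")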
Output of pd.show_versions():
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-43-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_CA.UTF-8
LANG: en_CA.UTF-8
LOCALE: en_CA.UTF-8
pandas: 0.24.0
pytest: 4.0.2
pip: 19.0.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.15.4
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: 2.6.9
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.2.5
bs4: 4.7.1
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
Thanks for the report! This does look strange - investigation and PRs would certainly be welcome
What I have a hard time grasping is the way this function is designed. Unless I don't understand the documentation, how can it help me to compare these two numbers:
0.6000000
0.5999999
The approach of comparing only the first n decimals is strange. These two numbers are almost identical, and no matter how many digits you set, this function will still fail if the 9's go on for a few more digits.
For example, math.isclose has a relative and an absolute tolerance, which makes total sense. So in the example above, I can ask for, say, 0.1% tolerance and those two numbers will be considered close.
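For example (a sketch using only the standard library), a 0.1% relative tolerance accepts the pair above no matter how many trailing 9's there are:

import math

# rel_tol=1e-3 means "within 0.1% of the larger magnitude"
print(math.isclose(0.6000000, 0.5999999, rel_tol=1e-3))        # True
print(math.isclose(0.6000000, 0.5999999999999, rel_tol=1e-3))  # still True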
pd.testing.assert_frame_equal's approach is just totally unclear to me.
I think the comparison is done in this function
That function, however, uses the (stricter) 0.5 constant, while the equivalent NumPy function uses 1.5. There is also a comment there now suggesting the use of NumPy's assert_allclose.
And assert_allclose calls a function that supports parameters for absolute and relative tolerance, @stas00. I tried adjusting the constant in the Pandas function to use 1.5 too, but then it becomes too lenient and several tests fail (I was preparing a pull request because I thought it would be simpler...).
Instead, perhaps it would be easier to replace the function with something like the newer function in NumPy, or with some other function?
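For illustration, a rough sketch of how the same comparison looks with NumPy's assert_allclose and explicit tolerances (the values are from the original report; the rtol/atol chosen here are only examples, not a proposal for new defaults):

import numpy as np

left = [0.00016, -0.154526, -0.20580199999999998]
right = [0.00015981824253685772, -0.15452557802200317, -0.20580188930034637]

# passes: every element is within 0.2% (relative) of the expected value
np.testing.assert_allclose(left, right, rtol=2e-3, atol=0)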
Cheers
Bruno
thank you for digging up the code, @kinow! So the description of the functionality needs to be improved - numpy's version is indeed much better explained.
What it actually does is check how small the difference between the 2 numbers is, measured in decimal places, rather than comparing a given number of decimals of each number. And then there is the factor of 1/2...
Let's rewrite:
abs(desired - actual) < (0.5 * 10.0 ** -decimal)
to:
(abs(desired - actual) * 10.0**decimal) < 0.5
so it's easier to understand.
So 2 digits gives us:
(0.6-0.599)*10**2 = 0.1 < 0.5 [True]
(0.6-0.595)*10**2 = 0.5 = 0.5 [False]
(0.6-0.590)*10**2 = 1 > 0.5 [False]
so 2 digits gives us an absolute tolerance in the range [0, 0.005), i.e. [0, 0.5*1e-2),
and 3:
(0.6-0.5999)*10**3 = 0.1 < 0.5 [True]
(0.6-0.5995)*10**3 = 0.5 = 0.5 [False]
(0.6-0.5990)*10**3 = 1 > 0.5 [False]
so 3 digits gives us an absolute tolerance in the range [0, 0.0005), i.e. [0, 0.5*1e-3),
and so n digits gives us an absolute tolerance in the range [0, 0.5*1e-n).
So the description should probably use code instead of words:
assert (abs(df2 - df1) * 10**n < 0.5).all().all(), f"frame difference is >= {0.5 * 10**-n}"
I hope I didn't miss a zero somewhere.
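A quick sketch to sanity-check the arithmetic above:

# the rewritten rule: abs(desired - actual) * 10**decimal < 0.5
def within_half(a, b, decimal):
    return abs(a - b) * 10.0 ** decimal < 0.5

print(within_half(0.6, 0.599, 2))   # True  (0.1 < 0.5)
print(within_half(0.6, 0.595, 2))   # False (~0.5 is not < 0.5)
print(within_half(0.6, 0.5999, 3))  # True  (0.1 < 0.5)
print(within_half(0.6, 0.5990, 3))  # False (1.0 > 0.5)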
Except it doesn't seem to be the right function, because if I now apply this same logic to the original failing test to emulate check_less_precise=3:
import pandas as pd
import pandas.testing
df1 = pd.DataFrame([0.00016, -0.154526, -0.20580199999999998])
df2 = pd.DataFrame([0.00015981824253685772, -0.15452557802200317, -0.20580188930034637])
df3 = abs(df1.subtract(df2))*10**3
df3
#pd.testing.assert_frame_equal(df1, df2, check_exact=False, check_less_precise=3)
I get:
0 0.000182
1 0.000422
2 0.000111
none of which is >= 0.5, i.e. it shouldn't raise.
By this logic it should only raise with check_less_precise=7 or higher, yet it starts raising at n=3 instead of n=7, so somewhere 4 decimal places are lost.
df3 = abs(df1.subtract(df2))*10**6 < 0.5
0 True
1 True
2 True
df3 = abs(df1.subtract(df2))*10**7 < 0.5
0 False
1 False
2 False
import numpy as np
import numpy.testing
np.testing.assert_array_almost_equal([.00016, -0.154526, -0.20580199999999998],
[0.00015981824253685772, -0.15452557802200317, -0.20580188930034637],
decimal=6)
This call with decimal=6 doesn't fail; with decimal=7 it does, as expected.
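That matches the rule documented for assert_array_almost_equal, abs(desired - actual) < 1.5 * 10**(-decimal) (note the 1.5 rather than pandas' 0.5). A sketch applying it to the differences above:

import numpy as np

left = np.array([0.00016, -0.154526, -0.20580199999999998])
right = np.array([0.00015981824253685772, -0.15452557802200317, -0.20580188930034637])
diff = np.abs(left - right)  # roughly [1.8e-07, 4.2e-07, 1.1e-07]

print(np.all(diff < 1.5 * 10.0 ** -6))  # True  -> decimal=6 passes
print(np.all(diff < 1.5 * 10.0 ** -7))  # False -> decimal=7 fails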
I just hit this problem. Unsure if this is still on anyone's radar, but it was pretty surprising for me. I also used numpy functions (np.isclose instead of np.testing.assert_array_almost_equal, which I'll move to in the future) to get around it.
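For anyone else hitting this, a sketch of that kind of workaround (comparing the underlying values with np.isclose and explicit tolerances; the tolerances below are just placeholders):

import numpy as np
import pandas as pd

df1 = pd.DataFrame([0.00016, -0.154526, -0.20580199999999998])
df2 = pd.DataFrame([0.00015981824253685772, -0.15452557802200317, -0.20580188930034637])

# passes: every element is within the given relative/absolute tolerance
assert np.isclose(df1.values, df2.values, rtol=1e-2, atol=1e-6).all()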
If there is interest in updating this parameter, it seems to me like @kinow's suggestion of using these numpy functions is a good path forward, though I'm far from an expert on this.
Dear everybody, any update on this? I'm trying to compare only 2 decimals but it seems it still checks 3...
I just ran into this error with some code I am writing. I have the check_less_precise and check_exact arguments set but still get an assertion error. The message prints the same numbers out to the maximum print precision of 15 decimal places.
Hi @stas00
thank you for digging up the code, @kinow! So the description of the functionality needs to be improved - numpy's version is indeed much better explained.
+1
Except it doesn't seem to be the right function, since
I'm also starting to think that that function may not be the best for what is documented in assert_frame_equal. Here are other ways to trigger the error.
import pandas as pd
df1 = pd.DataFrame([0.15])
df2 = pd.DataFrame([0.16])
pd.testing.assert_frame_equal(df1, df2, check_exact=False, check_less_precise=1)
Or
import pandas as pd
df1 = pd.DataFrame([0.099999])
df2 = pd.DataFrame([0.09]) # 0.099 will pass
pd.testing.assert_frame_equal(df1, df2, check_exact=False, check_less_precise=1)
The function I mentioned before is not actually called with these values.
So for a=0.15, b=0.16, and decimal=1, the check abs(0.15) < 1e-5 doesn't pass, and we end up in the else block, which then does:
if not decimal_almost_equal(1, fb / fa, decimal):
# or
if not decimal_almost_equal(1, 0.15 / 0.16, 1):
# or
if not decimal_almost_equal(1, 0.9375, 1):
# which will be
abs(desired - actual) < (0.5 * 10.0 ** -decimal)
# solving it
abs(1 - 0.9375) < (0.05)
0.0625 < 0.05
In this case the ratio is not close enough, so the function fails. However, the comparison is documented as being based on the digits after the decimal point. So with decimal=1, from what I understand, it should take 0.15 and 0.16 and compare only 0.1 == 0.1, i.e. using only 1 decimal.
If, instead of the ratio, we call the function directly as decimal_almost_equal(0.15, 0.16, 1), then it works OK.
However, if we use the other example pair, 0.099999 and 0.01, with decimal=1:
abs(a - b) < (0.5 * 10.0 ** -decimal)
abs(0.099999 - 0.01) < 0.05
0.08999900000000001 < 0.05
Still fails. So it looks like decimal_almost_equal is not the right function for this comparison? I have a working function in my notebook, but it uses the simplest approach: it truncates the values instead of comparing differences, ratios, etc. Will prepare a PR soon for discussion :+1:
Not super confident that that is the proper solution though, so happy if others chime in with their suggestions.
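For discussion, a minimal sketch of the truncation idea (my own illustration, not the actual function from the notebook/PR):

import math

def truncate(value, decimal):
    # drop everything after `decimal` digits past the decimal point (no rounding)
    factor = 10.0 ** decimal
    return math.trunc(value * factor) / factor

def almost_equal_truncated(a, b, decimal):
    return truncate(a, decimal) == truncate(b, decimal)

print(almost_equal_truncated(0.15, 0.16, 1))      # True  (0.1 == 0.1)
print(almost_equal_truncated(0.099999, 0.09, 1))  # True  (0.0 == 0.0)
print(almost_equal_truncated(0.099999, 0.09, 3))  # False (0.099 != 0.090)

One caveat of truncation is that nearly identical values straddling a digit boundary (e.g. 0.6000000 vs 0.5999999 with one decimal) still compare unequal, which is exactly the objection raised earlier in the thread.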
Hmm, maybe I spoke too soon.
This commit has a unit test with the examples discussed here: https://github.com/kinow/pandas/commit/f45be0e87743a35b3c0e7ca4323e8f7e11149ec7
The test passes, but several other tests fail. For example,
# test_timeseries.test_pct_change_shift_over_nas
def test_pct_change_shift_over_nas(self):
    s = Series([1.0, 1.5, np.nan, 2.5, 3.0])
    chg = s.pct_change()
    expected = Series([np.nan, 0.5, 0.0, 2.5 / 1.5 - 1, 0.2])
    tm.assert_series_equal(chg, expected)
Fails with
E AssertionError: Series are different
E
E Series values are different (20.0 %)
E [left]: [nan, 0.5, 0.0, 0.6666666666666667, 0.19999999999999996]
E [right]: [nan, 0.5, 0.0, 0.6666666666666667, 0.2]
The values that fail are 0.19999999999999996 and 0.2 (with check_less_precise=False, so decimal=5). Not sure if it is following what's in the docs - maybe we just need to update the docs after all?
check_less_precise : bool or int, default False
Specify comparison precision. Only used when check_exact is False.
5 digits (False) or 3 digits (True) after decimal points are compared.
If int, then specify the digits to compare.
This part is the most confusing for me: "digits (...) after decimal points are compared". If we have 5 digits, and 0.19999999999999996 and 0.2, the parts after the decimal points are 19999999999999996 and 2. Assuming we are to use only the 5 digits, then 19999 and 2 would be compared?
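A quick sketch contrasting the two readings on exactly that pair (with decimal=5): the ratio-based check the current code performs for values that are not near zero (see the trace earlier in the thread) accepts it, while a literal "truncate to 5 digits" reading rejects it, which would explain the failing test:

import math

a, b, decimal = 0.19999999999999996, 0.2, 5

# ratio-based check, as currently applied for values that are not near zero
print(abs(1 - a / b) < 0.5 * 10.0 ** -decimal)           # True: the ratio is ~1 to 16 digits

# literal "compare the first 5 digits after the decimal point" reading
factor = 10.0 ** decimal
print(math.trunc(a * factor) == math.trunc(b * factor))  # False: 19999 vs 20000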
Any updates? It's been several releases, but the problem seems to persist.
Any updates?