Pandas: BUG: Summation of NaT in a DataFrame with axis=1 does not return NaT

Created on 12 Aug 2017 · 13 comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

import pandas as pd
import io

dat = """s1;s2
1000;2000
3000;1500
500"""

df = pd.read_csv(io.StringIO(dat), sep=";")
df.index = df.index + 1
df = df.apply(lambda x: pd.to_timedelta(x, unit='ms'))

df['laptime'] = df.sum(axis=1, skipna=False)

print(df)

Problem description

This code displays

               s1              s2                       laptime
1        00:00:01        00:00:02               0 days 00:00:03
2        00:00:03 00:00:01.500000        0 days 00:00:04.500000
3 00:00:00.500000             NaT -106752 days +00:12:43.645223
In [81]: df['laptime'][3]
Out[81]: Timedelta('-106752 days +00:12:43.645223')

Expected Output

               s1              s2                       laptime
1        00:00:01        00:00:02               0 days 00:00:03
2        00:00:03 00:00:01.500000        0 days 00:00:04.500000
3 00:00:00.500000             NaT                           NaT
In [81]: df['laptime'][3]
Out[81]: NaT

That's very strange, because pd.to_timedelta(500, unit='ms') + pd.NaT returns NaT (which is expected), but it doesn't work as expected when summing across the Series.
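The same mismatch can be reproduced without the CSV round trip (a minimal sketch mirroring the values above):

import pandas as pd

# minimal sketch mirroring the example above, without the CSV round trip
df = pd.DataFrame({
    's1': pd.to_timedelta([1000, 3000, 500], unit='ms'),
    's2': pd.to_timedelta([2000, 1500, None], unit='ms'),
}, index=[1, 2, 3])

print(df['s1'] + df['s2'])            # row 3 is NaT, as expected
print(df.sum(axis=1, skipna=False))   # row 3 is a large negative Timedelta on affected versions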

Output of pd.show_versions()

/Users/scls/anaconda/lib/python3.6/site-packages/xarray/core/formatting.py:16: FutureWarning: The pandas.tslib module is deprecated and will be removed in a future version.
from pandas.tslib import OutOfBoundsDatetime

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8

pandas: 0.20.1
pytest: 3.1.2
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: 0.9.5
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: 0.5.0

Labels: Bug, Missing-data, Numeric, Reductions, Timeseries

All 13 comments

you are specifying skipna=False, which is not the default, thus it is NOT skipping NaNs.

00:00:00.500000 + NaT should return NaT (not -106752 days +00:12:43.645223)

with skipna=True

In [110]: df.sum(axis=1, skipna=True)
Out[110]:
1          00:00:03
2   00:00:04.500000
3   00:00:00.500000
dtype: timedelta64[ns]

so whatever the skipna value is, I never get NaT, which is the expected result

In [8]: pd.Timedelta('00:00:00.500000') + pd.NaT 
Out[8]: NaT

again you are specifying skipna=True. This by definition skips NA (and NaT)

and if I don't skip NA (and NaT), i.e. skipna=False, I should get NaT for laptime 3... not -106752 days +00:12:43.645223

What does -106752 days +00:12:43.645223 mean?

According to the doc

skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA, the result
    will be NA

so with skipna=False I expect NA/null values not to be excluded and, as you point out, I expect NaT to be "absorbing" in the sum (see row 3)

               s1              s2                       laptime
   ...
3 00:00:00.500000             NaT -106752 days +00:12:43.645223

df.sum(axis=1, skipna=False) should return the same result as df['s1'] + df['s2'], which is not the case

@jreback - I think this is buggy; we're not respecting NaT semantics, but instead including the sentinel value in the sum.

In [3]: df['s2']
Out[3]: 
1          00:00:02
2   00:00:01.500000
3               NaT
Name: s2, dtype: timedelta64[ns]

In [4]: df['s2'].sum()
Out[4]: Timedelta('0 days 00:00:03.500000')

In [5]: df['s2'].sum(skipna=False)
Out[5]: Timedelta('-106752 days +00:12:46.645224')

In [6]: df['s2'].values.sum()
Out[6]: numpy.timedelta64('NaT','ns')

these are converted to i8 before adding (and then masked, which I guess is not happening here)
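To make the garbage value concrete, here is a rough illustration of that i8 sentinel arithmetic (a sketch; the exact repr can differ between versions):

import numpy as np
import pandas as pd

# NaT is stored as the int64 minimum ("iNaT") inside a timedelta64[ns] array
sentinel = np.iinfo(np.int64).min          # -9223372036854775808

# summing the raw i8 values without masking adds 500 ms (in ns) to that sentinel;
# reinterpreted as a timedelta, that is roughly the value shown in the report
print(pd.Timedelta(sentinel + 500_000_000, unit='ns'))
# approximately '-106752 days +00:12:43.645224'; the report shows ...645223
# because the affected code path summed through float64, losing precision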

Looks to be fixed on master now. Could use a regression test.

In [20]: df['s2'].sum(skipna=False)
Out[20]: NaT

In [22]: pd.__version__
Out[22]: '0.26.0.dev0+519.gdf2e0813e'
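A minimal sketch of the kind of regression test mentioned above (test name and layout are illustrative, not taken from the pandas test suite):

import pandas as pd

def test_timedelta_sum_skipna_false_is_nat():
    # Series.sum with skipna=False must propagate NaT instead of the sentinel
    ser = pd.Series(pd.to_timedelta([2000, 1500, None], unit='ms'))
    result = ser.sum(skipna=False)
    assert result is pd.NaT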


df["s2"].sum(skipna=False)
NaT

df["s1"] + df["s2"]
1          00:00:03
2   00:00:04.500000
3               NaT

df[["s1", "s2"]].sum(axis=1, skipna=False)
1                 0 days 00:00:03
2          0 days 00:00:04.500000
3   -106752 days +00:12:43.645223

pd.__version__
'0.26.0.dev0+734.g0de99558b'

The sum of s2 gives NaT, but sum(axis=1, skipna=False) still does not give NaT. I believe this still needs a fix?

Ah good catch @ainsleyto. Yes looks like axis=1 case still needs a fix.

df['s2'] works, I think, because that dispatches to the TimedeltaArray implementation, which I'm pretty sure the DataFrame version does not do (for either axis=0 or axis=1)
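A rough sketch of the dispatch difference being described, assuming the df from the original example:

arr = df['s2'].array                        # the backing TimedeltaArray
print(arr.sum(skipna=False))                # NaT -- the array reduction masks correctly

# the DataFrame reduction goes through nanops rather than the array method
print(df[['s1', 's2']].sum(axis=1, skipna=False))   # row 3 still shows the sentinel-based value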

From poking at this, it can be fixed by removing the following from nanops.nansum:

    if is_float_dtype(dtype):
        dtype_sum = dtype
-    elif is_timedelta64_dtype(dtype):
-        dtype_sum = np.float64

and adjusting nanops._wrap_results to not expect float64. The trouble is that doing so disables overflow checking.
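Until the axis=1 path is fixed, a user-side workaround on affected versions might look like this (a sketch assuming the s1/s2 columns from the example above):

import pandas as pd

cols = ['s1', 's2']
laptime = df[cols].sum(axis=1)                   # skipna=True sum of the available values
laptime[df[cols].isna().any(axis=1)] = pd.NaT    # force NaT where any input is missing
df['laptime'] = laptime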

Thanks @jbrockmendel and all for this fix
