import pandas as pd
import io
dat = """s1;s2
1000;2000
3000;1500
500"""
df = pd.read_csv(io.StringIO(dat), sep=";")
df.index = df.index + 1
df = df.apply(lambda x: pd.to_timedelta(x, unit='ms'))
df['laptime'] = df.sum(axis=1, skipna=False)
print(df)
This code displays
s1 s2 laptime
1 00:00:01 00:00:02 0 days 00:00:03
2 00:00:03 00:00:01.500000 0 days 00:00:04.500000
3 00:00:00.500000 NaT -106752 days +00:12:43.645223
In [81]: df['laptime'][3]
Out[81]: Timedelta('-106752 days +00:12:43.645223')
Expected output:
s1 s2 laptime
1 00:00:01 00:00:02 0 days 00:00:03
2 00:00:03 00:00:01.500000 0 days 00:00:04.500000
3 00:00:00.500000 NaT NaT
In [81]: df['laptime'][3]
Out[81]: NaT
That's very strange, because pd.to_timedelta(500, unit='ms') + pd.NaT returns NaT (which is expected), but it doesn't work as expected when summing Series.
Output of pd.show_versions():
/Users/scls/anaconda/lib/python3.6/site-packages/xarray/core/formatting.py:16: FutureWarning: The pandas.tslib module is deprecated and will be removed in a future version.
from pandas.tslib import OutOfBoundsDatetime
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8
pandas: 0.20.1
pytest: 3.1.2
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: 0.9.5
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: 0.5.0
you are specifying skipna=False, which is not the default, thus it is NOT skipping nans.
00:00:00.500000 + NaT should return NaT (not -106752 days +00:12:43.645223)
with skipna=True
In [110]: df.sum(axis=1, skipna=True)
Out[110]:
1 00:00:03
2 00:00:04.500000
3 00:00:00.500000
dtype: timedelta64[ns]
So whatever the value of skipna is, I never get NaT, which is the expected result:
In [8]: pd.Timedelta('00:00:00.500000') + pd.NaT
Out[8]: NaT
again you are specifying skipna=True. This by-definition skips NA (and NaT)
And if I don't skip NA (and NaT), i.e. skipna=False, I should get NaT for laptime 3, not -106752 days +00:12:43.645223.
What does -106752 days +00:12:43.645223 mean?
According to the doc:
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result
will be NA
So with skipna=False I expect NA/null values not to be excluded and, as you point out, I expect NaT to be "absorbing" (see row 3):
s1 s2 laptime
...
3 00:00:00.500000 NaT -106752 days +00:12:43.645223
df.sum(axis=1, skipna=False) should return the same result as df['s1'] + df['s2'], which is not the case.
@jreback - I think this is buggy: we're not respecting NaT semantics, but instead including the sentinel value in the sum.
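The expectation above can be checked directly. Elementwise addition of the two columns propagates NaT, so it can serve as a workaround for computing `laptime` on affected versions. A minimal sketch, reusing the data from the original report:

```python
import io
import pandas as pd

dat = """s1;s2
1000;2000
3000;1500
500"""
df = pd.read_csv(io.StringIO(dat), sep=";")
df.index = df.index + 1
df = df.apply(lambda x: pd.to_timedelta(x, unit="ms"))

# Elementwise addition: any row that contains NaT yields NaT.
df["laptime"] = df["s1"] + df["s2"]
print(df)
```

This sidesteps the reduction code path entirely, which is why it is unaffected by the skipna handling discussed here.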
In [3]: df['s2']
Out[3]:
1 00:00:02
2 00:00:01.500000
3 NaT
Name: s2, dtype: timedelta64[ns]
In [4]: df['s2'].sum()
Out[4]: Timedelta('0 days 00:00:03.500000')
In [5]: df['s2'].sum(skipna=False)
Out[5]: Timedelta('-106752 days +00:12:46.645224')
In [6]: df['s2'].values.sum()
Out[6]: numpy.timedelta64('NaT','ns')
These are converted to i8 before adding (and should then be masked, which is not happening, I guess).
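To illustrate where a value like -106752 days +00:12:43.645223 can come from, here is a small sketch that treats NaT as its raw int64 storage and adds 500 ms to it, as an unmasked integer sum would. The variable names are illustrative only. The last digits may differ slightly from the output in the report, likely because the reduction went through float64 there.

```python
import numpy as np
import pandas as pd

# View a NaT timedelta as its underlying int64 value (the sentinel).
nat_as_int = int(np.array([np.timedelta64("NaT", "ns")]).view("i8")[0])
print(nat_as_int)

# 500 ms expressed in nanoseconds, the unit of timedelta64[ns].
half_second_ns = 500 * 10**6

# Adding the sentinel as if it were an ordinary number gives a huge
# negative duration instead of NaT.
bogus = pd.Timedelta(nat_as_int + half_second_ns, unit="ns")
print(bogus)
```

The printed duration starts with -106752 days +00:12:43, matching the value reported for row 3.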
Looks to be fixed on master now. Could use a regression test.
In [20]: df['s2'].sum(skipna=False)
Out[20]: NaT
In [22]: pd.__version__
Out[22]: '0.26.0.dev0+519.gdf2e0813e'
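A regression test for the Series case might look roughly like the sketch below. The test name and data are hypothetical and not taken from the pandas test suite.

```python
import pandas as pd


def test_timedelta_series_sum_skipna_false():
    # Hypothetical test data mirroring column s2 from the report.
    ser = pd.Series(
        [pd.Timedelta(seconds=2), pd.Timedelta(milliseconds=1500), pd.NaT]
    )
    # Default skipna=True ignores NaT.
    assert ser.sum() == pd.Timedelta(milliseconds=3500)
    # skipna=False must propagate NaT rather than sum the sentinel.
    assert pd.isna(ser.sum(skipna=False))


test_timedelta_series_sum_skipna_false()
```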
'''
df["s2"].sum(skipna=False)
NaT
df["s1"]+df["s2"]
1 00:00:03
2 00:00:04.500000
3 NaT
df[["s1","s2"]].sum(axis=1,skipna=False)
1 0 days 00:00:03
2 0 days 00:00:04.500000
3 -106752 days +00:12:43.645223
pd.__version__
'0.26.0.dev0+734.g0de99558b'
'''
sum of s2 gives NaT but sum(axis=1,skipna=False) still does not give NaT. I believe this still needs a fix?
Ah good catch @ainsleyto. Yes looks like axis=1 case still needs a fix.
df['s2'] works, I think, because that dispatches to the TimedeltaArray implementation, which I'm pretty sure the DataFrame version does not do (for either axis=0 or axis=1).
From poking at this, it can be fixed by removing the following from nanops.nansum:
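A quick way to check the DataFrame code path on a given pandas version is to run the reduction along both axes and see where NaT appears. This is a sketch with made-up variable names. On a version that includes the fix, the assertions are expected to hold. On the versions discussed above, the axis=1 check would fail.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "s1": pd.to_timedelta([1000.0, 3000.0, 500.0], unit="ms"),
        "s2": pd.to_timedelta([2000.0, 1500.0, np.nan], unit="ms"),
    }
)

# Column-wise: only s2 contains NaT, so only its sum should be NaT.
by_column = df.sum(axis=0, skipna=False)
# Row-wise: only the last row contains NaT.
by_row = df.sum(axis=1, skipna=False)

print(by_column)
print(by_row)

assert by_column.isna().tolist() == [False, True]
assert by_row.isna().tolist() == [False, False, True]
```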
if is_float_dtype(dtype):
dtype_sum = dtype
- elif is_timedelta64_dtype(dtype):
- dtype_sum = np.float64
and adjusting nanops._wrap_results to not expect float64. The trouble is that doing so disables overflow checking.
Thanks @jbrockmendel and all for this fix