I have a shuffled series with a bunch of sine values in float16, like this:
tdata.time_sin
110405276 -0.183105
175560878 -0.301270
...
130331292 -0.158813
6782127 -0.282471
Name: time_sin, Length: 18490389, dtype: float16
There are no NaN values; everything is the sine of something:
tdata.time_sin[np.isnan(tdata.time_sin) == True].count()
0
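(A side note on the check itself: Series.count() only counts non-NaN entries, so filtering to the NaN positions and then calling .count() returns 0 even when NaNs are present. A more reliable check, which should also give 0 here:

tdata.time_sin.isna().sum()
)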
But for some reason, mean() chokes somewhere in the middle, as if it's overflowing:
tdata.time_sin.mean()
nan
tdata.time_sin[:328720].mean()
0.0
tdata.time_sin[:328721].mean()
nan
tdata.time_sin[328719:328722]
117467643 -0.639648
85318746 0.956055
10829780 0.112000
Name: time_sin, dtype: float16
And it works fine when converted to float32:
foo = tdata.time_sin.astype(np.float32)
foo.mean()
0.20143597
Is this weird or am I missing something about float16?
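For what it's worth, the breakpoint lines up with float16's range: np.finfo(np.float16).max is 65504, and 328,721 values averaging about 0.2 sum to roughly 66,000, just past that ceiling. A minimal standalone sketch of the effect, using synthetic data rather than the original series:

import numpy as np

print(np.finfo(np.float16).max)              # 65504.0
x = np.full(328721, 0.2, dtype=np.float16)
print(x.sum())                               # inf: the accumulator stays float16
print(x.astype(np.float32).sum())            # ~65728, no overflow
print(x.astype(np.float32).mean())           # ~0.2, as expected

Once the running sum hits inf, downstream arithmetic can surface as inf or nan depending on the code path; pandas' skipna mean reported nan in the transcript above.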
This behavior persists after pickling, reloading, and sorting by index, although now it chokes much earlier:
zzz = pickle.load(open('timesin.pkl', 'rb'))
bb = zzz.sort_index()
bb[:74351].mean()
-0.0
bb[:74352].mean()
nan
bb[74350:74355]
749371 -0.898438
749393 -0.898438
749432 -0.898438
749447 -0.898438
749479 -0.898438
Name: time_sin, dtype: float16
pd.show_versions()
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-119-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.22.0
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.14.2
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 6.3.1
sphinx: None
patsy: None
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
float16 is barely supported
you can have a look at improving things
well maybe this explains why my models are not training very well :)
I loaded this pickle on another machine; the issue repeats exactly.
float32 is quite well supported
closing as duplicate of #9220
I ran into the same problem when trying to reduce a DataFrame's memory usage by downcasting its data types.
The mean of an np.float16 column came out as NaN; after switching the dtype to np.float32, the problem was solved.
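For downcasting, float32 is usually the safer floor: it already halves memory versus float64, and its maximum (~3.4e38) leaves ample headroom for column-wide sums, while float16 tops out at 65504. A minimal sketch, with a hypothetical column x:

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.rand(1000000)})
print(df['x'].memory_usage())                 # float64 baseline
df['x'] = df['x'].astype(np.float32)          # halves memory; mean() stays safe
print(df['x'].memory_usage(), df['x'].mean())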
You have an overflow. Take the mean of a scaled series, (df[col] / n).mean() * n, where n is large enough.
To know how large n needs to be, compute the sum of the column once cast to float32 and compare it to the largest float16.
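A minimal sketch of that workaround on a synthetic series; n = 1024 is an illustrative choice (a power of two divides exactly in binary floating point, barring underflow, and 500000 / 1024 keeps the scaled sum far below float16's 65504):

import numpy as np
import pandas as pd

s = pd.Series(np.random.uniform(0, 1, 500000).astype(np.float16))
n = 1024                            # power of two: the division is exact
scaled_mean = (s / n).mean() * n    # scaled sum stays in float16 range
print(scaled_mean)                  # ~0.5
print(s.astype(np.float32).mean())  # reference value; should roughly agree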
@jfpuget You are correct! I was using float16, and while computing the mean, the sum of all the observations went out of range for float16. I changed the type to float64 and it works. Thanks!