Pandas: weird NaN in mean() of float16 series

Created on 9 Apr 2018 · 8 comments · Source: pandas-dev/pandas

I have a shuffled series with a bunch of sine values in float16, like this:

    tdata.time_sin
    110405276   -0.183105
    175560878   -0.301270
    ...
    130331292   -0.158813
    6782127     -0.282471
    Name: time_sin, Length: 18490389, dtype: float16

There are no NaN values; everything is the sine of something:

    np.isnan(tdata.time_sin).sum()
    0

But for some reason, mean() chokes somewhere in the middle, as if it's overflowing:

    tdata.time_sin.mean()
    nan

    tdata.time_sin[:328720].mean()
    0.0

    tdata.time_sin[:328721].mean()
    nan

    tdata.time_sin[328719:328722]
    117467643   -0.639648
    85318746     0.956055
    10829780     0.112000
    Name: time_sin, dtype: float16

And it works fine when converted to float32:

    foo = tdata.time_sin.astype(np.float32)
    foo.mean()
    0.20143597

Is this weird or am I missing something about float16?

The behavior persists after pickling, reloading, and sorting by index, although it now chokes much earlier:

    zzz = pickle.load(open('timesin.pkl', 'rb'))
    bb = zzz.sort_index()

    bb[:74351].mean()
    -0.0

    bb[:74352].mean()
    nan

    bb[74350:74355]
    749371   -0.898438
    749393   -0.898438
    749432   -0.898438
    749447   -0.898438
    749479   -0.898438
    Name: time_sin, dtype: float16
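
For reference, float16 tops out at 65504 (np.finfo(np.float16).max), so I suspect the running sum overflows long before the division happens; with mixed-sign data one partial sum can even hit +inf while another hits -inf, and inf + -inf is NaN. A minimal numpy-only sketch of the same effect, on synthetic data rather than my series:

    import numpy as np

    print(np.finfo(np.float16).max)    # 65504.0, the float16 ceiling

    x = np.full(200000, 0.5, dtype=np.float16)
    print(x.sum())                     # inf: the float16 accumulator overflows
    print(x.mean())                    # 0.5: numpy uses a float32 intermediate for float16 means
    print(x.astype(np.float32).sum())  # 100000.0, the true sum

numpy's own mean() survives because it promotes float16 to a float32 intermediate; pandas's reduction appears to keep float16 throughout, which would explain the NaN.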

Output of pd.show_versions()

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-119-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.14.2
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 6.3.1
sphinx: None
patsy: None
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Labels: Dtypes, Duplicate

All 8 comments

float16 is barely supported

You can have a look at improving things

Well, maybe this explains why my models are not training very well :)

I loaded this pickle on another machine, and the issue repeats exactly.

float32 is quite well supported

closing as duplicate of #9220

I met the same problem when trying to reduce the memory usage of a DataFrame by downcasting its data types.
The mean of an np.float16 column was NaN; after switching to np.float32, the problem was solved.
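
If the goal is shrinking memory, a safer route than a manual float16 cast may be pd.to_numeric with downcast='float', which never goes below float32. A small sketch with a made-up column:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'time_sin': np.sin(np.arange(1000000.0))})  # float64

    # downcast='float' picks the smallest float dtype but stops at float32,
    # halving memory vs. float64 without risking float16 overflow.
    df['time_sin'] = pd.to_numeric(df['time_sin'], downcast='float')
    print(df['time_sin'].dtype)   # float32
    print(df['time_sin'].mean())  # finite, as expected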

You have an overflow. Take the mean over a ratio, (df[col] / n).mean() * n, where n is large enough.

To know how large n needs to be, you can compute the sum of the column once cast to float32 and compare it to the largest float16.
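
A sketch of that workaround on synthetic data; sizing n as a power of two from the float32 sum of absolute values (a conservative bound on any partial sum) is one reasonable choice, not the only one:

    import numpy as np
    import pandas as pd

    s = pd.Series(np.sin(np.linspace(0, 1000, 500000)).astype(np.float16))

    # Bound the worst-case running sum in a wider type, then pick n so the
    # scaled sum stays below float16's ceiling of 65504. A power of two
    # keeps the division exact in binary floating point.
    bound = s.astype(np.float32).abs().sum()
    n = float(2 ** max(0, int(np.ceil(np.log2(bound / np.finfo(np.float16).max)))))

    print((s / n).mean() * n)   # close to the float32 mean, without overflow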

@jfpuget You are correct! I was using float16, and while computing the mean, the sum of all observations was out of range for float16. I changed the type to float64 and it's working. Thanks!
