I have a shuffled series with a bunch of sine values in float16, like this:
tdata.time_sin
110405276 -0.183105
175560878 -0.301270
...
130331292 -0.158813
6782127 -0.282471
Name: time_sin, Length: 18490389, dtype: float16
There are no NaN values; everything is the sine of something:
tdata.time_sin[np.isnan(tdata.time_sin) == True].count()
0
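(A side note on the check itself: Series.count() only counts non-NaN entries, so filtering to the NaN positions and then calling .count() returns 0 even when NaNs are present. A more reliable check, which should also give 0 here:

tdata.time_sin.isna().sum()
)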
But for some reason, mean() chokes somewhere in the middle, as if it's overflowing:
tdata.time_sin.mean()
nan
tdata.time_sin[:328720].mean()
0.0
tdata.time_sin[:328721].mean()
nan
tdata.time_sin[328719:328722]
117467643 -0.639648
85318746 0.956055
10829780 0.112000
Name: time_sin, dtype: float16
And it works fine when converted to float32:
foo = tdata.time_sin.astype(np.float32)
foo.mean()
0.20143597
Is this weird or am I missing something about float16?
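For what it's worth, the breakpoint lines up with float16's range: np.finfo(np.float16).max is 65504, and 328,721 values averaging about 0.2 sum to roughly 66,000, just past that ceiling. A minimal standalone sketch of the effect, using synthetic data rather than the original series:

import numpy as np

print(np.finfo(np.float16).max)              # 65504.0
x = np.full(328721, 0.2, dtype=np.float16)
print(x.sum())                               # inf: the accumulator stays float16
print(x.astype(np.float32).sum())            # ~65728, no overflow
print(x.astype(np.float32).mean())           # ~0.2, as expected

Once the running sum hits inf, downstream arithmetic can surface as inf or nan depending on the code path; pandas' skipna mean reported nan in the transcript above.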
This behavior persists after pickling, reloading, and sorting by index, although now it chokes much earlier:
zzz = pickle.load(open('timesin.pkl', 'rb'))
bb = zzz.sort_index()
bb[:74351].mean()
-0.0
bb[:74352].mean()
nan
bb[74350:74355]
749371 -0.898438
749393 -0.898438
749432 -0.898438
749447 -0.898438
749479 -0.898438
Name: time_sin, dtype: float16
pd.show_versions()
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-119-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.22.0
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.14.2
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 6.3.1
sphinx: None
patsy: None
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
float16 is barely supported
you can have a look at improving things
well maybe this explains why my models are not training very well :)
I loaded this pickle on another machine; the issue repeats exactly.
float32 is quite well supported
closing as duplicate of #9220
I ran into the same problem when trying to reduce a DataFrame's memory usage by downcasting its data types.
The mean of an np.float16 column came out as NaN; after switching the dtype to np.float32, the problem was solved.
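For downcasting, float32 is usually the safer floor: it already halves memory versus float64, and its maximum (~3.4e38) leaves ample headroom for column-wide sums, while float16 tops out at 65504. A minimal sketch, with a hypothetical column x:

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.rand(1000000)})
print(df['x'].memory_usage())                 # float64 baseline
df['x'] = df['x'].astype(np.float32)          # halves memory; mean() stays safe
print(df['x'].memory_usage(), df['x'].mean())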
You have an overflow. Take the mean of a scaled series, (df[col] / n).mean() * n, where n is large enough.
To know how large n needs to be, compute the sum of the column once cast to float32 and compare it to the largest float16.
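A minimal sketch of that workaround on a synthetic series; n = 1024 is an illustrative choice (a power of two divides exactly in binary floating point, barring underflow, and 500000 / 1024 keeps the scaled sum far below float16's 65504):

import numpy as np
import pandas as pd

s = pd.Series(np.random.uniform(0, 1, 500000).astype(np.float16))
n = 1024                            # power of two: the division is exact
scaled_mean = (s / n).mean() * n    # scaled sum stays in float16 range
print(scaled_mean)                  # ~0.5
print(s.astype(np.float32).mean())  # reference value; should roughly agree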
@jfpuget You are correct! I was using float16, and while computing the mean, the sum of all the observations went out of range for float16. I changed the type to float64 and it works. Thanks!