Below you will find a Minimal Working Verifiable Example (MWVE) that reproduces the behaviour described in this issue:
import pandas as pd
# JSON Dump for MWVE:
txt = """[{"channelid":5069,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Cu","timestamp":1514764800000,"userfloatvalue":null},{"channelid":5069,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Cu","timestamp":1514851200000,"userfloatvalue":null},{"channelid":5069,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Cu","timestamp":1514937600000,"userfloatvalue":null},{"channelid":5069,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Cu","timestamp":1515024000000,"userfloatvalue":null},{"channelid":5119,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Cu","timestamp":1514764800000,"userfloatvalue":null},{"channelid":5119,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Cu","timestamp":1514851200000,"userfloatvalue":null},{"channelid":5119,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Cu","timestamp":1514937600000,"userfloatvalue":null},{"channelid":5119,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Cu","timestamp":1515024000000,"userfloatvalue":null},{"channelid":5120,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Pb","timestamp":1514764800000,"userfloatvalue":null},{"channelid":5120,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Pb","timestamp":1514851200000,"userfloatvalue":null},{"channelid":5120,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Pb","timestamp":1514937600000,"userfloatvalue":null},{"channelid":5120,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Pb","timestamp":1515024000000,"userfloatvalue":null},{"channelid":5233,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Pb","timestamp":1514764800000,"userfloatvalue":null},{"channelid":5233,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Pb","timestamp":1514851200000,"userfloatvalue":null},{"channelid":5233,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Pb","timestamp":1514937600000,"userfloatvalue":null},{"channelid":5233,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Pb","timestamp":1515024000000,"userfloatvalue":null}]"""
# Load Data:
df = pd.read_json(txt)
# Filling NaN with a string keeps the expected columns but changes the column dtypes:
cross2 = df.pivot_table(index="timestamp", columns=["channelid", "networkkey", "sitekey", "measurandkey"], values="userfloatvalue", aggfunc="first", fill_value="nodata")
# Pivoting with a MultiIndex while keeping the all-NaN columns (dropna=False) produces every combination of the level values:
cross3 = df.pivot_table(index="timestamp", columns=["channelid", "networkkey", "sitekey", "measurandkey"], values="userfloatvalue", aggfunc="first", dropna=False)
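# Comparing shapes illustrates the issue: cross2 keeps only the 4 column
# combinations present in the data, while cross3 is expanded to the full
# Cartesian product of the level values (16 columns on this MWVE):
print(cross2.shape)  # (4, 4)
print(cross3.shape)  # (4, 16)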
The trial input (df) looks like:
| | channelid | measurandkey | networkkey | sitekey | timestamp | userfloatvalue |
|---|---|---|---|---|---|---|
| 0 | 5069 | Cu | HMT | 01MEU1 | 2018-01-01 | NaN |
| 1 | 5069 | Cu | HMT | 01MEU1 | 2018-01-02 | NaN |
| 2 | 5069 | Cu | HMT | 01MEU1 | 2018-01-03 | NaN |
| 3 | 5069 | Cu | HMT | 01MEU1 | 2018-01-04 | NaN |
| 4 | 5119 | Cu | HMT | 01AND3 | 2018-01-01 | NaN |
| 5 | 5119 | Cu | HMT | 01AND3 | 2018-01-02 | NaN |
| 6 | 5119 | Cu | HMT | 01AND3 | 2018-01-03 | NaN |
| 7 | 5119 | Cu | HMT | 01AND3 | 2018-01-04 | NaN |
| 8 | 5120 | Pb | HMT | 01MEU1 | 2018-01-01 | NaN |
| 9 | 5120 | Pb | HMT | 01MEU1 | 2018-01-02 | NaN |
| 10 | 5120 | Pb | HMT | 01MEU1 | 2018-01-03 | NaN |
| 11 | 5120 | Pb | HMT | 01MEU1 | 2018-01-04 | NaN |
| 12 | 5233 | Pb | HMT | 01AND3 | 2018-01-01 | NaN |
| 13 | 5233 | Pb | HMT | 01AND3 | 2018-01-02 | NaN |
| 14 | 5233 | Pb | HMT | 01AND3 | 2018-01-03 | NaN |
| 15 | 5233 | Pb | HMT | 01AND3 | 2018-01-04 | NaN |
The misbehaving output (cross3) looks like:
| channelid | 5069 | 5069 | 5069 | 5069 | 5119 | 5119 | 5119 | 5119 | 5120 | 5120 | 5120 | 5120 | 5233 | 5233 | 5233 | 5233 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| networkkey | HMT | HMT | HMT | HMT | HMT | HMT | HMT | HMT | HMT | HMT | HMT | HMT | HMT | HMT | HMT | HMT |
| sitekey | 01AND3 | 01AND3 | 01MEU1 | 01MEU1 | 01AND3 | 01AND3 | 01MEU1 | 01MEU1 | 01AND3 | 01AND3 | 01MEU1 | 01MEU1 | 01AND3 | 01AND3 | 01MEU1 | 01MEU1 |
| measurandkey | Cu | Pb | Cu | Pb | Cu | Pb | Cu | Pb | Cu | Pb | Cu | Pb | Cu | Pb | Cu | Pb |
| timestamp | ||||||||||||||||
| 2018-01-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2018-01-02 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2018-01-03 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2018-01-04 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
The expected output is similar to cross2, but with NaN values instead of the "nodata" string. For reference, cross2 looks like:
| channelid | 5069 | 5119 | 5120 | 5233 |
|---|---|---|---|---|
| networkkey | HMT | HMT | HMT | HMT |
| sitekey | 01MEU1 | 01AND3 | 01MEU1 | 01AND3 |
| measurandkey | Cu | Cu | Pb | Pb |
| timestamp | ||||
| 2018-01-01 | nodata | nodata | nodata | nodata |
| 2018-01-02 | nodata | nodata | nodata | nodata |
| 2018-01-03 | nodata | nodata | nodata | nodata |
| 2018-01-04 | nodata | nodata | nodata | nodata |
What I need, and what seems to be the problem:
The problem is the creation of every combination of the level values (instead of keeping only the existing ones), which drastically increases memory usage without necessity: those combinations are not present in the original data.
Maybe it is a bug, maybe it is the designed behaviour. I just wanted to point it out because it surprised me, and I am now looking for a clean way to work around this behaviour.
How I found it:
I first hit a MemoryError with fairly small queries (about 1000 rows and 25 channels), then I reduced the number of rows and columns, and finally dumped the data to JSON to produce the MWVE above.
To my understanding, the following command:
cross3 = df.pivot_table(index="timestamp", columns=["channelid", "networkkey", "sitekey", "measurandkey"], values="userfloatvalue", aggfunc="first", dropna=False)
Should return the same as:
cross2 = df.pivot_table(index="timestamp", columns=["channelid", "networkkey", "sitekey", "measurandkey"], values="userfloatvalue", aggfunc="first", fill_value="nodata")
cross2bis = cross2.replace('nodata', float('nan'))
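# Sanity check (a sketch of what I would expect to hold): with the current
# behaviour this assertion fails, because cross3 carries the 16
# Cartesian-product columns instead of the 4 combinations present in the data.
pd.testing.assert_frame_equal(cross3, cross2bis, check_dtype=False)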
That is, a small DataFrame with no extra columns and the NaN values not dropped; it should look like:
| channelid | 5069 | 5119 | 5120 | 5233 |
|---|---|---|---|---|
| networkkey | HMT | HMT | HMT | HMT |
| sitekey | 01MEU1 | 01AND3 | 01MEU1 | 01AND3 |
| measurandkey | Cu | Cu | Pb | Pb |
| timestamp | ||||
| 2018-01-01 | NaN | NaN | NaN | NaN |
| 2018-01-02 | NaN | NaN | NaN | NaN |
| 2018-01-03 | NaN | NaN | NaN | NaN |
| 2018-01-04 | NaN | NaN | NaN | NaN |
All of this without generating combinations of level values that do not exist in the input data.
This would also prevent raising a MemoryError for reasonable amounts of data.
Output of pd.show_versions():
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-75-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.4.0
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 5.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.1.9
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None
Diving into the pandas source code, I found why it behaves as described above.
In the DataFrame class definition there is a method bound to the pivot_table function.
In that function definition, searching for the dropna switch shows that a Cartesian product of the level values is built.
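Roughly, that block (paraphrased from the 0.20.x pivot_table implementation, so names and details are approximate) looks like:

if not dropna:
    # rows: reindex against the full Cartesian product of the index level values
    m = MultiIndex.from_arrays(cartesian_product(table.index.levels),
                               names=table.index.names)
    table = table.reindex_axis(m, axis=0)

    # columns: same thing, every combination of the column level values is
    # materialised, even those absent from the original data
    m = MultiIndex.from_arrays(cartesian_product(table.columns.levels),
                               names=table.columns.names)
    table = table.reindex_axis(m, axis=1)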
And at the very end of the function, columns full of NaN are purged.
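Again paraphrased (approximate):

if isinstance(table, DataFrame) and dropna:
    # drop columns that are entirely NaN; keeping them requires dropna=False,
    # which triggers the Cartesian product above
    table = table.dropna(how='all', axis=1)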
After reading these portions of code, I can tell that it is not possible to both keep the all-NaN columns and avoid increasing the cardinality of the MultiIndex with the dropna switch, because of the Cartesian product part.
Questions:
Fix proposal:
Actually, what I need is to inhibit the final dropna without triggering the Cartesian product, and then I guess it is done.
Maybe a finer condition (or an extra switch) in the Cartesian product part would do the trick.
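In the meantime, a workaround that stays on top of the public API is to pivot with a sentinel fill value and restore NaN afterwards, which is essentially the cross2/cross2bis trick above wrapped in a helper (just a sketch; the helper name is mine, and the result ends up with object dtype because of the string sentinel):

import numpy as np

def pivot_table_keep_all_nan(data, index, columns, values, aggfunc="first"):
    # the sentinel fills the all-NaN cells before the final all-NaN column
    # purge, so the columns present in the data survive without building the
    # Cartesian product of the level values
    sentinel = "__nodata__"
    table = data.pivot_table(index=index, columns=columns, values=values,
                             aggfunc=aggfunc, fill_value=sentinel)
    return table.replace(sentinel, np.nan)

cross4 = pivot_table_keep_all_nan(
    df, index="timestamp",
    columns=["channelid", "networkkey", "sitekey", "measurandkey"],
    values="userfloatvalue")

On the MWVE above, cross4 should match the expected output shown earlier (apart from the dtype).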
@jlandercy thanks for diving in!
My first reflex was to say that this is due to limitations of unstack dealing with existing NaNs vs introduced NaNs due to unstacking (you can search for 'unstack dropna' in the issues to see some related discussions).
But actually, it seems you can get your desired result (I think) with the underlying groupby + unstack:
In [83]: df.groupby(["timestamp", "channelid", "networkkey", "sitekey", "measurandkey"]).agg('first').unstack([1,2,3,4])
Out[83]:
userfloatvalue
channelid 5069 5119 5120 5233
networkkey HMT HMT HMT HMT
sitekey 01MEU1 01AND3 01MEU1 01AND3
measurandkey Cu Cu Pb Pb
timestamp
2018-01-01 NaN NaN NaN NaN
2018-01-02 NaN NaN NaN NaN
2018-01-03 NaN NaN NaN NaN
2018-01-04 NaN NaN NaN NaN
Is it correct that this is what you want?
Dear @jorisvandenbossche,
Yes this is the desired output. Thank you for pointing out this alternative of fill/replace.
Unfortunately, this will not fit into my project as is, simply because I use pivot_table extensively with many different parametrizations. Your solution would introduce a new dataflow that I would have to handle in parallel with pivot_table, instead of simply setting its parameters properly. Or maybe I will have to redesign my code to cope with this issue, but I feel odd about that for the moment. I do like the simplicity of the pivot_table method.
What I do not understand is: why is there a Cartesian product over the level values of index and columns? Do you have any idea? What could be the reason for, or usage of, this feature? I am really intrigued by this, and I find it a little counter-intuitive.
Best regards,
Yes, I hit this problem today. Boy -- So annoying.
We use pivot_table everywhere and this behavior is very limiting.
Is there any ETA for this fix?
Good job by OP in creating a reproducible case... Great job, mate!
The ETA is when someone submits a patch; how about it, @Sarnath?
Challenge accepted. I have no idea about Pandas development. Let me know how to go about it. I will get started! Cheers!
https://pandas.pydata.org/pandas-docs/stable/contributing.html
I will go with this. Let me know if something else is expected.
The docs at /stable are a bit out of date. Try
http://pandas-docs.github.io/pandas-docs-travis/development/contributing.html
Sure, thanks! I will start in a week. I hope I can find a place where I can sync with other developers. I will read the guide first. Thanks for your time!
Any news on this?
Still open, and will be closed when it's fixed. @marchezinixd are you interested in working on it?
Is there any update on this? Can the cartesian product be turned into a generator so that the memory allocation won't fail? Also, is the reindexing at that point necessary when dropna is False?