Pandas: What to use instead of pd.read_msgpack/df.to_msgpack

Created on 2 Aug 2019 · 14 comments · Source: pandas-dev/pandas

FutureWarning: to_msgpack is deprecated and will be removed in a future version.
It is recommended to use pyarrow for on-the-wire transmission of pandas objects.

Is there a link/pointer for how to do this?

All 14 comments

From the Apache Arrow docs:

import pyarrow as pa
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
# Convert from pandas to Arrow
table = pa.Table.from_pandas(df)
# Convert back to pandas
df_new = table.to_pandas()

# Infer Arrow schema from pandas
schema = pa.Schema.from_pandas(df)

I think Flight is the recommendation. I'm sure they would appreciate documentation on how that can be done, if you're interested in writing it up. Some of it might be appropriate for inclusion in the pandas docs too.

Oh, and I'm not sure what specifically we should do about the error message. If we had a docs section on using Flight, it could be updated. Need to write the docs first though.

@jreback @TomAugspurger
I have been using to_msgpack/read_msgpack for quite some time to read large DataFrames (in the ~1-2 GB range) from filers. In my experience, if you are opening and reading DataFrames over network-shared disks/filers, read_msgpack is at least 2-3x faster than any other I/O format pandas currently has: HDF5, pickle, Parquet, etc.

I think this may be because msgpack deserializes into pandas columns as it reads over the network, saving time compared to a copy-then-deserialize approach.

I would love to keep this capability in pandas. As of pandas 0.25.0 I don't see a direct Arrow/Flight replacement that takes a file path and writes/reads data as efficiently to/from network storage.

Maintaining the msgpack code was proving to be not worth the effort for us. You're welcome to take over maintenance in a separate project, but I don't think it'll live in pandas itself.


There is actually a pandas-msgpack repo/package available; it just needs updating.

Arrow IPC should actually be more efficient than this, as there are no copies.

@jreback
Thanks Jeff. Glad to see the repo exists! Will the pandas-msgpack repo get the latest pandas msgpack code after its deprecation? It looks like this could be the perfect place to preserve this code for use cases with simple numeric and string column types.

if someone does this sure

I've hacked away the bitrot in the pandas-msgpack package to work with pandas 0.25.0 at least: https://github.com/TyberiusPrime/pandas-msgpack. No promises; this is a mostly mechanical, didn't-stop-to-understand-the-code kind of hack, but it does read my msgpacks and passes the tests that don't involve datetimes or sparse arrays.

(I need msgpack because we store tuple columns in our dataframes, and pyarrow doesn't support that.)

Can anyone leave an example of how they've used pyarrow to replace read_msgpack?

Example:

import pandas as pd
import pyarrow as pa

data = b'some kind of dataframe object converted to bytes'

pd.read_msgpack(data) # Returns a df

pa.???.to_dataframe()  # To read byte object into pandas dataframe

@nathanchiu34

Try this:

import pyarrow as pa

save_file = "/path/to/save/filename.pyarrow"
df = dataframe_you_want_to_save

with open(save_file, "wb") as file:
    # use pa.serialize(...).to_buffer() rather than DataFrame.to_feather()
    # to avoid not-implemented type errors such as:
    # "ArrowNotImplementedError: list<item: double>"
    pyarrow_dumped = pa.serialize(df).to_buffer()
    file.write(pyarrow_dumped)

with open(save_file, "rb") as file:
    # read the dumped bytes back
    pyarrow_dumped = file.read()

# deserialize back into a DataFrame
result = pa.deserialize(pyarrow_dumped)

# show the result (display() is available in IPython/Jupyter)
display(result)

I would also like to keep the to_msgpack and read_msgpack functions. As others have pointed out, there is no real replacement when you consider reading and writing speed as well as file size.

I loved using msgpack. Sorry to see it go. I did a study last year comparing all the pandas serialization options and found that msgpack had the best bang for the buck. I will now reverentially refer to pandas 0.25 as the "python 2.7 of pandas".
