Pandas: Cannot write partitioned parquet file to S3

Created on 26 Jul 2019 · 9 comments · Source: pandas-dev/pandas

Apologies if this is a pyarrow issue.

Code Sample, a copy-pastable example if possible

pd.DataFrame({'a': range(5), 'b': range(5)}).to_parquet('s3://mybucket', partition_cols=['b'])

Problem description

Fails with AttributeError: 'NoneType' object has no attribute '_isfilestore'

Traceback (most recent call last):
  File "/python/partparqs3.py", line 8, in <module>
    pd.DataFrame({'a': range(5), 'b': range(5)}).to_parquet('s3://mybucket', partition_cols=['b'])
  File "/python/lib/python3.7/site-packages/pandas/core/frame.py", line 2203, in to_parquet
    partition_cols=partition_cols, **kwargs)
  File "/python/lib/python3.7/site-packages/pandas/io/parquet.py", line 252, in to_parquet
    partition_cols=partition_cols, **kwargs)
  File "/python/lib/python3.7/site-packages/pandas/io/parquet.py", line 118, in write
    partition_cols=partition_cols, **kwargs)
  File "/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1342, in write_to_dataset
    _mkdir_if_not_exists(fs, root_path)
  File "/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1292, in _mkdir_if_not_exists
    if fs._isfilestore() and not fs.exists(path):
AttributeError: 'NoneType' object has no attribute '_isfilestore'
Exception ignored in: <function AbstractBufferedFile.__del__ at 0x7f529985ca60>
Traceback (most recent call last):
  File "/python/lib/python3.7/site-packages/fsspec/spec.py", line 1146, in __del__
    self.close()
  File "/python/lib/python3.7/site-packages/fsspec/spec.py", line 1124, in close
    self.flush(force=True)
  File "/python/lib/python3.7/site-packages/fsspec/spec.py", line 996, in flush
    self._initiate_upload()
  File "/python/lib/python3.7/site-packages/s3fs/core.py", line 941, in _initiate_upload
    Bucket=bucket, Key=key, ACL=self.acl)
  File "/python/lib/python3.7/site-packages/s3fs/core.py", line 928, in _call_s3
    **kwargs)
  File "/python/lib/python3.7/site-packages/s3fs/core.py", line 182, in _call_s3
    return method(**additional_kwargs)
  File "/python/lib/python3.7/site-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/python/lib/python3.7/site-packages/botocore/client.py", line 648, in _make_api_call
    operation_model, request_dict, request_context)
  File "/python/lib/python3.7/site-packages/botocore/client.py", line 667, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/python/lib/python3.7/site-packages/botocore/endpoint.py", line 102, in make_request
    return self._send_request(request_dict, operation_model)
  File "/python/lib/python3.7/site-packages/botocore/endpoint.py", line 137, in _send_request
    success_response, exception):
  File "/python/lib/python3.7/site-packages/botocore/endpoint.py", line 231, in _needs_retry
    caught_exception=caught_exception, request_dict=request_dict)
  File "/python/lib/python3.7/site-packages/botocore/hooks.py", line 356, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/python/lib/python3.7/site-packages/botocore/hooks.py", line 228, in emit
    return self._emit(event_name, kwargs)
  File "/python/lib/python3.7/site-packages/botocore/hooks.py", line 211, in _emit
    response = handler(**kwargs)
  File "/python/lib/python3.7/site-packages/botocore/retryhandler.py", line 183, in __call__
    if self._checker(attempts, response, caught_exception):
  File "/python/lib/python3.7/site-packages/botocore/retryhandler.py", line 251, in __call__
    caught_exception)
  File "/python/lib/python3.7/site-packages/botocore/retryhandler.py", line 269, in _should_retry
    return self._checker(attempt_number, response, caught_exception)
  File "/python/lib/python3.7/site-packages/botocore/retryhandler.py", line 317, in __call__
    caught_exception)
  File "/python/lib/python3.7/site-packages/botocore/retryhandler.py", line 223, in __call__
    attempt_number, caught_exception)
  File "/python/lib/python3.7/site-packages/botocore/retryhandler.py", line 359, in _check_caught_exception
    raise caught_exception
  File "/python/lib/python3.7/site-packages/botocore/endpoint.py", line 200, in _do_get_response
    http_response = self._send(request)
  File "/python/lib/python3.7/site-packages/botocore/endpoint.py", line 244, in _send
    return self.http_session.send(request)
  File "/python/lib/python3.7/site-packages/botocore/httpsession.py", line 294, in send
    raise HTTPClientError(error=e)
botocore.exceptions.HTTPClientError: An HTTP Client raised and unhandled exception: 'NoneType' object is not iterable

Expected Output

Expected to see partitioned data show up in S3.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-957.21.3.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: None
pip: 19.0.3
setuptools: 41.0.0
Cython: 0.29.7
numpy: 1.16.2
scipy: 1.3.0
pyarrow: 0.14.0
xarray: None
IPython: 7.5.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.3.3
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: 0.3.0
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Labels: Bug, IO Parquet


All 9 comments

Can you verify that the path we pass to write_to_dataset in

  File "/python/lib/python3.7/site-packages/pandas/io/parquet.py", line 252, in to_parquet
    partition_cols=partition_cols, **kwargs)
  File "/python/lib/python3.7/site-packages/pandas/io/parquet.py", line 118, in write
    partition_cols=partition_cols, **kwargs)
  File "/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1342, in write_to_dataset
    _mkdir_if_not_exists(fs, root_path)

is correct? pyarrow may want a FileSystem-type thing.
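
One way to check what actually reaches pyarrow is to wrap write_to_dataset before calling to_parquet. This is only a debugging sketch; the bucket name is the placeholder from the original report:

import pandas as pd
import pyarrow.parquet as pq

_orig_write_to_dataset = pq.write_to_dataset

def _spy(table, root_path, *args, **kwargs):
    # print what pandas hands to pyarrow before delegating to the real function
    print('root_path:', repr(root_path))
    print('filesystem:', kwargs.get('filesystem'))
    return _orig_write_to_dataset(table, root_path, *args, **kwargs)

pq.write_to_dataset = _spy
pd.DataFrame({'a': range(5), 'b': range(5)}).to_parquet('s3://mybucket', partition_cols=['b'])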

It sounds like usage of s3fs should largely be replaced with fsspec. Can somebody confirm that is true? I think the fix here probably involves some cleanup in io/parquet.py related to that, but there may already be plans in progress.

fsspec is a dependency of s3fs. It provides the backend-agnostic parts shared by various filesystem-like packages. s3fs is still the only relevant dependency for pandas.
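
For what it's worth, the relationship is easy to see from the class hierarchy; a tiny check along these lines (nothing here is pandas-specific):

import fsspec
import s3fs

# s3fs implements the generic fsspec filesystem interface for S3
print(issubclass(s3fs.S3FileSystem, fsspec.AbstractFileSystem))  # expected: True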


@TomAugspurger @cottrell Is this fixed? What's the workaround? Please help.

@getsanjeevdubey I think this is still open. For now you can write the partitioned dataset to local disk and upload the files to S3 manually, as in the sketch below.
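
A rough sketch of that manual route, assuming s3fs is available for the upload (the local directory and bucket names are placeholders):

import pandas as pd
import s3fs

df = pd.DataFrame({'a': range(5), 'b': range(5)})

# write the partitioned dataset to a local directory first
df.to_parquet('local_dataset', partition_cols=['b'])

# then copy the whole directory tree up to S3
fs = s3fs.S3FileSystem()
fs.put('local_dataset', 'mybucket/local_dataset', recursive=True)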

Writing partitioned parquet to S3 is still an issue with Pandas 1.0.1, pyarrow 0.16, and s3fs 0.4.

@TomAugspurger the root_path passed to write_to_dataset looks like <File-like object S3FileSystem, mybucket>.

@getsanjeevdubey you can work around this by giving PyArrow an S3FileSystem directly:

import pandas as pd
import pyarrow
import pyarrow.parquet as pq
import s3fs

# placeholders: 'dataframe' is your data, 'mybucket' is the target bucket/prefix
dataframe = pd.DataFrame({'a': range(5), 'b': range(5)})
s3bucket = 'mybucket'

pq.write_to_dataset(pyarrow.Table.from_pandas(dataframe), s3bucket,
                    filesystem=s3fs.S3FileSystem(), partition_cols=['b'])

Of course you'll have to special-case this for S3 paths vs. other destinations for .to_parquet().
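
A small wrapper can do that dispatch in user code until pandas handles it; the helper name here is made up for illustration:

import pandas as pd
import pyarrow
import pyarrow.parquet as pq
import s3fs

def write_partitioned(df, path, partition_cols):
    # hypothetical helper: route s3:// destinations through pyarrow + s3fs,
    # let pandas handle everything else as usual
    if path.startswith('s3://'):
        pq.write_to_dataset(pyarrow.Table.from_pandas(df),
                            path.replace('s3://', '', 1),
                            filesystem=s3fs.S3FileSystem(),
                            partition_cols=partition_cols)
    else:
        df.to_parquet(path, partition_cols=partition_cols)

write_partitioned(pd.DataFrame({'a': range(5), 'b': range(5)}), 's3://mybucket', ['b'])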

Can you verify that the path we pass to write_to_dataset in

  File "/python/lib/python3.7/site-packages/pandas/io/parquet.py", line 252, in to_parquet
    partition_cols=partition_cols, **kwargs)
  File "/python/lib/python3.7/site-packages/pandas/io/parquet.py", line 118, in write
    partition_cols=partition_cols, **kwargs)
  File "/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1342, in write_to_dataset
    _mkdir_if_not_exists(fs, root_path)

is correct? pyarrow may want a FileSystem-type thing.

Yep, looks like this is exactly the problem. Should be fixed after https://github.com/pandas-dev/pandas/pull/33632, provided the filesystem kwarg is passed.

closed by #33632
