ref: https://github.com/pandas-dev/pandas/issues/16135 https://github.com/pandas-dev/pandas/pull/16150
s3_object = client.get_object(Bucket=bucket, Key=key)
result = read_csv(s3_object["Body"])
# ValueError: Invalid file path or buffer object type: <class 'botocore.response.StreamingBody'>
issue: is_file_like requirements are too strict for boto3 S3 objects https://github.com/pandas-dev/pandas/issues/16135
pull request: https://github.com/pandas-dev/pandas/pull/16150
This issue was closed but it's not working. I've found that the test is skipped and has a bug.
I'll send a pull request to reproduce the behavior.
I think it's because botocore.response.StreamingBody
doesn't have __iter__
so is_file_like
returns False
.
commit: c8dcf194c7a5264ed6d9161d5807e9da972cab4c
python: 2.7.10.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL:
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.21.0.dev+317.gc8dcf19
pytest: 3.1.3
pip: 7.1.2
setuptools: 36.2.6
Cython: 0.26
numpy: 1.13.1
scipy: None
pyarrow: None
xarray: None
IPython: 5.0.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: 0.1.2
pandas_gbq: None
pandas_datareader: None
cc @gfyoung
@uiureo : Thanks for the report! From our conversation back in that issue, we got the impression that the PR would address this issue. Network tests are more difficult to reproduce because they are out of our control, but that's a little unfortunate that the purported patch failed to address this.
One issue currently is that we are facing this delicate balancing act between accepting as many "valid" IO objects as possible without allowing for invalid ones. We check for __iter__
because of #15337, as that is a robust method of rejecting such objects.
For your patch to be robust, you will need to find a way that rejects mock.Mock()
while allowing for StreamingBody
. That would be fantastic if you could!
This issue would be resolved if this PR https://github.com/boto/botocore/pull/1195 is merged at botocore.
@uiureo : Thanks for letting us know! That would definitely be one way to resolve this issue.
It's been widely accepted that as long as an object has read or write function it can be seen as a file-like object.
Much more importantly, this is the behavior before pandas 0.20.*.
A lot of codes relying on this assumption now stop working.
I'm wondering what is the point to re-define what is file-like here? Is the __iter__ actually used in the implementation of read_csv for example?
This is almost 5 month old still open, however it seems neither 0.21.0 or 0.22.0 lets you use the boto3
Is it a design decision to not include it at this point or will we see this capability in future versions ?
we let鈥檚 see we have 2200 issues and only a small number of volunteers - how shall things be prioritized?
submitting a PR would be the fastest way here
I can now confirm that this is no longer an issue with a recent boto3 (the abovementioned PR has been merged). Since boto3 is not a requirement for pandas, there is nothing to done here, is there?
In [16]: obj=client.get_object(Bucket='my-bucket', Key='my/data/foo.csv')
In [17]: obj['Body']
Out[17]: <botocore.response.StreamingBody at 0x7f26ac400d30>
In [18]: df=pandas.read_csv(obj['Body'])
In [19]: len(df)
Out[19]: 7865
@kokes : That's great to hear! Might you be able to construct a unit test for us?
@gfyoung I tried - it's my first code PR, so I'm not sure about things like adding dev dependencies (should I edit all the travis/circleci metadata as well?). Also, not sure if I can just have a test without asserts, since all I'm testing is that it doesn't fail, so any exception should break the test.
Running this with botocore older than 1.10.47 will result in an error, as expected.
if not is_file_like(filepath_or_buffer):
msg = "Invalid file path or buffer object type: {_type}"
> raise ValueError(msg.format(_type=type(filepath_or_buffer)))
E ValueError: Invalid file path or buffer object type: <class 'botocore.response.StreamingBody'>
pandas/io/common.py:232: ValueError
================================================== 1 failed, 1 passed in 0.36 seconds ===================================================
root@114ec59e9d98:/pandas#
@kokes : I would suggest writing the test first and don't worry about the dependencies part yet. We can discuss that once you have written the test.
How does that sound?
@gfyoung we鈥檝e already iterated once with Tom, see #22520
Most helpful comment
@uiureo : Thanks for letting us know! That would definitely be one way to resolve this issue.