S3 paths work for reading and writing CSV. However, I am not able to write JSON files using the to_json method. Reading JSON from an S3 path seems to work just fine.
df.to_json(s3uri, orient='values')
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
in
1 final_users['user_id'].to_json(s3_path + 'users.json',
----> 2 orient='values')
~/anaconda3/envs/qa-tool/lib/python3.7/site-packages/pandas/core/generic.py in to_json(self, path_or_buf, orient, date_format, double_precision, force_ascii, date_unit, default_handler, lines, compression, index)
2271 default_handler=default_handler,
2272 lines=lines, compression=compression,
-> 2273 index=index)
2274
2275 def to_hdf(self, path_or_buf, key, **kwargs):
~/anaconda3/envs/qa-tool/lib/python3.7/site-packages/pandas/io/json/json.py in to_json(path_or_buf, obj, orient, date_format, double_precision, force_ascii, date_unit, default_handler, lines, compression, index)
66
67 if isinstance(path_or_buf, compat.string_types):
---> 68 fh, handles = _get_handle(path_or_buf, 'w', compression=compression)
69 try:
70 fh.write(s)
~/anaconda3/envs/qa-tool/lib/python3.7/site-packages/pandas/io/common.py in _get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text)
425 elif is_text:
426 # Python 3 and no explicit encoding
--> 427 f = open(path_or_buf, mode, errors='replace', newline="")
428 else:
429 # Python 3 and binary mode
FileNotFoundError: [Errno 2] No such file or directory: 's3://bucket/folder/foo.json'
Expected output: no error, and the file is written to the S3 bucket.
Output of pd.show_versions():
commit: None
pandas: 0.24.2
pytest: None
pip: 19.2.2
setuptools: 41.0.1
Cython: None
numpy: 1.16.4
scipy: 1.3.1
pyarrow: None
xarray: None
IPython: 7.7.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.1.0
openpyxl: 2.6.2
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: 0.2.1
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
Thanks for the report. If you'd like to investigate and push a PR to fix this, it would certainly be welcome.
I think you can try using StringIO if you are on Python 3; on Python 2 it should be BytesIO.
import io

import boto3
import pandas as pd

# Sample data
Cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
        'Price': [22000, 25000, 27000, 35000]}
df = pd.DataFrame(Cars, columns=['Brand', 'Price'])

# Serialize the JSON into an in-memory buffer instead of an s3:// path
jsonBuffer = io.StringIO()
df.to_json(jsonBuffer, orient='values')

# Upload the buffer contents with boto3 (credentials and bucket name are placeholders)
session = boto3.Session(aws_access_key_id=aws_access_key_id,
                        aws_secret_access_key=aws_secret_access_key)
s3 = session.resource('s3')
s3.Object(Bucket, 'PATH/TO/test.json').put(Body=jsonBuffer.getvalue())
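Alternatively, a minimal sketch that writes through s3fs directly should also work; the bucket/key below is a placeholder and credentials are assumed to come from the environment:

import pandas as pd
import s3fs

df = pd.DataFrame({'Brand': ['Honda Civic', 'Toyota Corolla'],
                   'Price': [22000, 25000]})

# Serialize in memory, then write the encoded bytes to S3 ourselves
body = df.to_json(orient='values').encode('utf-8')
fs = s3fs.S3FileSystem()  # credentials resolved from the environment
with fs.open('s3://bucket/folder/foo.json', 'wb') as f:
    f.write(body)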
Why was this issue closed? The problem still persists
@rohitkg98 issues are closed when the corresponding PR is merged.
If you have a reproducible example that doesn't work on master, please open a new issue.
@jreback This is still broken using the latest pandas 1.0.5 on Anaconda Python 3.6.10 on the Amazon Deep Learning Ubuntu image. It is clearly trying to do a simple open() on the filesystem with no S3 handling whatsoever.
import numpy as np
import pandas as pd
import pyarrow
import s3fs
import torch
...
df.to_json(
    output_path,
    orient='records',
    lines=True,
)
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/data/matcha.doodle.backend/ingestion/stackoverflow/encode_records.py", line 95, in encode_records
lines=True,
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandas/core/generic.py", line 2364, in to_json
indent=indent,
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandas/io/json/_json.py", line 92, in to_json
fh, handles = get_handle(path_or_buf, "w", compression=compression)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandas/io/common.py", line 431, in get_handle
f = open(path_or_buf, mode, errors="replace", newline="")
FileNotFoundError: [Errno 2] No such file or directory: 's3://mypath/0.json'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "ingestion/stackoverflow/encode_bert_docs.py", line 53, in <module>
main()
File "ingestion/stackoverflow/encode_bert_docs.py", line 41, in main
lengths = pool.map(encode_records, device_numbers)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/data/matcha.doodle.backend/ingestion/stackoverflow/encode_records.py", line 95, in encode_records
lines=True,
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandas/core/generic.py", line 2364, in to_json
indent=indent,
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandas/io/json/_json.py", line 92, in to_json
fh, handles = get_handle(path_or_buf, "w", compression=compression)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandas/io/common.py", line 431, in get_handle
f = open(path_or_buf, mode, errors="replace", newline="")
FileNotFoundError: [Errno 2] No such file or directory: 's3://mypath/0.json'
@rjurney this was merged for 1.1, which has not been released yet.
@jreback thanks, any idea when it will be released?
Likely in July.
@jreback I think with 1.1.0 this functionality is in fact back! However, there is an issue when specifying compression.
For example, writing a DataFrame to S3 like this:
df.to_json("s3://bucket_key_path.gz", compression="gzip") writes the file as plain text, not compressed.
Should I create a new issue to tackle this?
you can create a new issue
@manugarri, have you already created the issue? Can you link it? Thanks
@imanebosch I did not; I ended up using dask directly.