Pandas: unable to write JSON to S3 using to_json

Created on 10 Sep 2019 · 12 Comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

S3 paths work for reading and writing CSV, and reading JSON from an S3 path also works just fine. However, I am not able to write JSON files using the to_json method.

df.to_json(s3uri, orient='values')
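
For reference, a minimal sketch that reproduces the failure (the bucket and key are placeholders; reading the same kind of path back with pd.read_json works):

import pandas as pd

df = pd.DataFrame({'user_id': [1, 2, 3]})

# Reading from S3 works, but writing raises FileNotFoundError
df.to_json('s3://bucket/folder/foo.json', orient='values')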

Problem description

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-...> in <module>
      1 final_users['user_id'].to_json(s3_path + 'users.json',
----> 2                                orient='values')

~/anaconda3/envs/qa-tool/lib/python3.7/site-packages/pandas/core/generic.py in to_json(self, path_or_buf, orient, date_format, double_precision, force_ascii, date_unit, default_handler, lines, compression, index)
   2271                             default_handler=default_handler,
   2272                             lines=lines, compression=compression,
-> 2273                             index=index)
   2274 
   2275     def to_hdf(self, path_or_buf, key, **kwargs):

~/anaconda3/envs/qa-tool/lib/python3.7/site-packages/pandas/io/json/json.py in to_json(path_or_buf, obj, orient, date_format, double_precision, force_ascii, date_unit, default_handler, lines, compression, index)
     66 
     67     if isinstance(path_or_buf, compat.string_types):
---> 68         fh, handles = _get_handle(path_or_buf, 'w', compression=compression)
     69         try:
     70             fh.write(s)

~/anaconda3/envs/qa-tool/lib/python3.7/site-packages/pandas/io/common.py in _get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text)
    425         elif is_text:
    426             # Python 3 and no explicit encoding
--> 427             f = open(path_or_buf, mode, errors='replace', newline="")
    428         else:
    429             # Python 3 and binary mode

FileNotFoundError: [Errno 2] No such file or directory: 's3://bucket/folder/foo.json'

Expected Output

No error; the file should be written to the S3 bucket.

Output of pd.show_versions()


INSTALLED VERSIONS

commit: None

pandas: 0.24.2
pytest: None
pip: 19.2.2
setuptools: 41.0.1
Cython: None
numpy: 1.16.4
scipy: 1.3.1
pyarrow: None
xarray: None
IPython: 7.7.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.1.0
openpyxl: 2.6.2
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: 0.2.1
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Labels: IO JSON, IO Network

All 12 comments

Thanks for the report. If you'd like to investigate and push a PR to fix it, that would certainly be welcome.

I think you can try using StringIO if you are on Python 3 (on Python 2 it should be BytesIO):

import io

import boto3
import pandas as pd

# Build a small example DataFrame
cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
        'Price': [22000, 25000, 27000, 35000]}
df = pd.DataFrame(cars, columns=['Brand', 'Price'])

# Write the JSON into an in-memory buffer instead of an S3 path
jsonBuffer = io.StringIO()
df.to_json(jsonBuffer, orient='values')

# Upload the buffer contents with boto3 (credentials and bucket are placeholders)
session = boto3.Session(aws_access_key_id=aws_access_key_id,
                        aws_secret_access_key=aws_secret_access_key)
s3 = session.resource('s3')
s3.Object(bucket_name, 'PATH/TO/test.json').put(Body=jsonBuffer.getvalue())

Why was this issue closed? The problem still persists.

@rohitkg98 issues are closed when the corresponding PR is merged.
If you have a reproducible example that doesn't work on master, please open a new issue.

@jreback This is still broken using the latest pandas 1.0.5 on Anaconda Python 3.6.10 on the Amazon Deep Learning Ubuntu image. It is clearly trying to do a simple open() on the filesystem with no S3 handling whatsoever.

import numpy as np
import pandas as pd
import pyarrow
import s3fs
import torch

...

    df.to_json(
        output_path,
        orient='records',
        lines=True,
    )
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/data/matcha.doodle.backend/ingestion/stackoverflow/encode_records.py", line 95, in encode_records
    lines=True,
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandas/core/generic.py", line 2364, in to_json
    indent=indent,
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandas/io/json/_json.py", line 92, in to_json
    fh, handles = get_handle(path_or_buf, "w", compression=compression)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandas/io/common.py", line 431, in get_handle
    f = open(path_or_buf, mode, errors="replace", newline="")
FileNotFoundError: [Errno 2] No such file or directory: 's3://mypath/0.json'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "ingestion/stackoverflow/encode_bert_docs.py", line 53, in <module>
    main()
  File "ingestion/stackoverflow/encode_bert_docs.py", line 41, in main
    lengths = pool.map(encode_records, device_numbers)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/data/matcha.doodle.backend/ingestion/stackoverflow/encode_records.py", line 95, in encode_records
    lines=True,
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandas/core/generic.py", line 2364, in to_json
    indent=indent,
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandas/io/json/_json.py", line 92, in to_json
    fh, handles = get_handle(path_or_buf, "w", compression=compression)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandas/io/common.py", line 431, in get_handle
    f = open(path_or_buf, mode, errors="replace", newline="")
FileNotFoundError: [Errno 2] No such file or directory: 's3://mypath/0.json'
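
Until 1.1 is out, one workaround is to serialize in memory and push the bytes through s3fs directly, since pandas is only mishandling the string path (a sketch assuming AWS credentials are available in the environment; bucket and key are placeholders):

import pandas as pd
import s3fs

df = pd.DataFrame({'a': [1, 2, 3]})

# Bypass pandas' plain open() on the "s3://" string: serialize first,
# then write the bytes to S3 through an s3fs file handle
fs = s3fs.S3FileSystem()
with fs.open('s3://mybucket/mypath/0.json', 'wb') as f:
    f.write(df.to_json(orient='records', lines=True).encode('utf-8'))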

@rjurney this was merged for 1.1, which has not been released yet

@jreback thanks, any idea when it will be released?

Likely in July.

@jreback With 1.1.0 this functionality is indeed back! However, there is an issue when specifying compression.

For example, writing a DataFrame to S3 like this:
df.to_json("s3://bucket_key_path.gz", compression="gzip")
writes the file as plain text rather than gzip-compressed.

Should I create a new issue to tackle this?

you can create a new issue

@manugarri, have you already created the issue? Can you link it? Thanks

@imanebosch I did not; I ended up using Dask directly.
