Using test.jsonl with the following contents:
{"test1":10}
{"test2":"hi"}
import pandas as pd

with open('test.jsonl', 'rb') as f:
    df = pd.read_json(f, lines=True, chunksize=1)
    for chunk in df:
        print(chunk)
When I try to use read_json on a binary file with lines=True and chunksize set, I get the following error:
TypeError: sequence item 0: expected str instance, bytes found
It works when I remove the chunksize parameter. My guess is that the chunked reader isn't taking into account that the file was opened in binary mode.
I expect to be able to read a file using lines=True and chunksize.
Output of pd.show_versions():

commit : None
pandas : 0.25.1
numpy : 1.15.2
pytz : 2019.1
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.4.0
Cython : None
pytest : 5.0.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.8.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.0.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.0
pytables : None
s3fs : 0.3.5
scipy : 1.2.0
sqlalchemy : 1.3.9
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
Any update here? I'm happy to work on this if someone could point me in the right direction.
Just out of curiosity, is there a specific reason you need to read it as a binary file? Is your .jsonl written in a binary format? You stated that your expectation is to be able to read a file using lines=True and chunksize, which could be accomplished if you use with open('test.jsonl', 'r'), as in the sketch below.
JSON is by definition a text-based format. I'm not fully aware of the specs for binary JSON, but I think it might be out of scope here and best left to a separate package to support.
The reason I'm reading it as a binary file is that it's on S3, so I'm using S3FS (https://github.com/dask/s3fs) to access the file handle, which only supports opening the files in a binary format.
Is the recommendation then to copy the file locally and then read it into pandas? If so, that's a bit painful, since the whole point of me wanting to use chunksize is that I don't want to have to handle the entire file at once :)
I was having a similar issue... it seems that when the data is chunked, it's read in as a byte string, and apparently StringIO doesn't like that.
I've put in a pull request #30989 which should fix it (I hope!)
I am having a similar issue. When I read a large JSON file from S3 with lines=True, chunksize=10000, it complains with "TypeError: sequence item 0: expected str instance, bytes found".