Using test.jsonl with the following contents:
{"test1":10}
{"test2":"hi"}
import pandas as pd

with open('test.jsonl', 'rb') as f:
    df = pd.read_json(f, lines=True, chunksize=1)
    for chunk in df:
        print(chunk)
When I try to use read_json on a binary file with lines=True and chunksize set, I get the following error:
TypeError: sequence item 0: expected str instance, bytes found
It works when I remove the chunksize parameter. My guess is that the chunked reader isn't taking into account that the file was opened in binary mode.
I expect to be able to read a file using lines=True and chunksize.
Output of pd.show_versions():

commit : None
pandas : 0.25.1
numpy : 1.15.2
pytz : 2019.1
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.4.0
Cython : None
pytest : 5.0.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.8.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.0.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.0
pytables : None
s3fs : 0.3.5
scipy : 1.2.0
sqlalchemy : 1.3.9
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
Any update here? I'm happy to work on this if someone could point me in the right direction.
Just out of curiosity, is there a specific reason you need to read it as a binary file? Is your .jsonl written in a binary format? You stated that your expectation is to be able to read a file using lines=True and chunksize, which could be accomplished if you use with open('test.jsonl', 'r'), as in the sketch below.
JSON is by definition a text-based format. I'm not fully aware of the specs for binary JSON, but I think it might be out of scope here and best left to a separate package to support.
The reason I'm reading it as a binary file is that it's on S3, so I'm using S3FS (https://github.com/dask/s3fs) to access the file handle, which only supports opening the files in a binary format.
Is the recommendation then to copy the file locally and then read it into pandas? If so, that's a bit painful, since the whole point of me wanting to use chunksize is that I don't want to have to handle the entire file at once :)
I was having a similar issue... it seems that when the data is chunked, it's read in as a byte string, and apparently StringIO doesn't like that.
I've put in a pull request #30989 which should fix it (I hope!)
I am having a similar issue. When I read a large JSON file from S3 with lines=True, chunksize=10000, it complains with "TypeError: sequence item 0: expected str instance, bytes found".