Pandas: DataFrame.to_json silently ignores index parameter for most orients.

Created on 1 Mar 2019 · 19Comments · Source: pandas-dev/pandas

see
https://github.com/pandas-dev/pandas/issues/25513#issuecomment-469264925

Code Sample, a copy-pastable example if possible

from __future__ import print_function

import pandas as pd

df = pd.DataFrame.from_records(
    [
        {'id': 'abc', 'data': 'qwerty'},
        {'id': 'def', 'data': 'uiop'}
    ],
    index='id'
)

print(df.to_json(orient='records', index=True))

# Prints out:
[{"data":"qwerty"},{"data":"uiop"}]

series = df.squeeze()
print(series.to_json(orient='records', index=True))

# Prints out:
["qwerty","uiop"]

Problem description

When creating a DataFrame that has two columns, one to be used as an index and another for the data, if you call .to_json(orient='records') the index is omitted. I know that in theory I should be using a Series for this, but I'm using it to convert a CSV file into JSONL and I don't know what the CSV file is going to look like ahead of time.

In any case, squeezing the DataFrame into a Series doesn't work either. In fact, the bug in Series.to_json is even worse, as it produces an array of strings instead of an array of dictionaries.

This bug is present in master.

Expected Output

Expected output is:

[{"id":"abc", "data":"qwerty"},{"id":"def","data":"uiop"}]

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.1
pytest: 4.3.0
pip: 19.0.3
setuptools: 40.8.0
Cython: 0.29.6
numpy: 1.16.2
scipy: None
pyarrow: 0.12.1
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: 2.7.7 (dt dec pq3 ext lo64)
jinja2: None
s3fs: 0.2.0
fastparquet: 0.2.1
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Bug IO JSON

Source

dargueta

Most helpful comment

FYI. For those looking for a solution, I got around this by doing a simple: df['index_name'] = df.index before calling df.to_json(...)

ARDivekar on 16 Jan 2020

👍2

All 19 comments

@mroeschke in the function description its clearly mentioned that by using orient = "records" the index is not preserved... so what to here..?
@dargueta u could use df.reset_index(inplace=True) to make the index as a column before using to_json()

coderop2 on 4 Mar 2019

Right, so the index= parameter only works for split and table orients.

I think we should change the default split to None, and

Make the default index be True when orient is split or table.
Raise if index is specified for any other orient.

We should deprecate the current behavior (ignoring index for other orients) first.

TomAugspurger on 4 Mar 2019

I think ignoring index for other orients might not be good because sometimes index like dates is used....so in that case the index might get ignored completely

coderop2 on 4 Mar 2019

@coderop2 can you clarify? My proposal is to stop ignoring the index parameter when it's not valid for the given orient.

TomAugspurger on 4 Mar 2019

in the function description its clearly mentioned that by using orient = "records" the index is not preserved.

@coderop2 I disagree about "clearly". The documentation of the parameter states:

Whether to include the index values in the JSON string. Not including the index (index=False) is only supported when orient is ‘split’ or ‘table’.

I took this to mean the index is always included when the orientation is "records". We should at the very least rephrase this to be more explicit.

@TomAugspurger I agree with most of your proposed solution, though I don't understand why the index should be automatically dropped for orient="records". Dropping it would make sense if the index is the default sequential series, but if I explicitly specify a column as an index, I'd expect it to be included by default.

dargueta on 4 Mar 2019

We wouldn't want data-dependent behavior for whether the index is included.

On Mon, Mar 4, 2019 at 12:06 PM Diego Argueta notifications@github.com
wrote:

in the function description its clearly mentioned that by using orient =
"records" the index is not preserved.

@coderop2 https://github.com/coderop2 I disagree about "clearly". The
documentation of the parameter states:

Whether to include the index values in the JSON string. Not including the
index (index=False) is only supported when orient is ‘split’ or ‘table’.

I took this to mean the index is always included when the orientation
is "records".

@TomAugspurger https://github.com/TomAugspurger I agree with most of
your proposed solution, though I don't understand why the index should be
automatically dropped for orient="records". Dropping it would make sense
if the index is the default sequential series, but if I explicitly
specify a column as an index, I'd expect it to be included by default.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/pandas-dev/pandas/issues/25513#issuecomment-469354740,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ABQHIl-dMOI4AmRz-6EbJWrOCMhDbBwEks5vTWCpgaJpZM4bZsDw
.

TomAugspurger on 4 Mar 2019

Good point.

dargueta on 4 Mar 2019

@TomAugspurger I am sorry I misunderstood for what you were explaining...I understand now
@dargueta I was talking about the description given in the code...it goes like this
Encoding/decoding a Dataframe using ``'records'`` formatted JSON.
Note that index labels are not preserved with this encoding.
@TomAugspurger since I am new to open source can I work up a solution and present it here...if that solution is acceptable then I will create a pull req

coderop2 on 5 Mar 2019

I still think dropping the index is a bit of a nasty surprise, especially if my code gets handed a DataFrame that someone else's code created. I don't want to break _their_ code if I change my serialization format on the back end. Do you think there should be a way to override the default behavior, at least for orient="records"?

dargueta on 5 Mar 2019

@dargueta we need to consider backwards compatibility. Making df.to_json(orient='records') include the index would be an API breaking change.

We'll need a period where providing index=True with the records orient raises. After that time period we can reconsider allowing it.

In the meantime, you'll need to .reset_index() before calling .to_json with the records orient.

TomAugspurger on 5 Mar 2019

@TomAugspurger since I am new to open source can I work up a solution and present it here...if that solution is acceptable then I will create a pull req

That'd be great! Let us know if you need help getting started.

TomAugspurger on 5 Mar 2019

Oh I didn't mean making it the default, just providing a way to override the default behavior so that we can include it if desired without breaking backwards compatibility.

dargueta on 5 Mar 2019

I don't think there's a way though, since the default index is currently
True, right?

On Tue, Mar 5, 2019 at 10:04 AM Diego Argueta notifications@github.com
wrote:

Oh I didn't mean making it the default, just providing a way to
override the default behavior so that we can include it if desired without
breaking backwards compatibility.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/pandas-dev/pandas/issues/25513#issuecomment-469737412,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ABQHIlK3ra6VLIiqn-ybVtscU20H-Pefks5vTpWDgaJpZM4bZsDw
.

TomAugspurger on 5 Mar 2019

Well if we change the default of index to be None like you said earlier, then passing True explicitly would indicate the intent, no?

dargueta on 5 Mar 2019

Can you use the table orient in this case? It at least provides the data that you are looking for, albeit with some extra metadata attached.

We already have quite a few JSON serialization formats so I'd be hesitant to add more, especially given this functionality overlaps already with one of the other formats

WillAyd on 5 Mar 2019

Hmm, yes. I worry a bit about people who were previously passing
index=True... But perhaps that's
not a valid concern.

On Tue, Mar 5, 2019 at 10:40 AM Diego Argueta notifications@github.com
wrote:

Well if we change the default of index to be None like you said earlier,
then passing True explicitly would indicate the intent, no?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/pandas-dev/pandas/issues/25513#issuecomment-469753148,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ABQHIjPmjxRyEGUel0aZkBWwGTJCD63tks5vTp4LgaJpZM4bZsDw
.

TomAugspurger on 5 Mar 2019

Can you use the table orient in this case? It at least provides the data that you are looking for, albeit with some extra metadata attached.

Sadly no, I'm dumping it using lines=True so that it can be read by a database. I didn't put that in my example to minimize the number of arguments that could be interacting.

We already have quite a few JSON serialization formats so I'd be hesitant to add more

I wasn't suggesting adding another format, only changing the behavior of index.

Hmm, yes. I worry a bit about people who were previously passing index=True

I wouldn't say that's an invalid concern, but it does make me question why someone would deliberately pass index=True and orient='records' if they weren't trying to get the behavior I wanted, based on what the documentation states. I imagine the change be an issue for users who're dynamically constructing the arguments and are relying on index being ignored. At the risk of coming off as a pedantic jerk, I'd say relying on that behavior is unwise to begin with.

dargueta on 5 Mar 2019

@TomAugspurger if we are ignoring the index for the records orient then why cannot we just set it to false. But if we do a error is thrown accrding to:
if not index and orient not in ['split', 'table']:
raise ValueError("'index=False' is only valid when 'orient' is "" 'split' or 'table' ")
Because if you include records in the list and pass index as false no error is thrown

coderop2 on 6 Mar 2019

FYI. For those looking for a solution, I got around this by doing a simple: df['index_name'] = df.index before calling df.to_json(...)

ARDivekar on 16 Jan 2020

👍2

Was this page helpful?

0 / 5 - 0 ratings

Related issues

to_sql UnicodeEncodeError

matthiasroder · 3Comments

BUG: fillna with inplace does not work with multiple columns selection by loc

hiiwave · 3Comments

AttributeError: Cannot use pandas from a script file

songololo · 3Comments

Can't read csv using python pandas

Ashutosh-Srivastav · 3Comments

Cannot use apply on Series with Timestamp values

nathanielatom · 3Comments

Pandas: DataFrame.to_json silently ignores index parameter for most orients.

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

Most helpful comment

All 19 comments

Related issues

Output of `pd.show_versions()`