Pandas: read_csv should default to index_col = 0

Created on 28 Dec 2018 · 8 Comments · Source: pandas-dev/pandas

Code Sample

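import pandas as pd

df.to_csv(file_path)         # default index=True: the index is written as an unnamed first column
df = pd.read_csv(file_path)  # default index_col=None: that column is read back as ordinary data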

Problem description

Currently, the default CSV writing behaviour is to write the index column. The default reading behaviour, however, is to assume there is no index column in the file. This is not intuitive when writing and reading files.

The expected behaviour is that a file written without any index option can also be read back without any index option. However, the default writing and reading behaviour results in an extra unnamed column ("Unnamed: 0").
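A minimal sketch of the behaviour described above. The frame contents and the in-memory buffer (used instead of a real file_path) are assumptions made only for illustration:

```python
import io

import pandas as pd

# Illustrative data; any DataFrame with a default index behaves the same way.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Write with the defaults (index=True) into an in-memory buffer.
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)

# Read with the defaults (index_col=None).
round_tripped = pd.read_csv(buffer)

print(list(df.columns))             # ['a', 'b']
print(list(round_tripped.columns))  # ['Unnamed: 0', 'a', 'b']
```

The written index has no header name, so read_csv labels it "Unnamed: 0" and treats it as a regular data column.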

One fix is to change the default index_col for read_csv to 0; another is to change the default of the index argument of to_csv to False. The former is probably preferable, as it preserves the index.
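Until either default changes, both options can be spelled out explicitly. The sketch below is illustrative only; the sample frame and the variable names are assumptions:

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Option 1: keep the index in the file and tell read_csv where to find it.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
restored_with_index = pd.read_csv(with_index, index_col=0)

# Option 2: drop the index when writing, then read with the defaults.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
restored_without_index = pd.read_csv(without_index)

print(restored_with_index.equals(df))
print(restored_without_index.equals(df))
```

Both round-trips recover this particular frame because its index is the default 0..n-1 range. With a meaningful index (dates, IDs), option 2 would discard it, which is the "preserves information" argument for option 1.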

Labels: API Design, IO CSV

All 8 comments

At this point, both of those defaults are unlikely to be changed.

pd.DataFrame.from_csv exists for easier round-tripping, though it is deprecated; see e.g. #10163.

Because it is schema-less, CSV is never a particularly safe format for round-tripping; consider using something binary like Parquet or HDF5 instead.
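As a small illustration of the schema-less point (the column name and values are made up for this sketch), a datetime column does not survive a default CSV round-trip, independently of the index question:

```python
import io

import pandas as pd

df = pd.DataFrame({"when": pd.to_datetime(["2018-12-28", "2018-12-29"])})

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index left out so only the dtype issue is visible
buffer.seek(0)
restored = pd.read_csv(buffer)

print(df["when"].dtype)        # a datetime64 dtype
print(restored["when"].dtype)  # no longer a datetime dtype; the dates come back as text
```

CSV stores only text, so read_csv has to infer types, and it does not parse dates unless asked to (for example via parse_dates).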

> At this point, both of those defaults are unlikely to be changed.

Agreed, though we might entertain this option more in a super-breaking release like 1.0.

This has been rejected before in #12627 with the following reply:

> Has been this way almost since the beginning.
>
> The idea is that .to_csv and .from_csv are inverses.
>
> Essentially impossible to change at this point. But to be honest, it's actually a sensible default. Indexes are more and more important. If you are not using them, you should.

However, .from_csv has been deprecated in favor of .read_csv, which would only be the inverse of .to_csv if this default were changed.
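For reference, the deprecated from_csv differed from read_csv mainly in two defaults, index_col=0 and parse_dates=True. A sketch of reproducing that inverse behaviour with read_csv, using made-up time-series data:

```python
import io

import pandas as pd

# Illustrative frame with a datetime index.
df = pd.DataFrame(
    {"value": [1.0, 2.0]},
    index=pd.to_datetime(["2018-12-28", "2018-12-29"]),
)

buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)

# The two defaults that made from_csv the counterpart of to_csv.
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)

print(isinstance(restored.index, pd.DatetimeIndex))
print(restored["value"].tolist())
```

With parse_dates=True, read_csv attempts to parse the index as dates, so the datetime index is recovered instead of becoming an "Unnamed: 0" text column.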

I can only interpret the other argument (similar to @chris-b1's reply) to be "because legacy". I personally consider this to be a poor argument in any case. Defaults have been changed before, is there a specific reason this one shouldn't?

> Because it is schema-less, csv is never a particularly safe format for round-tripping, consider using something binary like parquet or HDF5 instead

Agreed. However, while CSV may not be the best data format for round-tripping, in practice this is one of the most common use-cases. Many new users are confused as to why unnamed columns appear in their files seemingly at random. After years of using Pandas, I still regularly forget to set the right arguments to allow round-tripping. In my opinion, the inconsistency of the current defaults only adds unnecessary cognitive load.

I think this is a good issue for v0.25.

Just personally, I'd be more sympathetic to changing the default on to_csv to index=False, but that has its own set of problems.

> but that has its own set of problems

Agreed. But the point is well taken that we should pick a suitable (and consistent) default for both to avoid the confusions described above. That being said, it would be good to get some more opinions on what people generally use in the wild (i.e. with an index, without) before settling on one.

> Defaults have been changed before, is there a specific reason this one shouldn't?

We aren't saying that it shouldn't, but asking for us to do this in 0.25.0 is rushing things IMO.

> asking for us to do this in 0.25.0 is rushing things IMO.

Sorry, I didn't mean to rush it. I'm not very familiar with the Pandas release cycle. As long as it's not pushed back to 1.0.0 - I don't think it's that breaking.

> just personally, I'd be more sympathetic to changing the default on to_csv to index=False, but that has its own set of problems

I'm leaning towards this, too, if nothing else because it would match what most users I know already do.

Agreed with https://github.com/pandas-dev/pandas/issues/24468#issuecomment-450406181 that this is too large of a change for us at this point. What do you think @itko?

I tend to agree with @JeroenDelcour. Not only have I seen multiple users get confused by the appearance of unnamed columns, I myself often forget to set the index arguments.

It seems to me we should be prioritizing usability and intuitiveness. Can't we schedule this for release and add a FutureWarning?
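To make the FutureWarning suggestion concrete, here is a rough sketch of what such a deprecation path could look like. The wrapper read_csv_with_warning, the sentinel, and the warning text are all hypothetical; nothing here is part of pandas or a proposal from the maintainers:

```python
import io
import warnings

import pandas as pd

_NOT_GIVEN = object()  # sentinel: distinguishes "omitted" from an explicit None


def read_csv_with_warning(source, index_col=_NOT_GIVEN, **kwargs):
    """Hypothetical wrapper around pd.read_csv, not part of pandas."""
    if index_col is _NOT_GIVEN:
        warnings.warn(
            "The default of index_col may change in a future version; "
            "pass index_col explicitly to keep the current behaviour.",
            FutureWarning,
            stacklevel=2,
        )
        index_col = None  # the current pandas default
    return pd.read_csv(source, index_col=index_col, **kwargs)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    read_csv_with_warning(io.StringIO("a,b\n1,2\n"))               # relies on the default: warns
    read_csv_with_warning(io.StringIO("a,b\n1,2\n"), index_col=0)  # explicit: silent

print([w.category.__name__ for w in caught])
```

Only callers who rely on the implicit default see the warning, which is the usual way to give users a release cycle to make their choice explicit before a default flips.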
