Pandas: allow detection of whether a csv file contains a header row or not

Created on 26 Dec 2018 · 6 Comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

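import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 3], 'B': [0.5, 0.2, 0.3, 0.2], 'C': ['a', 'b', 'c', 'd']})
df.to_csv('file.csv', header=True, index=False)
# Proposed API: HeaderDetector, detect_header_obj and detect_header do not exist in pandas
hd = HeaderDetector()
df = pd.read_csv('file.csv', detect_header_obj=hd, detect_header=True)
print(list(df.columns))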

Problem description

There are cases where you do not have time to check whether the CSV you want to read contains a header row (e.g. when performing fully automatic analytics on a large number of files).
I suggest allowing the caller to pass an object that is responsible for detecting whether a file has a header row. The object would implement has_header in a similar manner to csv.Sniffer.

Expected Output

Output of the snippet above

['A', 'B', 'C']
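
For illustration, the detector object proposed above could be approximated outside pandas today. The following is only a minimal sketch: the HeaderDetector class, its header_arg method, and the sample_size parameter are hypothetical names, and the detection simply delegates to the standard library's csv.Sniffer heuristic, which can guess wrong.

```python
import csv


class HeaderDetector:
    """Hypothetical helper; nothing like this exists in pandas."""

    def __init__(self, sample_size=1024):
        # Number of characters read from the start of the file for sniffing.
        self.sample_size = sample_size

    def has_header(self, sample):
        # Delegate to the standard library heuristic (a guess, not a guarantee).
        return csv.Sniffer().has_header(sample)

    def header_arg(self, path):
        # Translate the guess into a value suitable for read_csv's `header`:
        # 0 means "first row holds the column names", None means "no header".
        with open(path, newline="") as f:
            sample = f.read(self.sample_size)
        return 0 if self.has_header(sample) else None


# The data from the code sample above, with and without its header row.
WITH_HEADER = "A,B,C\n0,0.5,a\n1,0.2,b\n2,0.3,c\n3,0.2,d\n"
WITHOUT_HEADER = "0,0.5,a\n1,0.2,b\n2,0.3,c\n3,0.2,d\n"

detector = HeaderDetector()
print(detector.has_header(WITH_HEADER))
print(detector.has_header(WITHOUT_HEADER))
```

With pandas installed, the result could then be passed along as pd.read_csv('file.csv', header=detector.header_arg('file.csv')).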

Closing Candidate IO CSV Needs Discussion

Source

Iddoyadlin

Most helpful comment

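Is there any reason not to use the built-in CSV sniffer and pass its output to read_csv?

import csv
import pandas as pd

with open("file.csv", newline="") as f:
    has_header = csv.Sniffer().has_header(f.read(1024))  # has_header() takes a text sample

# read_csv does not accept a bool for header: use row 0 or None
pd.read_csv("file.csv", header=0 if has_header else None)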

On its own that doesn't seem too bad. I worry about the explosion in complexity you get when you have all of pandas' other keywords / capabilities. (How would that work for network files? How does it interact with row-based options like skiprows? And so on.)

TomAugspurger on 27 Dec 2018

👍3

All 6 comments

I am -1 here as I don't think this is ever really something you can guarantee so it would add a large amount of complexity, but let's see what others think

WillAyd on 27 Dec 2018


1.
I am -1 here as I don't think this is ever really something you can guarantee so it would add a large amount of complexity, but let's see what others think

It's true that you cannot guarantee the header detection is correct, but the same is true of other mechanisms that the pandas read_csv API already implements (some are quite complex, such as automatically detecting datetime formats when passing parse_dates and infer_datetime_format). The complexity of header detection depends on the actual implementation.

2.
Indeed, these are things that should be considered, but this is one of the advantages of having a single API for this (rather than combining csv.Sniffer's API with pandas). For skiprows, for example, I think that any rows that are skipped should not be considered by the header detection algorithm.
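
To make the skiprows point concrete, here is a rough sketch of detection that ignores skipped rows. The function name header_after_skiprows and its sample_lines parameter are made up for this example, only an integer skiprows is handled, and csv.Sniffer stands in for whatever detection algorithm would actually be used.

```python
import csv
import io
from itertools import islice


def header_after_skiprows(handle, skiprows=0, sample_lines=20):
    """Guess a value for read_csv's `header`, ignoring the first `skiprows` lines."""
    # Consume the skipped lines so they never reach the sniffer.
    for _ in islice(handle, skiprows):
        pass
    # Sniff only the lines that follow the skipped ones.
    sample = "".join(islice(handle, sample_lines))
    return 0 if csv.Sniffer().has_header(sample) else None


# Two preamble lines, then a header row, then data.
with_header = io.StringIO(
    "# exported by a tool\n# second note\n"
    "A,B,C\n0,0.5,a\n1,0.2,b\n2,0.3,c\n3,0.2,d\n"
)
# One preamble line, then data with no header row.
without_header = io.StringIO(
    "note line\n0,0.5,a\n1,0.2,b\n2,0.3,c\n3,0.2,d\n"
)

guess_with = header_after_skiprows(with_header, skiprows=2)
guess_without = header_after_skiprows(without_header, skiprows=1)
print(guess_with, guess_without)
```

Because read_csv counts header rows relative to the first line that is not skipped, the returned value could be used together with the same skiprows argument.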

I admit I am not familiar with all of pandas' internal implementations, and I agree that there are some API considerations that need to be made. If you find this feature interesting, I am willing to work on a pull request for this matter.

Iddoyadlin on 27 Dec 2018

I agree with Tom and Will that this is probably not a good fit for pandas internally, and also with the OP that this would be a really nice feature to have _somewhere_.

For the features wishlist: correctly round-trip df.to_csv when one or both of index/columns is a MultiIndex.

jbrockmendel on 28 Dec 2018

If I understand the issue correctly, some CSV files may not contain a header row, but most will. Why not have a function that reads the (user-specified) header rows from all the CSV files, regardless of whether each file actually contains one, and then uses the most frequently occurring row?

final_header = MostOf (list_of_all_header_rows)
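
One possible reading of the line above, as a small sketch: take the first row of every file and keep the one that occurs most often. The function name most_common_first_row is invented here, the inputs are CSV texts rather than file paths to keep the example self-contained, and the approach assumes the files share a schema.

```python
import csv
import io
from collections import Counter


def most_common_first_row(csv_texts):
    """Return the first row that occurs most often across the given CSV texts."""
    first_rows = []
    for text in csv_texts:
        # Read only the first parsed row of each text; skip empty inputs.
        row = next(csv.reader(io.StringIO(text)), None)
        if row is not None:
            first_rows.append(tuple(row))
    if not first_rows:
        return None
    # Counter.most_common(1) yields the (row, count) pair with the highest count.
    return Counter(first_rows).most_common(1)[0][0]


texts = [
    "A,B,C\n0,0.5,a\n",
    "A,B,C\n1,0.2,b\n",
    "2,0.3,c\n3,0.2,d\n",  # this one has no header row
]
final_header = most_common_first_row(texts)
print(final_header)
```

Files whose first row differs from final_header would then be treated as having no header row.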

mail2saiky on 28 Dec 2018

Why not have a function that reads the (user-specified) header rows from all the CSV files, regardless of whether each file actually contains one, and then uses the most frequently occurring row?

I'm not sure I fully understand this?

In terms of the broader discussion, I am indeed wary of adding more parameters to the read_csv function given how bloated it is. In reality, we should be trying to condense it! 🙂

Also, as was mentioned earlier, the way it would interact with the myriad other parameters we have makes the potential maintenance issue of this parameter somewhat daunting to me.

If you find this feature interesting, I am willing to work on a pull request for this matter.

@Iddoyadlin : I don't think we disagree that a feature would be nice, but we are very much concerned about how it would integrate with the existing implementation.

@TomAugspurger @WillAyd : That being said, perhaps we could fit something about this in the docs (or cookbook)? Though I have not experienced it, the concern does not sound super unreasonable to me.