Pandas: Feature Request: skipcols in .read_csv

Created on 22 Aug 2015 · 18 Comments · Source: pandas-dev/pandas

I'd like to read a set of csv files but exclude specific columns. read_csv currently has a usecols keyword, but it requires writing a list of all the columns to keep. This is a bit tedious and, more importantly, not all files have the same columns, so usecols would not work in the general case, whereas a complementary keyword would. Can a skipcols keyword be added to 0.17 that accepts a list of column names and reads all but those columns into a DataFrame? Thanks.

xref #4749
xref #8985
xref #6710

Docs Enhancement IO CSV

pylang

All 18 comments

So read_csv() is defined by calling _make_parser_function() which calls _read(). Any instructions would be appreciated. It's a bit confusing to me. @jreback

terrytangyuan on 1 Sep 2015

parser is a bit complicated. see how usecols is used.

jreback on 1 Sep 2015

It looks like the code related to usecols needs to be re-factored before I add things on top of it. I won't be able to re-factor coz I might break a lot of internal things. @jreback

terrytangyuan on 2 Sep 2015

Is there a spot in the code where we know all the columns before starting to parse the rows? If so you can assign usecols=set(all_cols) - set(skipcols) (would need to fix up the ordering afterwards) and go from there.

TomAugspurger on 2 Sep 2015
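
The suggestion above can also be approximated from user code, without touching the parser. This is only a rough sketch: `read_csv_skipcols` is a hypothetical helper, not a pandas function, and it assumes the source is either a file path or a seekable buffer. It parses the header alone with `nrows=0`, then builds the `usecols` list in file order so no reordering is needed afterwards.

```python
from io import StringIO

import pandas as pd


def read_csv_skipcols(source, skipcols):
    """Hypothetical helper: read a CSV, leaving out the columns in skipcols."""
    # Parse only the header row to learn which columns this file has.
    header = pd.read_csv(source, nrows=0).columns
    # A buffer has been consumed by the header read, so rewind it.
    if hasattr(source, "seek"):
        source.seek(0)
    skip = set(skipcols)
    # Keep file order; names in skipcols that the file lacks are ignored.
    keep = [name for name in header if name not in skip]
    return pd.read_csv(source, usecols=keep)


csv_text = "a,b,c\n1,2,3\n4,5,6\n"
df = read_csv_skipcols(StringIO(csv_text), skipcols=["b", "missing"])
print(list(df.columns))  # ['a', 'c']
```

Because unknown names are simply ignored, the same `skipcols` list can be reused across files that do not all share the same columns.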

Yeah I did something similar but stopped due to other code related to usecols. I'll look into it again. Thanks.

terrytangyuan on 2 Sep 2015

Any progress on this addition? Thanks.

pylang on 20 Dec 2015

ping @jreback

pylang on 10 Jul 2016

if you submit a PR there will be progress
we have 1700 open issues

jreback on 10 Jul 2016

many thanks

pylang on 11 Jul 2016

In a similar vein, is there a way to read in a subset of rows? In other words, is there a counterpart to the skiprows keyword? For example, this feature is desired:

df = pd.read_csv("bigdata.csv")
df
# Output: Millions of rows

selection = [i for i in range(0, 1000000) if i % 2 == 0]
subset = pd.read_csv("bigdata.csv", use_rows=selection)    # skip all rows except those listed
subset
# Output: only even rows for the first million

pylang on 25 Sep 2016
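
One way to get the effect of the requested `use_rows` is to invert the selection and pass it to `skiprows`. This is a sketch that assumes a pandas version in which `skiprows` accepts a callable (the idea proposed later in this thread); the names `wanted` and `skip_unwanted` are made up for illustration. The callable receives the line number within the file, so line 0 is the header and data row `i` sits on line `i + 1`.

```python
from io import StringIO

import pandas as pd

csv_text = "x\n10\n11\n12\n13\n14\n15\n"

# 0-based positions of the data rows to keep (here: the even ones).
wanted = set(range(0, 6, 2))


def skip_unwanted(line_number):
    # Never skip the header (line 0); skip data rows not in the selection.
    return line_number != 0 and (line_number - 1) not in wanted


subset = pd.read_csv(StringIO(csv_text), skiprows=skip_unwanted)
print(subset["x"].tolist())  # [10, 12, 14]
```

Using a set for `wanted` keeps the membership test cheap, since the callable is evaluated once per line of the file.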

@pylang : We now accept a callable for usecols. Does that help to resolve this issue?

gfyoung on 4 Jan 2017

👍1

sure, maybe an example of doing that in io.rst would be helpful?

jreback on 4 Jan 2017

@gfyoung I'm not sure what you have in mind. I am interested in selecting rows. An example would be helpful, thank you.

pylang on 4 Jan 2017

@pylang :

1) Your original issue was for skipcols though?
2) skiprows is currently not supported by the C engine. However, we could by all means allow skiprows to be a callable like usecols is? How does that sound? Something like:
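~~~python
from io import StringIO
from pandas import read_csv

data = 'a,b,c\n1,2,3\n2,3,4'
# Skip odd-numbered rows, so the header (row 0) and row 2 are kept.
read_csv(StringIO(data), skiprows=lambda x: x % 2 == 1, engine='python')
#    a  b  c
# 0  2  3  4
~~~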
where x is the row number (starting at 0)

gfyoung on 4 Jan 2017

@jreback : There are examples in the docs to illustrate usecols, but we can also mention that we can use the callable to exclude columns as well. How does that sound?

gfyoung on 4 Jan 2017
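
For reference, excluding columns with the `usecols` callable could look roughly like this. It is a sketch assuming a pandas version where `usecols` accepts a callable; the column names and the `skipcols` set are invented for the example. The callable is evaluated against each column name, and columns for which it returns True are kept.

```python
from io import StringIO

import pandas as pd

csv_text = "a,b,c\n1,2,3\n4,5,6\n"

# Names to leave out; one of them is deliberately absent from this file.
skipcols = {"b", "not_in_this_file"}

# Keep every column whose name is not in skipcols.
df = pd.read_csv(StringIO(csv_text), usecols=lambda name: name not in skipcols)
print(list(df.columns))  # ['a', 'c']
```

Since the callable only tests membership, it works unchanged on files that lack some of the listed columns, which was the original motivation for skipcols.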

yes that's what i mean, to show using a callable to skipcols

jreback on 4 Jan 2017

👍1

@gfyoung I think your example for skiprows would suffice. And yes you are correct re: skipcols. A similar callable option to filter usecols with an example in the docs would be sufficient imo.

pylang on 4 Jan 2017

@pylang : #15059 is up to address skiprows. I've hit a roadblock at this point implementing it for the C engine, so any input on that would be appreciated!

gfyoung on 4 Jan 2017

🎉1 👍1
