I'd like to read a set of CSV files but exclude specific columns. read_csv currently has a usecols keyword, but it requires listing every column to keep. This is tedious and, more importantly, not all files have the same columns, so usecols would not work in the general case, whereas a complementary keyword would. Can a skipcols keyword be added to 0.17 that accepts a list of column names and reads all but those columns into a DataFrame? Thanks.
xref #4749
xref #8985
xref #6710
So read_csv() is defined by calling _make_parser_function() which calls _read(). Any instructions would be appreciated. It's a bit confusing to me. @jreback
parser is a bit complicated. see how usecols is used.
It looks like the code related to usecols needs to be re-factored before I add things on top of it. I won't be able to re-factor coz I might break a lot of internal things. @jreback
Is there a spot in the code where we know all the columns before starting to parse the rows? If so you can assign usecols=set(all_cols) - set(skipcols) (would need to fix up the ordering afterwards) and go from there.
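Outside the parser, the same idea can be sketched in user code. This is only an illustration, not how pandas implements anything: read_csv_skipcols is a hypothetical helper name, and it assumes the source is either a path or a seekable buffer. It reads just the header with nrows=0, drops the unwanted names, and passes the rest to usecols in header order.

```python
from io import StringIO

import pandas as pd


def read_csv_skipcols(source, skipcols, **kwargs):
    """Hypothetical helper: read every column except those in skipcols."""
    # Parse only the header row to learn which columns this file has.
    all_cols = pd.read_csv(source, nrows=0, **kwargs).columns
    # A buffer was consumed by the header read, so rewind it.
    if hasattr(source, "seek"):
        source.seek(0)
    skip = set(skipcols)
    # A list comprehension (rather than a set difference) keeps header order.
    keep = [col for col in all_cols if col not in skip]
    return pd.read_csv(source, usecols=keep, **kwargs)[keep]


data = "a,b,c\n1,2,3\n4,5,6\n"
# Names in skipcols that are absent from the file are simply ignored.
df = read_csv_skipcols(StringIO(data), ["b", "not_in_this_file"])
print(df)
```

The cost is that the header is parsed twice, which is usually negligible next to reading the rows.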
Yeah I did something similar but stopped due to other code related to usecols. I'll look into it again. Thanks.
Any progress on this addition? Thanks.
ping @jreback
if you submit a PR there will be progress
we have 1700 open issues
many thanks
In a similar vein, is there a way to read in a subset of rows? In other words, is there a counterpart to the skiprows keyword? For example, this feature is desired:
~~~python
df = pd.read_csv("bigdata.csv")
df
# Output: Millions of rows
selection = [i for i in range(0, 1000000) if i % 2 == 0]
subset = pd.read_csv("bigdata.csv", use_rows=selection) # skip all rows except those listed
subset
# Output: only even rows for the first million
~~~
@pylang : We now accept callable for usecols. Does that help to resolve this issue?
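For readers following along, a minimal sketch of excluding columns with a callable usecols, assuming a pandas version that accepts callables there. The data and the skipcols set are made up for illustration.

```python
from io import StringIO

import pandas as pd

data = "a,b,c\n1,2,3\n4,5,6\n"
# Columns to leave out; names missing from a given file do no harm.
skipcols = {"b", "not_in_this_file"}

# The callable is evaluated against each column name; True means keep it.
df = pd.read_csv(StringIO(data), usecols=lambda name: name not in skipcols)
print(df)
```

Because the callable is evaluated per file, the same skipcols set can be reused across files whose columns differ.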
sure, maybe an example of doing that in io.rst would be helpful?
@gfyoung I'm not sure what you have in mind. I am interested in selecting rows. An example would be helpful, thank you.
@pylang :
1) Your original issue was for skipcols though?
2) skiprows is currently not supported by the C engine. However, we could by all means allow skiprows to be a callable, like usecols is. How does that sound? Something like:
~~~python
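from io import StringIO
import pandas as pd

data = 'a,b,c\n1,2,3\n2,3,4'
# Skip odd-numbered lines (line 0 is the header), leaving only '2,3,4' as data.
pd.read_csv(StringIO(data), skiprows=lambda x: x % 2 == 1, engine='python')
#    a  b  c
# 0  2  3  4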
~~~
where x is the row number (starting at 0)
@jreback : There are examples in the docs to illustrate usecols, but we can also mention that we can use the callable to exclude columns as well. How does that sound?
yes that's what i mean, to show using a callable to skipcols
@gfyoung I think your example for skiprows would suffice. And yes you are correct re: skipcols. A similar callable option to filter usecols with an example in the docs would be sufficient imo.
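As a rough sketch of how a callable skiprows could stand in for the use_rows idea above, assuming a pandas version where skiprows accepts a callable. The names wanted and skip_unwanted are illustrative only. The callable receives the line number in the file, so line 0 is the header and data row i sits on line i + 1.

```python
from io import StringIO

import pandas as pd

# A small stand-in for a large file: header "x", then the values 0..9.
data = "x\n" + "\n".join(str(i) for i in range(10)) + "\n"

# Data rows to keep, 0-based and not counting the header.
wanted = set(range(0, 10, 2))


def skip_unwanted(line_no):
    # Always keep the header; skip any data line whose row is not wanted.
    return line_no != 0 and (line_no - 1) not in wanted


subset = pd.read_csv(StringIO(data), skiprows=skip_unwanted)
print(subset)
```

Using a set for wanted keeps each membership test cheap, which matters when the callable runs once per line of a large file.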
@pylang : #15059 is up to address skiprows. I've hit a roadblock at this point implementing it for the C engine, so any input on that would be appreciated!