I find it highly valuable to also see how many NaN values are in my Series. Could value_counts() get an option, maybe include_nans=True, that would add a count for those to its output?
to be clear, you can already do this: series.isnull().sum()
The problem is that you'd end up with NaN in the resulting Index, which causes problems.
Yes, I know the workaround, but I would like to see it solved in value_counts itself, because it is a relevant piece of information within the scope of value_counts.
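For context, a minimal sketch of the behavior being discussed: value_counts() silently drops NaN, so the missing-value count has to be fetched with a separate isnull().sum() call (the series s is illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "a", np.nan, np.nan, np.nan])

# value_counts() drops NaN, so half the data vanishes from the overview:
counts = s.value_counts()
print(counts)  # a: 2, b: 1 -- no hint that three values are missing

# The NaN count requires a second, separate step:
n_missing = s.isnull().sum()
print(n_missing)  # 3
```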
I see the problem with NaN in the Index. How about converting the index to dtype 'object' just for this output? Then the NaN could be packed into a string.
this is trivial to implement, and the Index-with-NaN issues aren't that big. We wouldn't include NaN by default.
so we're down to a design decision. To me, NaN means missing value, so it doesn't make sense for it to show up as a counted value or in the mode.
I wouldn't include it by default either. But it is extremely helpful to see, in one overview, the rough ratio between successful measurements with values and NaNs.
In my case, I have 3 different categories plus 'not categorized'. When displayed like so:
'a' 1000
'b' 5000
'c' 4000
'nan' 100000
I immediately know that something went wrong, which I wouldn't suspect if I didn't see the NaNs. Without the NaN count, I first have to sum up all the real values and compare the total to the length of the series; it's always one more step. Sure, I can write my own wrapper, but I thought it would be useful to have this at least as an option to the value_counts call.
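The wrapper mentioned above could look something like this sketch. The function name value_counts_with_nan is hypothetical, and it sidesteps the NaN-in-Index problem by filing the missing count under the string label "NaN":

```python
import numpy as np
import pandas as pd

def value_counts_with_nan(series):
    # Hypothetical wrapper: append the missing-value count to the
    # normal value_counts() output under a string key, so the Index
    # never has to hold an actual NaN.
    counts = series.value_counts()
    n_missing = series.isnull().sum()
    if n_missing:
        counts.loc["NaN"] = n_missing
    return counts

s = pd.Series(["a", "b", "c"] * 2 + [np.nan] * 10)
print(value_counts_with_nan(s))
```

A large "NaN" row at the bottom immediately flags that something went wrong upstream, without the extra sum-and-compare step.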
What did you mean by "or for the mode"?
well, mode is really a special case of value_counts(), so if you add a kwarg for one it makes sense to use the same kwarg for the other.
I have never used 'mode' and don't understand what it does. In my case:

print df.marking.mode()
# 0    blotch
# dtype: object

print df.marking.value_counts()
# blotch         3854641
# fan            3192799
# None           2785831
# interesting     884843
# dtype: int64
I cheated by replacing NaNs with the string 'None'. ;)
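The trick described above can be done with fillna() before counting; a minimal sketch (the data is illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series(["blotch", "fan", np.nan, "fan", np.nan, np.nan])

# Replacing NaN with a sentinel string before counting keeps the
# missing values visible as a regular row in the output:
print(s.fillna("None").value_counts())
```

The downside is that the sentinel can collide with a real category, which is why a proper option on value_counts() itself would be cleaner.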
From my experience, it is very helpful for nan to show up as a counted value.
I work in a SAS shop, but I'm moving all of my analysis and reporting work from SAS to Python. I use value_counts to give me the results I would get from PROC FREQ in SAS. I use PROC FREQ daily, and almost always I'm looking at real-world data with missing values. I honestly cannot remember a case where I didn't want the missing values to be in the frequency counts.
I've got to believe I'm nowhere near the only person who needs to see frequency counts for nan values. I could see the lack of this feature slowing the adoption of pandas among SAS users.
SAS does not report missing values in frequency reports by default, but I'm OK with always selecting that option when I run PROC FREQ.
I do know how to add the missing counts to my value_counts output, but it's annoying to have to do it pretty much every time I use value_counts. If I -- and probably others -- need to do this every time, it seems like including NaN counts in the results is a reasonable option to add to the method.
+1 will fix along with #7424.
Something similar is in #7217. If NaN is a problem in index, this will also come up in Categoricals: https://github.com/jreback/pandas/commit/725a49795964456135fdc495572f7a32a40fd3ec