I find it highly valuable to also see how many NaN values are in my Series. Could value_counts() get an option, maybe include_nans=True, that would add a count for those to its output?
to be clear, you can already do this: series.isnull().sum()
The problem is that you'd end up with NaN in the resulting Index, which causes problems.
Yes, I know the workaround, but I would like to see it solved in value_counts itself, because it is a relevant piece of information within the scope of value_counts.
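For context, a minimal sketch of the behavior being discussed: value_counts() silently drops NaN, so the missing-value count has to be fetched with a separate isnull().sum() call (the series s is illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "a", np.nan, np.nan, np.nan])

# value_counts() drops NaN, so half the data vanishes from the overview:
counts = s.value_counts()
print(counts)  # a: 2, b: 1 -- no hint that three values are missing

# The NaN count requires a second, separate step:
n_missing = s.isnull().sum()
print(n_missing)  # 3
```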
I see the problem with NaN in the Index. How about converting the index to dtype 'object' just for this output? Then the NaN could be packed into a string.
this is trivial to implement, and the Index-with-NaN issues aren't that big. We wouldn't include NaN by default.
so we're down to a design decision. To me, NaN means missing value, so it doesn't make sense for it to show up as a counted value or in the mode.
I wouldn't include it by default either. But it is extremely helpful to see, in one overview, the rough ratio between successful measurements with values and NaNs.
In my case, I have 3 different categories plus 'not categorized'. When displayed like so:
'a' 1000
'b' 5000
'c' 4000
'nan' 100000
I immediately know that something went wrong, which I wouldn't suspect if I didn't see the NaNs. Without the NaN count, I first have to sum up all the real values and compare the total to the length of the series; it's always one more step. Sure, I can write my own wrapper, but I thought it would be useful to have this at least as an option to the value_counts call.
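The wrapper mentioned above could look something like this sketch. The function name value_counts_with_nan is hypothetical, and it sidesteps the NaN-in-Index problem by filing the missing count under the string label "NaN":

```python
import numpy as np
import pandas as pd

def value_counts_with_nan(series):
    # Hypothetical wrapper: append the missing-value count to the
    # normal value_counts() output under a string key, so the Index
    # never has to hold an actual NaN.
    counts = series.value_counts()
    n_missing = series.isnull().sum()
    if n_missing:
        counts.loc["NaN"] = n_missing
    return counts

s = pd.Series(["a", "b", "c"] * 2 + [np.nan] * 10)
print(value_counts_with_nan(s))
```

A large "NaN" row at the bottom immediately flags that something went wrong upstream, without the extra sum-and-compare step.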
What did you mean by "or for the mode"?
well, mode is really a special case of value_counts(), so if you add a kwarg for one it makes sense to use the same kwarg for the other.
I have never used 'mode' and don't understand what it does. In my case:

print df.marking.mode()
# 0    blotch
# dtype: object

print df.marking.value_counts()
# blotch         3854641
# fan            3192799
# None           2785831
# interesting     884843
# dtype: int64
I cheated by replacing NaNs with the string 'None'. ;)
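The trick described above can be done with fillna() before counting; a minimal sketch (the data is illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series(["blotch", "fan", np.nan, "fan", np.nan, np.nan])

# Replacing NaN with a sentinel string before counting keeps the
# missing values visible as a regular row in the output:
print(s.fillna("None").value_counts())
```

The downside is that the sentinel can collide with a real category, which is why a proper option on value_counts() itself would be cleaner.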
From my experience, it is very helpful for nan to show up as a counted value.
I work in a SAS shop, but I'm moving all of my analysis and reporting work from SAS to Python. I use value_counts to give me the results I would get from PROC FREQ in SAS. I use PROC FREQ daily, and almost always I'm looking at real-world data with missing values. I honestly cannot remember a case where I didn't want the missing values to be in the frequency counts.
I've got to believe I'm nowhere near the only person who needs to see frequency counts for nan values. I could see the lack of this feature slowing the adoption of pandas among SAS users.
SAS does not report missing values in frequency reports by default, but I'm OK with always selecting that option when I run PROC FREQ.
I do know how to add the missing counts to my value_counts output, but it's annoying to have to do it pretty much every time I use value_counts. If I -- and probably others -- need to do this every time, it seems like including NaN counts in the results is a reasonable option to add to the method.
+1 will fix along with #7424.
Something similar is in #7217. If NaN is a problem in index, this will also come up in Categoricals: https://github.com/jreback/pandas/commit/725a49795964456135fdc495572f7a32a40fd3ec