If one uses `series.value_counts(dropna=False)`, it includes NaN values, but `crosstab(series1, series2, dropna=False)` does not include NaN values. I think of `crosstab` as a kind of two-dimensional `value_counts()`, so this is surprising.
In [31]: x = pd.Series(['a', 'b', 'a', None, 'a'])
In [32]: y = pd.Series(['c', 'd', 'd', 'c', None])
In [33]: y.value_counts(dropna=False)
Out[33]:
d 2
c 2
NaN 1
dtype: int64
In [34]: pd.crosstab(x, y, dropna=False)
Out[34]:
c d
row_0
a 1 1
b 0 1
I believe what `crosstab(..., dropna=False)` really does is just include rows/columns that would otherwise have all zeros, but that doesn't seem the same as `dropna=False` to me. I would have expected to see a row and column entry for NaN, something like:
c d NaN
row_0
a 1 1 1
b 0 1 0
NaN 1 0 0
Thoughts on this?
This is on current master (almost 0.17).
Can you show what you would expect as output?
@jreback I updated the original description to include the expected output.
This further example (a slightly modified version of one from the crosstab docs) shows what `dropna` currently does:
In [69]: a = np.array(['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar', 'foo', 'foo', 'foo'])
In [70]: b = np.array(['one', 'one', 'one', 'two', 'one', 'one', 'one', 'two', 'one', 'one', 'one'])
In [71]: c = np.array(['dull', 'dull', 'shiny', 'dull', 'dull', 'shiny', 'shiny', 'dull', 'shiny', 'shiny', 'shiny'])
In [72]: pd.crosstab(a, [b, c], dropna=True)
Out[72]:
col_0 one two
col_1 dull shiny dull
row_0
bar 1 2 1
foo 2 4 1
In [73]: pd.crosstab(a, [b, c], dropna=False)
Out[73]:
one two
dull shiny dull shiny
row_0
bar 1 2 1 0
foo 2 4 1 0
I think we need a different name for this behavior. `drop_empty`? `drop_all_zero`? (Or perhaps `include_all_zero=False`.) Is there any naming precedent set elsewhere in the codebase? Or perhaps the behavior I am looking for should have the new name, like `drop_na_values`... though really I think the connection with the existing `value_counts(dropna=...)` behavior is fairly strong.
My R is very rusty, but it does support this feature, just for reference of how another system does it. (I actually don't know how to do in R what pandas currently does with `dropna`, where it includes all Cartesian-product variants of a multi-level index.) R even more explicitly matches the behavior between `value_counts()` and `crosstab()`, since you can use the same R function, `table`, for both.
> x <- c('a', 'b', 'a', NA, 'a')
> y <- c('c', 'd', 'd', 'c', NA)
> table(x)
x
a b
3 1
> # one way for including NA's
> table(x, exclude=NULL)
x
a b <NA>
3 1 1
> # another way for including NA's
> table(x, useNA='ifany')
x
a b <NA>
3 1 1
> table(x, y)
y
x c d
a 1 1
b 0 1
> table(x, y, useNA='ifany')
y
x c d <NA>
a 1 1 1
b 0 1 0
<NA> 1 0 0
I keep running into this. Hopefully I will have time to look at it soon, or it will get fixed as a result of some of the missing-value-handling issues that people are hoping to resolve with a change to pandas underlying data representations.
Here's a related issue (could perhaps be its own issue). Set up x and y as above, and then try and use crosstab with margins:
In [8]: pd.crosstab(x, y, margins=True)
Out[8]:
col_0 c d All
row_0
a 1 1 3
b 0 1 1
All 2 2 5
In [9]: # huh that's weird, the columns don't sum
In [10]: pd.crosstab(x, y, dropna=False, margins=True)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/home/andrew/miniconda3/envs/bayesian/lib/python3.5/site-packages/pandas/core/index.py in _get_level_number(self, level)
4417 raise ValueError('The name %s occurs multiple times, use a '
-> 4418 'level number' % level)
4419 level = self.names.index(level)
ValueError: The name None occurs multiple times, use a level number
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-10-b11080c1dcf8> in <module>()
----> 1 pd.crosstab(x, y, dropna=False, margins=True)
/home/andrew/miniconda3/envs/bayesian/lib/python3.5/site-packages/pandas/tools/pivot.py in crosstab(index, columns, values, rownames, colnames, aggfunc, margins, dropna)
426 df['__dummy__'] = 0
427 table = df.pivot_table('__dummy__', index=rownames, columns=colnames,
--> 428 aggfunc=len, margins=margins, dropna=dropna)
429 return table.fillna(0).astype(np.int64)
430 else:
/home/andrew/miniconda3/envs/bayesian/lib/python3.5/site-packages/pandas/tools/pivot.py in pivot_table(data, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name)
145 table = _add_margins(table, data, values, rows=index,
146 cols=columns, aggfunc=aggfunc,
--> 147 margins_name=margins_name)
148
149 # discard the top level
/home/andrew/miniconda3/envs/bayesian/lib/python3.5/site-packages/pandas/tools/pivot.py in _add_margins(table, data, values, rows, cols, aggfunc, margins_name)
175 if hasattr(table, 'columns'):
176 for level in table.columns.names[1:]:
--> 177 if margins_name in table.columns.get_level_values(level):
178 raise ValueError(exception_msg)
179
/home/andrew/miniconda3/envs/bayesian/lib/python3.5/site-packages/pandas/core/index.py in get_level_values(self, level)
4565 values : ndarray
4566 """
-> 4567 num = self._get_level_number(level)
4568 unique = self.levels[num] # .values
4569 labels = self.labels[num]
/home/andrew/miniconda3/envs/bayesian/lib/python3.5/site-packages/pandas/core/index.py in _get_level_number(self, level)
4420 except ValueError:
4421 if not isinstance(level, int):
-> 4422 raise KeyError('Level %s not found' % str(level))
4423 elif level < 0:
4424 level += self.nlevels
KeyError: 'Level None not found'
Agreed. @rosnfeld, if I follow right, you think the behavior should be that:
df = pd.DataFrame({'a': [np.nan, np.nan, np.nan], 'b': [3, 3, 4]})
pd.crosstab(df.a, df.b, dropna=False)
returns:
3 4
a
NaN 2 1
Not an empty dataframe?
Correct. That's similar to what R would do with `useNA='ifany'`:
> a <- c(NA, NA, NA)
> b <- c(3, 3, 4)
> table(a, b)
< table of extent 0 x 2 >
> table(a, b, useNA='ifany')
b
a 3 4
<NA> 2 1
It's also what Stata does if one uses the option `, missing`.
dropna : boolean, default True
    Do not include columns whose entries are all NaN
This only works when the whole column is NaN; if only some values are NaN, they cannot be counted.
This also strikes me as odd and inconsistent, and I'd love to see a fix. Meanwhile, the way I deal with it is by adding `.fillna(' NaN')` to the variables passed to `pd.crosstab()`. For instance, in the example above:
pd.crosstab(df.a.fillna(' NaN'), df.b.fillna(' NaN'), dropna=False)
Of course, you could omit the `.fillna(' NaN')` from the column with no null values. I add a space before NaN so that the NaN row and/or column appears first. Otherwise, it would be sorted alphabetically and probably end up in the middle of the rows/columns in the output (I guess you could prepend a z to put it as the last row/column, but that looks a little silly).
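Putting the workaround together as a runnable sketch (using the x and y series from the top of the thread):

```python
import pandas as pd

x = pd.Series(['a', 'b', 'a', None, 'a'])
y = pd.Series(['c', 'd', 'd', 'c', None])

# Replace missing values with a placeholder label before tabulating.
# The leading space in ' NaN' makes the placeholder sort first.
result = pd.crosstab(x.fillna(' NaN'), y.fillna(' NaN'))
print(result)
```

The resulting table has a ' NaN' row and column counting the missing pairings, which is close to R's `useNA='ifany'` output.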
I also would like to see this fixed.
@ceilingcat let us know if you need help starting a PR.
I agree this would make a nice improvement, especially since the `fillna` fix doesn't really work for categorical columns, because the fill value needs to be an available level.
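For categorical columns, one sketch of a workaround is to register the placeholder as a category before filling; the 'missing' label here is an arbitrary choice, not anything pandas-specific:

```python
import pandas as pd

s = pd.Series(['a', 'b', None, 'a'], dtype='category')

# A plain s.fillna('missing') raises, because 'missing' is not an
# existing category; register it as a level first.
s_filled = s.cat.add_categories(['missing']).fillna('missing')
print(s_filled.value_counts())
```

After this, the filled series can be passed to `pd.crosstab()` just like the object-dtype workaround above.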
Do I have just another example of the same problem? This is pandas 0.24. I converted some elements of `x` to missing values to create `y1`:
In [253]: x
Out[253]:
0 1
1 2
2 3
3 4
4 5
5 55
6 22
7 44
8 12
9 55
dtype: int64
In [254]: y1
Out[254]:
0 1.0
1 2.0
2 NaN
3 4.0
4 5.0
5 55.0
6 NaN
7 44.0
8 NaN
9 55.0
dtype: float64
In [255]: y1.value_counts(dropna=False)
Out[255]:
NaN 3
55.0 2
44.0 1
5.0 1
4.0 1
2.0 1
1.0 1
dtype: int64
In [256]: pd.crosstab(x, y1, dropna=False)
Out[256]:
col_0 1.0 2.0 4.0 5.0 44.0 55.0
row_0
1 1 0 0 0 0 0
2 0 1 0 0 0 0
4 0 0 1 0 0 0
5 0 0 0 1 0 0
44 0 0 0 0 1 0
55 0 0 0 0 0 2
I expect to see a column labeled NaN, with values:
     1.0  2.0  4.0  5.0  44.0  55.0  NaN
1      1    0    0    0     0     0    0
2      0    1    0    0     0     0    0
3      0    0    0    0     0     0    1
4      0    0    1    0     0     0    0
5      0    0    0    1     0     0    0
12     0    0    0    0     0     0    1
22     0    0    0    0     0     0    1
44     0    0    0    0     1     0    0
55     0    0    0    0     0     2    0
The current (0.24) pandas documentation for dropna should say "Do not include columns OR ROWS whose entries are all NaN", because that is what the current behavior actually seems to be: with the default dropna=True, rows/columns that are entirely empty are dropped.
With dropna=False, empty rows show up as all-zero entries in the columns, and empty columns show up as all-zero entries in the rows. The previous example shows that NaN is apparently not considered a possible row/column value even with dropna=False.
It seems reasonable to suggest that NaN be added to the set of possible values for both rows and columns of the dataframe created when a user sets dropna=False in crosstab. That would replicate the behavior of other statistical software (e.g., Stata's tab with the "missing" option). More importantly, using this option usually means you are interested in understanding the joint distribution of NaNs in the two series.
The current behavior of crosstab(y_variable, x_variable, dropna=False) appears to be equivalent to groupby('y_variable').agg({'x_variable': 'count'}) with a binary dummy x variable. To preserve this equivalence, it may be worthwhile to consider a new switch that adds NaN to the sets of tabulated values in a crosstab, rather than changing the behavior of dropna=False.
Is there a reasonably clean way to get the desired output within the context of Pandas as it is today?
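As a partial answer, one can wrap the fillna trick in a small helper; `crosstab_with_na` and the `'<NA>'` label are hypothetical names for illustration, not pandas API:

```python
import pandas as pd

def crosstab_with_na(a, b, na_label='<NA>'):
    # Hypothetical helper, not part of pandas: cast to object dtype so
    # any input dtype accepts the string placeholder, then fill and
    # tabulate. Mimics R's table(..., useNA='ifany').
    a = pd.Series(a).astype(object).fillna(na_label)
    b = pd.Series(b).astype(object).fillna(na_label)
    return pd.crosstab(a, b)

x = pd.Series(['a', 'b', 'a', None, 'a'])
y = pd.Series(['c', 'd', 'd', 'c', None])
print(crosstab_with_na(x, y))
```

This gives the NaN row and column counts without relying on `dropna`, though the placeholder label obviously no longer compares equal to real missing values downstream.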