Pandas: BUG: crosstab(dropna=False) does not work like value_counts(dropna=False)

Created on 9 Aug 2015 · 15 comments · Source: pandas-dev/pandas

If one uses series.value_counts(dropna=False), it includes NaN values, but crosstab(series1, series2, dropna=False) does not include NaN values. I think of crosstab as a kind of two-dimensional value_counts(), so this is surprising.

In [31]: x = pd.Series(['a', 'b', 'a', None, 'a'])

In [32]: y = pd.Series(['c', 'd', 'd', 'c', None])

In [33]: y.value_counts(dropna=False)
Out[33]: 
d      2
c      2
NaN    1
dtype: int64

In [34]: pd.crosstab(x, y, dropna=False)
Out[34]: 
       c  d
row_0      
a      1  1
b      0  1

I believe what crosstab(..., dropna=False) really does is just include rows/columns that would otherwise have all zeros, but that doesn't seem the same as dropna=False to me. I would have expected to see a row and column entry for NaN, something like:

       c  d  NaN
row_0      
a      1  1    1
b      0  1    0
NaN    1  0    0

Thoughts on this?

This is on current master (almost 0.17).

Labels: API Design, Reshaping


All 15 comments

Can you show what you would expect as output?

@jreback I updated the original description to include the expected output.

This further example (a slightly modified version of one from the crosstab docs) shows what dropna currently does:

In [69]: a = np.array(['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar', 'foo', 'foo', 'foo'])

In [70]: b = np.array(['one', 'one', 'one', 'two', 'one', 'one', 'one', 'two', 'one', 'one', 'one'])

In [71]: c = np.array(['dull', 'dull', 'shiny', 'dull', 'dull', 'shiny', 'shiny', 'dull', 'shiny', 'shiny', 'shiny'])

In [72]: pd.crosstab(a, [b, c], dropna=True)
Out[72]: 
col_0  one        two
col_1 dull shiny dull
row_0                
bar      1     2    1
foo      2     4    1

In [73]: pd.crosstab(a, [b, c], dropna=False)
Out[73]: 
       one        two      
      dull shiny dull shiny
row_0                      
bar      1     2    1     0
foo      2     4    1     0

I think we need a different name for this behavior. drop_empty? drop_all_zero? (or perhaps include_all_zero=False) Is there any naming precedent set elsewhere in the codebase?

Or perhaps the behavior I am looking for should have the new name, like drop_na_values... though really I think the connection with the existing value_counts(dropna=...) behavior is fairly strong.

My R is very rusty, but it does support this feature, just as a reference for how another system handles it. (I actually don't know how to do in R what pandas currently does with dropna, where it includes all Cartesian-product variants of a multi-level index.) R even more explicitly matches the behavior between value_counts() and crosstab(), since the same R function, table, is used for both.

> x <- c('a', 'b', 'a', NA, 'a')
> y <- c('c', 'd', 'd', 'c', NA)
> table(x)
x
a b 
3 1 
> # one way for including NA's
> table(x, exclude=NULL)
x
   a    b <NA> 
   3    1    1 
> # another way for including NA's
> table(x, useNA='ifany')
x
   a    b <NA> 
   3    1    1 
> table(x, y)
   y
x   c d
  a 1 1
  b 0 1
> table(x, y, useNA='ifany')
      y
x      c d <NA>
  a    1 1    1
  b    0 1    0
  <NA> 1 0    0

I keep running into this. Hopefully I will have time to look at it soon, or it will get fixed as a result of some of the missing-value-handling issues that people are hoping to resolve with a change to pandas' underlying data representations.

Here's a related issue (could perhaps be its own issue). Set up x and y as above, and then try and use crosstab with margins:

In [8]: pd.crosstab(x, y, margins=True)
Out[8]: 
col_0  c  d  All
row_0           
a      1  1    3
b      0  1    1
All    2  2    5

In [9]: # huh that's weird, the columns don't sum

In [10]: pd.crosstab(x, y, dropna=False, margins=True)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/home/andrew/miniconda3/envs/bayesian/lib/python3.5/site-packages/pandas/core/index.py in _get_level_number(self, level)
   4417                 raise ValueError('The name %s occurs multiple times, use a '
-> 4418                                  'level number' % level)
   4419             level = self.names.index(level)

ValueError: The name None occurs multiple times, use a level number

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-10-b11080c1dcf8> in <module>()
----> 1 pd.crosstab(x, y, dropna=False, margins=True)

/home/andrew/miniconda3/envs/bayesian/lib/python3.5/site-packages/pandas/tools/pivot.py in crosstab(index, columns, values, rownames, colnames, aggfunc, margins, dropna)
    426         df['__dummy__'] = 0
    427         table = df.pivot_table('__dummy__', index=rownames, columns=colnames,
--> 428                                aggfunc=len, margins=margins, dropna=dropna)
    429         return table.fillna(0).astype(np.int64)
    430     else:

/home/andrew/miniconda3/envs/bayesian/lib/python3.5/site-packages/pandas/tools/pivot.py in pivot_table(data, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name)
    145         table = _add_margins(table, data, values, rows=index,
    146                              cols=columns, aggfunc=aggfunc,
--> 147                              margins_name=margins_name)
    148 
    149     # discard the top level

/home/andrew/miniconda3/envs/bayesian/lib/python3.5/site-packages/pandas/tools/pivot.py in _add_margins(table, data, values, rows, cols, aggfunc, margins_name)
    175     if hasattr(table, 'columns'):
    176         for level in table.columns.names[1:]:
--> 177             if margins_name in table.columns.get_level_values(level):
    178                 raise ValueError(exception_msg)
    179 

/home/andrew/miniconda3/envs/bayesian/lib/python3.5/site-packages/pandas/core/index.py in get_level_values(self, level)
   4565         values : ndarray
   4566         """
-> 4567         num = self._get_level_number(level)
   4568         unique = self.levels[num]  # .values
   4569         labels = self.labels[num]

/home/andrew/miniconda3/envs/bayesian/lib/python3.5/site-packages/pandas/core/index.py in _get_level_number(self, level)
   4420         except ValueError:
   4421             if not isinstance(level, int):
-> 4422                 raise KeyError('Level %s not found' % str(level))
   4423             elif level < 0:
   4424                 level += self.nlevels

KeyError: 'Level None not found'

Agreed. @rosnfeld, if I follow right, you think the behavior should be that:

df = pd.DataFrame({'a': [np.nan, np.nan, np.nan], 'b': [3, 3, 4]})
pd.crosstab(df.a, df.b, dropna=False)

returns:

b     3  4
a         
NaN   2  1

Not an empty dataframe?

Correct. That's similar to what R would do with useNA='ifany':

> a <- c(NA, NA, NA)
> b <- c(3, 3, 4)
> table(a, b)
< table of extent 0 x 2 >
> table(a, b, useNA='ifany')
      b
a      3 4
  <NA> 2 1

It's also what Stata does if one uses the option ", missing"

dropna : boolean, default True
Do not include columns whose entries are all NaN

This only works when the whole column is NaN; if only some values are NaN, they are not counted.

This also strikes me as odd and inconsistent, and I'd love to see a fix. Meanwhile, the way I deal with it is by adding .fillna(' NaN') to the variables passed to pd.crosstab(). For instance, in the example above:

pd.crosstab(df.a.fillna(' NaN'), df.b.fillna(' NaN'), dropna=False)

Of course, you could omit the .fillna(' NaN') from the column with no null values. I add a space before NaN so that the NaN row and/or column appear first; otherwise they would be ordered alphabetically and probably end up in the middle of the rows/columns in the output (I guess you could prefix it with a z to put it as the last row/column, but that looks a little silly).
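To make this concrete, here is a minimal, runnable sketch of that workaround applied to the x and y series from the top of this issue (the ' NaN' sentinel is just an arbitrary placeholder string, nothing pandas-specific):

import pandas as pd

x = pd.Series(['a', 'b', 'a', None, 'a'])
y = pd.Series(['c', 'd', 'd', 'c', None])

# Replace missing values with a sentinel label before tabulating;
# the leading space makes ' NaN' sort before the real labels.
pd.crosstab(x.fillna(' NaN'), y.fillna(' NaN'), dropna=False)

This reproduces the table with a NaN row and column sketched in the original post, except that the label is the literal string ' NaN' (sorted first) rather than a true missing value.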

I also would like to see this fixed.

@ceilingcat let us know if you need help starting a PR.

I agree this would make a nice improvement, especially since the fillna fix doesn't really work for categorical columns, where the fill value needs to be an existing category.
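For categorical columns, a sketch of a workaround (assuming a visible sentinel label is acceptable) is to register the sentinel as a category first, since fillna on a categorical Series only accepts values that are already categories:

import pandas as pd

s = pd.Series(pd.Categorical(['a', 'b', None, 'a']))
t = pd.Series(pd.Categorical(['c', None, 'd', 'c']))

# Add the sentinel as an explicit category, then fill the missing values.
s_filled = s.cat.add_categories(['NaN']).fillna('NaN')
t_filled = t.cat.add_categories(['NaN']).fillna('NaN')

pd.crosstab(s_filled, t_filled, dropna=False)

With dropna=False the full set of categories is kept, so the NaN row and column show up even for combinations that never occur.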

Is this just another example of the same problem? This is on pandas 0.24.

I converted some elements of x to missing values to produce y1:

In [253]: x
Out[253]: 
0     1
1     2
2     3
3     4
4     5
5    55
6    22
7    44
8    12
9    55
dtype: int64

In [254]: y1
Out[254]: 
0     1.0
1     2.0
2     NaN
3     4.0
4     5.0
5    55.0
6     NaN
7    44.0
8     NaN
9    55.0
dtype: float64

In [255]: y1.value_counts(dropna=False)
Out[255]: 
NaN     3
55.0    2
44.0    1
5.0     1
4.0     1
2.0     1
1.0     1
dtype: int64

In [256]: pd.crosstab(x, y1, dropna=False)
Out[256]: 
col_0  1.0   2.0   4.0   5.0   44.0  55.0
row_0                                    
1         1     0     0     0     0     0
2         0     1     0     0     0     0
4         0     0     1     0     0     0
5         0     0     0     1     0     0
44        0     0     0     0     1     0
55        0     0     0     0     0     2

I expect to see a column labeled "NaN", with values like:

col_0  1.0   2.0   4.0   5.0   44.0  55.0  NaN
row_0                                         
1         1     0     0     0     0     0    0
2         0     1     0     0     0     0    0
3         0     0     0     0     0     0    1
4         0     0     1     0     0     0    0
5         0     0     0     1     0     0    0
12        0     0     0     0     0     0    1
22        0     0     0     0     0     0    1
44        0     0     0     0     1     0    0
55        0     0     0     0     0     2    0

The current (0.24) pandas documentation should say dropna: "Do not include columns OR ROWS whose entries are all NaN", because that is what the current behavior actually seems to be: with the default dropna=True, rows/columns that are entirely empty are dropped.

With dropna=False, empty rows and columns show up as all-zero entries. The previous example shows that NaN is apparently not considered a possible row/column value even with dropna=False.

It seems reasonable to suggest that NaN be added to the set of possible values for both rows and columns in the dataframe created when a user sets dropna=False in crosstab. That would replicate the behavior of other statistical software (e.g., Stata when tab is used with the "missing" option). More importantly, using this option necessarily means you are interested in understanding the joint distribution of NaNs in the two series.

Current behavior of crosstab('y_variable','x_variable',dropna=False) appears to be equivalent to groupby('y_variable').agg({'x_variable':'count'}) with a binary dummy x variable.

To preserve this equivalence, it may be worthwhile to instead consider a switch adding NaN to the sets of tabulated values in a crosstab rather than changing behavior with dropna=False.

Is there a reasonably clean way to get the desired output within the context of Pandas as it is today?
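One possible answer, pulling together the fillna workaround described earlier in the thread, is a small wrapper along these lines (the crosstab_with_na helper is hypothetical, not a pandas API):

import pandas as pd

def crosstab_with_na(index, columns, na_label='NaN', **kwargs):
    # Hypothetical helper: replace missing values in both inputs with a
    # visible label before calling pd.crosstab, so NaN gets its own
    # row/column. fillna with a string upcasts numeric inputs to object;
    # categorical inputs need the label added as a category first.
    return pd.crosstab(index.fillna(na_label), columns.fillna(na_label), **kwargs)

For example, crosstab_with_na(x, y1, dropna=False) should give something close to the table with the NaN column sketched above, with the string 'NaN' standing in for the missing values.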
