Seaborn: pairplot seems to misinterpret strings such as '0' and '1' as integers

Created on 13 Dec 2017 · 13 Comments · Source: mwaskom/seaborn

After casting a pandas column from integers to strings, pairplot still seems to interpret the column as containing numerical values.

I am currently using:
- pandas: 0.21.1
- numpy: 1.13.3
- seaborn: 0.8.1
- matplotlib version: 2.1.1
- matplotlib backend: module://ipykernel.pylab.backend_inline
- system: sys.version_info(major=2, minor=7, micro=12, releaselevel='final', serial=0)

For example, generate some data:
import numpy as np
import pandas as pd
import seaborn as sns

test = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])

Threshold the first column:
test['A'] = test['A'].apply(lambda x: 0 if x < 0 else 1)

and cast:

test['A'] = test['A'].astype(str)

Alternatively, this also produces the same error later:

test['A'] = test['A'].apply(str)

Calling pairplot with the A column as hue:
_ = sns.pairplot(test, hue="A")
results in:
ValueError: setting an array element with a sequence

Defining and applying a map solves this issue:
custom_map = {'0': 'zero', '1': 'one'}

test['A'] = test['A'].map(custom_map)

Full example in a jupyter notebook:

All 13 comments

Can you show the full stacktrace of the error you get? I can't reproduce the ValueError on a slightly different version of matplotlib.

And in the meantime you should do

sns.pairplot(test, hue="A", vars=["B", "C", "D"])

This works, thanks!

sns.pairplot(test, hue="A", vars=["B", "C", "D"])

I updated the gist with error outputs. The stacktrace for

_ = sns.pairplot(test, hue="A")
is:

````python

ValueError Traceback (most recent call last)
in ()
1 # Gives an error
----> 2 _ = sns.pairplot(test, hue="A")

/home/diephuis/virtualenv/tf/local/lib/python2.7/site-packages/seaborn/axisgrid.pyc in pairplot(data, hue, hue_order, palette, vars, x_vars, y_vars, kind, diag_kind, markers, size, aspect, dropna, plot_kws, diag_kws, grid_kws)
2058 if grid.square_grid:
2059 if diag_kind == "hist":
-> 2060 grid.map_diag(plt.hist, **diag_kws)
2061 elif diag_kind == "kde":
2062 diag_kws["legend"] = False

/home/diephuis/virtualenv/tf/local/lib/python2.7/site-packages/seaborn/axisgrid.pyc in map_diag(self, func, **kwargs)
1363 func(vals, color=color, **kwargs)
1364 else:
-> 1365 func(vals, color=color, histtype="barstacked", **kwargs)
1366
1367 else:

/home/diephuis/virtualenv/tf/local/lib/python2.7/site-packages/matplotlib/pyplot.pyc in hist(x, bins, range, density, weights, cumulative, bottom, histtype, align, orientation, rwidth, log, color, label, stacked, normed, hold, data, **kwargs)
3023 histtype=histtype, align=align, orientation=orientation,
3024 rwidth=rwidth, log=log, color=color, label=label,
-> 3025 stacked=stacked, normed=normed, data=data, **kwargs)
3026 finally:
3027 ax._hold = washold

/home/diephuis/virtualenv/tf/local/lib/python2.7/site-packages/matplotlib/__init__.pyc in inner(ax, *args, **kwargs)
1715 warnings.warn(msg % (label_namer, func.__name__),
1716 RuntimeWarning, stacklevel=2)
-> 1717 return func(ax, *args, **kwargs)
1718 pre_doc = inner.__doc__
1719 if pre_doc is None:

/home/diephuis/virtualenv/tf/local/lib/python2.7/site-packages/matplotlib/axes/_axes.pyc in hist(failed resolving arguments)
6096
6097 # process the unit information
-> 6098 self._process_unit_info(xdata=x, kwargs=kwargs)
6099 x = self.convert_xunits(x)
6100 if bin_range is not None:

/home/diephuis/virtualenv/tf/local/lib/python2.7/site-packages/matplotlib/axes/_base.pyc in _process_unit_info(self, xdata, ydata, kwargs)
1987 # we only need to update if there is nothing set yet.
1988 if not self.xaxis.have_units():
-> 1989 self.xaxis.update_units(xdata)
1990
1991 if ydata is not None:

/home/diephuis/virtualenv/tf/local/lib/python2.7/site-packages/matplotlib/axis.pyc in update_units(self, data)
1436 neednew = self.converter != converter
1437 self.converter = converter
-> 1438 default = self.converter.default_units(data, self)
1439 if default is not None and self.units is None:
1440 self.set_units(default)

/home/diephuis/virtualenv/tf/local/lib/python2.7/site-packages/matplotlib/category.pyc in default_units(data, axis)
70 # default_units->axis_info->convert
71 if axis.unit_data is None:
---> 72 axis.unit_data = UnitData(data)
73 else:
74 axis.unit_data.update(data)

/home/diephuis/virtualenv/tf/local/lib/python2.7/site-packages/matplotlib/category.pyc in __init__(self, data)
102 """
103 self.seq, self.locs = [], []
--> 104 self._set_seq_locs(data, 0)
105
106 def update(self, new_data):

/home/diephuis/virtualenv/tf/local/lib/python2.7/site-packages/matplotlib/category.pyc in _set_seq_locs(self, data, value)
110
111 def _set_seq_locs(self, data, value):
--> 112 strdata = shim_array(data)
113 new_s = [d for d in np.unique(strdata) if d not in self.seq]
114 for ns in new_s:

/home/diephuis/virtualenv/tf/local/lib/python2.7/site-packages/matplotlib/category.pyc in shim_array(data)
19 if LooseVersion(np.__version__) >= LooseVersion('1.8.0'):
20 def shim_array(data):
---> 21 return np.array(data, dtype=np.unicode)
22 else:
23 def shim_array(data):

ValueError: setting an array element with a sequence


AttributeError Traceback (most recent call last)
/home/diephuis/virtualenv/tf/local/lib/python2.7/site-packages/IPython/core/formatters.pyc in __call__(self, obj)
332 pass
333 else:
--> 334 return printer(obj)
335 # Finally look for special method names
336 method = get_real_method(obj, self.print_method)

/home/diephuis/virtualenv/tf/local/lib/python2.7/site-packages/IPython/core/pylabtools.pyc in (fig)
239
240 if 'png' in formats:
--> 241 png_formatter.for_type(Figure, lambda fig: print_figure(fig, 'png', **kwargs))
242 if 'retina' in formats or 'png2x' in formats:
243 png_formatter.for_type(Figure, lambda fig: retina_figure(fig, **kwargs))

/home/diephuis/virtualenv/tf/local/lib/python2.7/site-packages/IPython/core/pylabtools.pyc in print_figure(fig, fmt, bbox_inches, **kwargs)
123
124 bytes_io = BytesIO()
--> 125 fig.canvas.print_figure(bytes_io, **kw)
126 data = bytes_io.getvalue()
127 if fmt == 'svg':

/home/diephuis/virtualenv/tf/local/lib/python2.7/site-packages/matplotlib/backend_bases.pyc in print_figure(self, filename, dpi, facecolor, edgecolor, orientation, format, **kwargs)
2216 **kwargs)
2217 renderer = self.figure._cachedRenderer
-> 2218 bbox_inches = self.figure.get_tightbbox(renderer)
2219
2220 bbox_artists = kwargs.pop("bbox_extra_artists", None)

/home/diephuis/virtualenv/tf/local/lib/python2.7/site-packages/matplotlib/figure.pyc in get_tightbbox(self, renderer)
1981 for ax in self.axes:
1982 if ax.get_visible():
-> 1983 bb.append(ax.get_tightbbox(renderer))
1984
1985 if len(bb) == 0:

/home/diephuis/virtualenv/tf/local/lib/python2.7/site-packages/matplotlib/axes/_base.pyc in get_tightbbox(self, renderer, call_axes_locator)
4007
4008 if self.title.get_visible():
-> 4009 bb.append(self.title.get_window_extent(renderer))
4010 if self._left_title.get_visible():
4011 bb.append(self._left_title.get_window_extent(renderer))

/home/diephuis/virtualenv/tf/local/lib/python2.7/site-packages/matplotlib/text.pyc in get_window_extent(self, renderer, dpi)
923 self.figure.dpi = dpi
924 if self.get_text() == '':
--> 925 tx, ty = self._get_xy_display()
926 return Bbox.from_bounds(tx, ty, 0, 0)
927

/home/diephuis/virtualenv/tf/local/lib/python2.7/site-packages/matplotlib/text.pyc in _get_xy_display(self)
232 def _get_xy_display(self):
233 'get the (possibly unit converted) transformed x, y in display coords'
--> 234 x, y = self.get_unitless_position()
235 return self.get_transform().transform_point((x, y))
236

/home/diephuis/virtualenv/tf/local/lib/python2.7/site-packages/matplotlib/text.pyc in get_unitless_position(self)
853 # This will get the position with all unit information stripped away.
854 # This is here for convienience since it is done in several locations.
--> 855 x = float(self.convert_xunits(self._x))
856 y = float(self.convert_yunits(self._y))
857 return x, y

/home/diephuis/virtualenv/tf/local/lib/python2.7/site-packages/matplotlib/artist.pyc in convert_xunits(self, x)
189 if ax is None or ax.xaxis is None:
190 return x
--> 191 return ax.xaxis.convert_units(x)
192
193 def convert_yunits(self, y):

/home/diephuis/virtualenv/tf/local/lib/python2.7/site-packages/matplotlib/axis.pyc in convert_units(self, x)
1489 return x
1490
-> 1491 ret = self.converter.convert(x, self.units, self)
1492 return ret
1493

/home/diephuis/virtualenv/tf/local/lib/python2.7/site-packages/matplotlib/category.pyc in convert(value, unit, axis)
47 if isinstance(val, six.string_types):
48 axis.unit_data.update(val)
---> 49 vmap = dict(zip(axis.unit_data.seq, axis.unit_data.locs))
50
51 if isinstance(value, six.string_types):

AttributeError: 'NoneType' object has no attribute 'seq'
````

This is very confusing. Here are a few observations.

One explanation for why you're surprised is that seaborn isn't selecting columns to show in pairplot by checking the dtype, because it's not uncommon in pandas to end up with a basically numeric column that has an object dtype (e.g. if you are mixing ints and nans). Instead, it just asks if the data can be cast to float without raising an error, which is the case if you have all numbers encoded as strings.
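The duck-typing check described above can be sketched roughly as follows. This is an illustration under assumptions, not seaborn's actual implementation: `looks_numeric` is a hypothetical helper, and the column names are made up for the example.

```python
import pandas as pd


def looks_numeric(series):
    """Hypothetical helper: True if every value can be cast to float."""
    try:
        series.astype(float)
        return True
    except (TypeError, ValueError):
        return False


df = pd.DataFrame({
    "A": ["0", "1", "1"],         # numbers encoded as strings
    "B": [0.5, -1.2, 3.3],        # ordinary floats
    "C": ["zero", "one", "one"],  # genuine labels
})

# Columns that pass the cast test are treated as plottable variables,
# regardless of their declared dtype.
numeric_like = [col for col in df.columns if looks_numeric(df[col])]
print(numeric_like)  # ['A', 'B']
```

Column "A" passes the test even though it holds strings, which is why a string-typed hue column of '0' and '1' still ends up among the plotted variables.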

But then why is it raising an error? The main issue seems to be happening inside the new matplotlib categorical machinery. I think it should probably be considered a bug in matplotlib because there is inconsistency in whether hist can draw multiple datasets if they're numeric or categorical:

f, axes = plt.subplots(1, 4)
axes[0].hist([0, 0, 1])  # Works
axes[1].hist(["0", "0", "1"])  # Works
axes[2].hist([[0, 1, 0], [1, 0]])   # Works
axes[3].hist([["0", "1", "0"], ["1", "0"]])   # Errors

Arguably what seaborn should do is to check if columns can be converted to float and then use the float representation rather than using the original data. I guess I didn't anticipate people having string-typed numeric data. But now that matplotlib plots can handle categorical data, any type should actually plot fine. I haven't really decided in general what to do with the new matplotlib features (they're quite fresh).
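One way the "use the float representation" idea could look is sketched below. `coerce_float_like` is a hypothetical helper written for this illustration; it is not part of seaborn.

```python
import pandas as pd


def coerce_float_like(data):
    """Hypothetical helper: copy `data`, converting float-castable columns to float."""
    out = data.copy()
    for col in out.columns:
        try:
            out[col] = out[col].astype(float)
        except (TypeError, ValueError):
            pass  # leave genuinely non-numeric columns untouched
    return out


test = pd.DataFrame({"A": ["0", "1", "1"], "C": ["zero", "one", "one"]})
converted = coerce_float_like(test)
print(converted["A"].tolist())  # [0.0, 1.0, 1.0]
print(converted["C"].tolist())  # ['zero', 'one', 'one']
```

Plotting from `converted` rather than from the original frame would hand matplotlib real floats, so the categorical code path would not be triggered for number-like strings.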

Thanks for looking into this! I will leave it up to you whether or not to submit an issue to Matplotlib.

@mwaskom, would it be possible for the hue kwarg to consume its column/Series so that it's not displayed? Alternatively, is there a syntax that could be used to invert the vars kwarg and exclude Series from being displayed?

I ran into the same problem as @mdiephuis when using integer cluster labels from scikit-learn to set the hue in pairplot. If I leave the label Series as integers then pairplot displays it, which I don't want. Otherwise I have to map the integers. In my use case I have a MultiIndexed DataFrame so using vars requires typing the (level0, level1,...) column names. Not onerous but still a workaround.

If the issue were merely matplotlib's treatment of integer categorical variables, then a user could slice the DataFrame to exclude them before plotting. The problem arises because I need the integer categoricals to set the hue.
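Until something like an inverted vars option exists, the list of variables can be built programmatically instead of typed out. The sketch below assumes a made-up frame with an integer `cluster` column standing in for the scikit-learn labels; the pairplot call is left as a comment because it needs seaborn imported as `sns`.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randn(10, 3), columns=["B", "C", "D"])
df["cluster"] = rng.randint(0, 2, size=10)  # integer labels, e.g. from clustering

hue_col = "cluster"
# Every column except the hue column becomes a plotted variable.
plot_vars = [col for col in df.columns if col != hue_col]
print(plot_vars)  # ['B', 'C', 'D']

# With seaborn imported as sns:
# sns.pairplot(df, hue=hue_col, vars=plot_vars)
```

The same comprehension works when the column labels are tuples from a MultiIndex, since it only compares labels for equality.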

I tried making my hue variable both a string and a categorical dtype, and it still didn't work. The behaviour of Seaborn is not very intuitive. And I don't see why anyone would ever want to include the hue variable as one of the variables being compared.

With matplotlib 3.1.0 and seaborn 0.9.0, the sample code by @mdiephuis runs and plots without any errors. The same holds for the example by @mwaskom showing the ax.hist inconsistencies.

It seems that the rationale for duck-typing (pandas could not mix integers and NaNs) has been addressed recently:

https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support

This means that it would probably be safe to use the explicit dtype rather than duck-typing, which is the source of many issues in this tracker, and ask users with mixed integer-NaN tables to use the official Pandas way to handle that.
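For readers unfamiliar with the feature linked above, here is a minimal sketch of the nullable integer dtype, assuming pandas 0.24 or newer. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Default behaviour: a missing value forces the integers to float.
as_float = pd.Series([1, 2, np.nan])
print(as_float.dtype)  # float64

# Nullable extension dtype (note the capital "I"): integers and
# missing values coexist without changing the declared dtype.
as_int = pd.Series([1, 2, np.nan], dtype="Int64")
print(as_int.dtype)         # Int64
print(as_int.isna().sum())  # 1
```

With the dtype preserved like this, a library could in principle trust the declared dtype instead of probing whether values can be cast to float.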

Hi @mwaskom, maybe this wasn't obvious in my post, but the change in Pandas that I mentioned does not automatically make this issue go away; rather, it offers a simple possible fix.

As you mentioned above:

> it's not uncommon in pandas to end up with a basically numeric column that has an object dtype (e.g. if you are mixing ints and nans). Instead, it just asks if the data can be cast to float without raising an error, which is the case if you have all numbers encoded as strings.

So removing these implicit casts, and relying on Pandas's new "integer NaN" support to encode NaNs, means that you can maintain the dtype all the time, and this will result in less user confusion (and fewer subtle bugs when strings suddenly contain number-like text).

Yes, but: seaborn has to maintain compatibility with old versions of pandas. Even though the dtype declaration is on the user side, seaborn still needs to be able to check types in a sane way. Numpy does not understand the new pandas types, and as far as I can tell, pandas support for checking “numeric” types that can generalize to their new nan-aware types was introduced in 0.20, which is newer than the minimal version seaborn supports. The pandas code is pretty complicated, and I don’t want to replicate it inside seaborn. So I’m excited about what pandas is doing (though note that it is still described as “experimental” on their end), but seaborn can’t take advantage of it quite yet.
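As a rough illustration of the dtype-based check discussed here, pandas exposes `pandas.api.types.is_numeric_dtype`. The sketch below assumes a pandas version recent enough to provide both that function and the nullable `Int64` dtype; the frame and its column names are invented for the example.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

df = pd.DataFrame({
    "ints": [1, 2, 3],
    "nullable_ints": pd.array([1, None, 3], dtype="Int64"),
    "digit_strings": ["0", "1", "1"],
})

# The check looks only at the declared dtype, never at the values.
numeric_cols = [col for col in df.columns if is_numeric_dtype(df[col])]
print(numeric_cols)  # ['ints', 'nullable_ints']
```

Unlike the cast-to-float test, this leaves the string column of digits out, which matches what the user declared.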

By the way, I closed this because the specific issue in the opening post is now resolved. The questions that were raised about how to interpret variables are not unimportant, but it’s confusing to have open issues where the interesting information arises mid-thread.

You're right, and I understand why you closed it! Backwards compatibility is not something to take lightly. I wish I could suggest a better solution to the dtypes problem, but I couldn't come up with one, other than conversions being configurable somehow. Thanks for replying!
