Pandas: Series.str.decode() turns arrays of strings to NaN and fails on byte strings

Created on 5 Sep 2018 · 9 Comments · Source: pandas-dev/pandas

Hello,

this looks like a bug:

x = np.array(['x','y'])
pd.Series(x).str.decode(encoding='UTF-8',errors='strict')
0   NaN
1   NaN
dtype: float64

the line above is also used in pytables.py:

data = Series(data).str.decode(encoding, errors=errors).values

... and leads to an error when reading HDF files written with pandas < 0.23.
In some cases the text data was stored as byte strings (i.e. x = np.array([b'x', b'y'])),
which raises the following:

AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
Bug Strings

All 9 comments

@abrakababra : Indeed, that does look odd! Investigation and PR are welcome!

I finally did a little research: it is taking the "except route" here:

def _map(f, arr, na_mask=False, na_value=np.nan, dtype=object):
    if not len(arr):
        return np.ndarray(0, dtype=dtype)

    if isinstance(arr, ABCSeries):
        arr = arr.values
    if not isinstance(arr, np.ndarray):
        arr = np.asarray(arr, dtype=object)
    if na_mask:
        mask = isna(arr)
        try:
            convert = not all(mask)
            result = lib.map_infer_mask(arr, f, mask.view(np.uint8), convert)
        except (TypeError, AttributeError) as e:  # <--- this except is hit first
            # Reraise the exception if callable `f` got wrong number of args.
            # The user may want to be warned by this, instead of getting NaN
            if compat.PY2:
                p_err = r'takes (no|(exactly|at (least|most)) ?\d+) arguments?'
            else:
                p_err = (r'((takes)|(missing)) (?(2)from \d+ to )?\d+ '
                         r'(?(3)required )positional arguments?')

            if len(e.args) >= 1 and re.search(p_err, e.args[0]):
                raise e

            def g(x):
                try:
                    return f(x)
                except (TypeError, AttributeError):  # <--- then this one
                    return na_value

            return _map(g, arr, dtype=dtype)
        if na_value is not np.nan:
            np.putmask(result, mask, na_value)
            if result.dtype == object:
                result = lib.maybe_convert_objects(result)
        return result
    else:
        return lib.map_infer(arr, f)

The first except is triggered by lib.map_infer_mask, which in my installation is this file:
python3\lib\site-packages\pandas\_libs\lib.cp37-win_amd64.pyd
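
For anyone following the quoted code: the p_err pattern decides whether the first except re-raises or falls through to the NaN-producing wrapper g. Below is a minimal sketch of that check, assuming only the Python 3 pattern copied from the snippet above. The helper names (is_arity_error, capture, needs_one_arg) are made up for illustration and are not part of pandas.

```python
import re

# Python 3 branch of the pattern quoted in _map above.
P_ERR = (r'((takes)|(missing)) (?(2)from \d+ to )?\d+ '
         r'(?(3)required )positional arguments?')


def is_arity_error(exc):
    """Return True if the message looks like a wrong-argument-count error."""
    return len(exc.args) >= 1 and bool(re.search(P_ERR, str(exc.args[0])))


def needs_one_arg(x):
    """Hypothetical callable, only used to provoke errors."""
    return x


def capture(func, *args):
    """Call func(*args) and return the exception it raises, or None."""
    try:
        func(*args)
    except (TypeError, AttributeError) as exc:
        return exc
    return None


missing_arg = capture(needs_one_arg)        # too few arguments
extra_arg = capture(needs_one_arg, 1, 2)    # too many arguments
no_decode = capture(lambda s: s.decode("utf-8"), "x")  # str has no .decode

print(is_arity_error(missing_arg))
print(is_arity_error(extra_arg))
print(is_arity_error(no_decode))
```

Only the first two would be re-raised. The failed decode is not an argument-count problem, so it is swallowed and replaced by na_value, which is how the NaN values end up in the result.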

Hello,
is there any update on the issue?
Since i ran into the same problem it would be great if someone could help to solve it..

I would also like to register my hope for a solution to this :smiley:

Just trying to understand how updates to big packages like pandas work. For this to be fixed, I guess it wouldn't be as simple as replacing return na_value with return x, since _map is very general. It doesn't look like the fix would go in _map, right?

So should the fix go somewhere higher up like in Series.str.decode itself? Something like the answers in

https://stackoverflow.com/questions/57361169/how-to-convert-decode-a-pandas-series-of-mixed-bytes-strings-into-string-or-utf
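
As a rough illustration of what a fix "higher up" might look like for mixed data, here is a minimal per-element sketch. It is not taken from pandas or from the linked answers, and the helper name decode_if_bytes is made up for illustration.

```python
def decode_if_bytes(value, encoding="utf-8", errors="strict"):
    """Decode bytes-like values; return anything else (str, None, NaN) unchanged."""
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding, errors)
    return value


# Mixed input: bytes, an already-decoded str, a missing value, non-ASCII bytes.
mixed = [b"x", "y", None, b"caf\xc3\xa9"]
decoded = [decode_if_bytes(v) for v in mixed]
print(decoded)
```

On a Series the same helper could be applied with series.map(decode_if_bytes), so values that are already str pass through instead of becoming NaN.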

Any updates? Still getting it as of today with 0.25.3, have not tried 1.0.0 though

To the best of my knowledge, this issue has not been fixed yet (you should try out 1.0.0 or our master branch to confirm). However, we are definitely open to fixes for this!

I debugged this a bit. My use case is to open a HDF file with HDFStore(). The file was originally created in Python 2, and I'm opening in Python 3. There are some DataFrames stored with Series containing strings.

In pandas.core.strings, _map calls lib.map_infer_mask(arr, f, mask.view(np.uint8), convert). This call should convert from bytes to str, but it crashes with TypeError: utf_8_decode() argument 2 must be str or None, not numpy.bytes_. This is what triggers the except block @abrakababra mentioned.
The reason is that errors is b'strict' (bytes) and not the str 'strict', so argument validation fails.

The value for errors comes from HDFStore()._handle._v_attrs in the tables module, which does not convert the bytes attributes to str. This is a PyTables bug, if you want to call it that.
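
The failure mode can be reproduced without pandas or PyTables. This is a minimal sketch under the assumption that the stored attribute comes back as plain bytes. The ensure_str helper is hypothetical and only shows one way of normalising such a value before it is passed on.

```python
def ensure_str(value, encoding="utf-8"):
    """Return value as str, decoding it first if it arrived as bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


raw = b"x"
errors_attr = b"strict"  # stands in for an attribute read back as bytes

# Passing bytes as the errors argument is rejected during argument validation.
try:
    raw.decode("utf-8", errors_attr)
    failed = False
except TypeError:
    failed = True

# Normalising the argument first lets the decode go through.
decoded = raw.decode("utf-8", ensure_str(errors_attr))
print(failed, decoded)
```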

I worked around this in my project with:

import functools
import tables

def fix_attrset_init(orig_init):
    if str is bytes:
        return orig_init

    @functools.wraps(orig_init)
    def wrapper(self, node):
        orig_init(self, node)
        for name in self._v_attrnamesuser:
            value = self.__dict__.get(name)
            if isinstance(value, bytes):
                self.__dict__[name] = value.decode("utf-8")

    return wrapper


tables.attributeset.AttributeSet.__init__ = fix_attrset_init(
    tables.attributeset.AttributeSet.__init__
)

Regarding the OP's code snippet.

x = np.array(['x','y'])
pd.Series(x).str.decode(encoding='UTF-8',errors='strict')

This works in Python 2 because it decodes str objects to unicode objects.
In Python 3, str.decode does not exist; str IS the unicode representation of strings in memory. So internally StringMethods.decode throws an error because it tries to decode a str object, and the except block then produces a sequence of empty values.

This works, and in the end it is what PyTables and pandas see, a series of bytes being converted to str.

x = np.array([b'x',b'y'])
pd.Series(x).str.decode(encoding='UTF-8',errors='strict')
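
The difference between the two snippets above can be imitated without pandas. This is a simplified stand-in for the fallback in _map, not the real implementation, and the function names are made up for illustration.

```python
import math


def map_with_nan_fallback(func, values, na_value=float("nan")):
    """Apply func to each value; on TypeError/AttributeError use na_value instead."""
    out = []
    for value in values:
        try:
            out.append(func(value))
        except (TypeError, AttributeError):
            out.append(na_value)
    return out


def decode_utf8(value):
    """Works for bytes; raises AttributeError for str in Python 3."""
    return value.decode("utf-8")


from_str = map_with_nan_fallback(decode_utf8, ["x", "y"])
from_bytes = map_with_nan_fallback(decode_utf8, [b"x", b"y"])

print(all(math.isnan(v) for v in from_str))
print(from_bytes)
```

The str input silently turns into NaN values because the AttributeError is swallowed per element, while the bytes input decodes normally.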

The original bug is actually errors being bytes and not str.

I'm running into a similar error, but with a twist.
I have a DataFrame with a column that somehow holds strings of the form b'{filepath}' but is reported as float64 according to df.dtypes. I generated this list with list(dataset.as_numpy_iterator()), where dataset is a tf.data.Dataset.

I tried @joaoe's solution, as

images_df["Filenames"] = pd.Series(images_df["Filenames"]).str.decode(encoding='UTF-8',errors='strict')

It throws the following error trace:

AttributeError                            Traceback (most recent call last)
<ipython-input-88-e3f3b418ecda> in <module>
----> 1 class_df["Filename"] = pd.Series(class_df["Filename"]).str.decode(encoding='UTF-8',errors='strict')
      2 class_df.head(10)

/gpfs/loomis/project/dollar/ajj38/conda_envs/py37_torchcuda15/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5268             or name in self._accessors
   5269         ):
-> 5270             return object.__getattribute__(self, name)
   5271         else:
   5272             if self._info_axis._can_hold_identifiers_and_holds_name(name):

/gpfs/loomis/project/dollar/ajj38/conda_envs/py37_torchcuda15/lib/python3.7/site-packages/pandas/core/accessor.py in __get__(self, obj, cls)
    185             # we're accessing the attribute of the class, i.e., Dataset.geo
    186             return self._accessor
--> 187         accessor_obj = self._accessor(obj)
    188         # Replace the property with the accessor object. Inspired by:
    189         # http://www.pydanny.com/cached-property.html

/gpfs/loomis/project/dollar/ajj38/conda_envs/py37_torchcuda15/lib/python3.7/site-packages/pandas/core/strings.py in __init__(self, data)
   2039 
   2040     def __init__(self, data):
-> 2041         self._inferred_dtype = self._validate(data)
   2042         self._is_categorical = is_categorical_dtype(data)
   2043         self._is_string = data.dtype.name == "string"

/gpfs/loomis/project/dollar/ajj38/conda_envs/py37_torchcuda15/lib/python3.7/site-packages/pandas/core/strings.py in _validate(data)
   2096 
   2097         if inferred_dtype not in allowed_types:
-> 2098             raise AttributeError("Can only use .str accessor with string values!")
   2099         return inferred_dtype
   2100 

AttributeError: Can only use .str accessor with string values!

Any idea what to do? Specifically, how do I convert bytes that are somehow of type float64 into strings?

Any idea what to do? Specifically, how do I convert bytes that are somehow of type float64 into strings?

@auchtopus If you want to convert a series of bytes to str, this is simple enough:

class_df.Filename = class_df.Filename.map(lambda b: b.decode("utf-8"))