Hello,
this looks like a bug:
x = np.array(['x','y'])
pd.Series(x).str.decode(encoding='UTF-8',errors='strict')
0 NaN
1 NaN
dtype: float64
the line above is also used in pytables.py:
data = Series(data).str.decode(encoding, errors=errors).values
... and leads to an error when reading HDF files written with pandas < 0.23.
In some cases the text data was stored as byte strings (i.e. x = np.array([b'x', b'y'])),
which raises the following:
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
@abrakababra : Indeed, that does look odd! Investigation and PR are welcome!
I finally did a little research; it's taking the "except route" here:
def _map(f, arr, na_mask=False, na_value=np.nan, dtype=object):
    if not len(arr):
        return np.ndarray(0, dtype=dtype)
    if isinstance(arr, ABCSeries):
        arr = arr.values
    if not isinstance(arr, np.ndarray):
        arr = np.asarray(arr, dtype=object)
    if na_mask:
        mask = isna(arr)
        try:
            convert = not all(mask)
            result = lib.map_infer_mask(arr, f, mask.view(np.uint8), convert)
        except (TypeError, AttributeError) as e:  # <--- taken here
            # Reraise the exception if callable `f` got wrong number of args.
            # The user may want to be warned by this, instead of getting NaN
            if compat.PY2:
                p_err = r'takes (no|(exactly|at (least|most)) ?\d+) arguments?'
            else:
                p_err = (r'((takes)|(missing)) (?(2)from \d+ to )?\d+ '
                         r'(?(3)required )positional arguments?')
            if len(e.args) >= 1 and re.search(p_err, e.args[0]):
                raise e
            def g(x):
                try:
                    return f(x)
                except (TypeError, AttributeError):  # <--- and here
                    return na_value
            return _map(g, arr, dtype=dtype)
        if na_value is not np.nan:
            np.putmask(result, mask, na_value)
            if result.dtype == object:
                result = lib.maybe_convert_objects(result)
        return result
    else:
        return lib.map_infer(arr, f)
The first except is triggered by lib.map_infer_mask, which in my installation is this file:
python3\lib\site-packages\pandas\_libs\lib.cp37-win_amd64.pyd
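To make the effect of that fallback concrete, here is a minimal pure-Python sketch of the same pattern. It is not the pandas implementation; map_with_fallback and decode_utf8 are hypothetical names used only for illustration. Any TypeError or AttributeError raised for an element is swallowed and replaced by NaN, which is why the caller never sees the real error.

```python
import math


def map_with_fallback(func, values, na_value=float("nan")):
    """Apply func to each value; replace failing elements with na_value."""
    result = []
    for value in values:
        try:
            result.append(func(value))
        except (TypeError, AttributeError):
            # The original exception is discarded here, so only NaN is visible.
            result.append(na_value)
    return result


def decode_utf8(value):
    return value.decode("utf-8", "strict")


# str has no .decode in Python 3, so every element falls back to NaN.
from_str = map_with_fallback(decode_utf8, ["x", "y"])
# bytes decode normally.
from_bytes = map_with_fallback(decode_utf8, [b"x", b"y"])

print(from_str)
print(from_bytes)
```

The first call mirrors the NaN output in the original report, the second the case where the values really are bytes.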
Hello,
is there any update on the issue?
Since I ran into the same problem, it would be great if someone could help to solve it.
I would also like to register my hope for a solution to this :smiley:
Just trying to understand how updates on big packages like pandas work. So for this to be fixed, I guess it wouldn't be as simple as replacing return na_value with return x, since _map is very general. It doesn't look like the fix would go in _map, right?
So should the fix go somewhere higher up like in Series.str.decode itself? Something like the answers in
Any updates? Still getting it as of today with 0.25.3, have not tried 1.0.0 though
To the best of my knowledge, this issue has not been fixed yet (you should try out 1.0.0 or our master branch to confirm). However, we are definitely open to fixes for this!
I debugged this a bit. My use case is to open an HDF file with HDFStore(). The file was originally created in Python 2, and I'm opening it in Python 3. There are some DataFrames stored with Series containing strings.
In pandas.core.strings, _map calls lib.map_infer_mask(arr, f, mask.view(np.uint8), convert). This call should convert from bytes to str, but it crashes: TypeError: utf_8_decode() argument 2 must be str or None, not numpy.bytes_. This is what triggers the except block @abrakababra mentioned.
The reason is that errors is b'strict' (bytes) and not 'strict' (str), so argument validation fails.
The value for errors comes from HDFStore()._handle._v_attrs in the tables module, which does not convert the bytes attributes to str. This is a PyTables bug, if you want to call it that.
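The argument-validation failure can be reproduced without pandas or PyTables. The sketch below uses only the standard library; normalize_errors and try_decode are hypothetical helpers, not part of either package. Since numpy.bytes_ is a subclass of bytes, the same isinstance check should also cover the value seen in the traceback, though that is an assumption not exercised here.

```python
import codecs


def normalize_errors(errors):
    """Hypothetical helper: return `errors` as str, decoding bytes if needed."""
    if isinstance(errors, bytes):
        return errors.decode("ascii")
    return errors


def try_decode(data, errors):
    """Decode UTF-8 bytes; report a TypeError as text instead of raising."""
    try:
        return codecs.utf_8_decode(data, errors)[0]
    except TypeError as exc:
        return "TypeError: {}".format(exc)


# Passing bytes as the error handler is rejected by the decoder.
failed = try_decode(b"x", b"strict")
# Converting the handler name to str first lets the decode succeed.
worked = try_decode(b"x", normalize_errors(b"strict"))

print(failed)
print(worked)
```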
I worked around this in my project with:
import functools
import tables

def fix_attrset_init(orig_init):
    # On Python 2, str is bytes, so there is nothing to fix.
    if str is bytes:
        return orig_init

    @functools.wraps(orig_init)
    def wrapper(self, node):
        orig_init(self, node)
        # Decode any user attribute that was loaded as bytes.
        for name in self._v_attrnamesuser:
            value = self.__dict__.get(name)
            if isinstance(value, bytes):
                self.__dict__[name] = value.decode("utf-8")

    return wrapper

tables.attributeset.AttributeSet.__init__ = fix_attrset_init(
    tables.attributeset.AttributeSet.__init__
)
Regarding the OP's code snippet.
x = np.array(['x','y'])
pd.Series(x).str.decode(encoding='UTF-8',errors='strict')
This works in Python 2 because it decodes str objects to unicode objects.
In Python 3, str.decode does not exist; str IS the unicode representation of strings in memory. So internally StringMethods.decode throws an error because it tries to decode str objects, and the except block then produces a sequence of empty (NaN) values.
The following works, and in the end it is what PyTables and pandas see: a series of bytes being converted to str.
x = np.array([b'x',b'y'])
pd.Series(x).str.decode(encoding='UTF-8',errors='strict')
The original bug is actually errors being bytes and not str.
I'm running into a similar error, but with a twist.
I have a dataframe with a column that somehow holds strings of the form b'{filepath}' and is reported as float64 according to pd.dtypes(). I generated this list by doing list(dataset.as_numpy_iterator()), where dataset is a tf.data.Dataset.
I tried @joaoe's solution, as
images_df["Filenames"] = pd.Series(images_df["Filenames"]).str.decode(encoding='UTF-8',errors='strict')
It throws the following error trace:
AttributeError Traceback (most recent call last)
<ipython-input-88-e3f3b418ecda> in <module>
----> 1 class_df["Filename"] = pd.Series(class_df["Filename"]).str.decode(encoding='UTF-8',errors='strict')
2 class_df.head(10)
/gpfs/loomis/project/dollar/ajj38/conda_envs/py37_torchcuda15/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)
5268 or name in self._accessors
5269 ):
-> 5270 return object.__getattribute__(self, name)
5271 else:
5272 if self._info_axis._can_hold_identifiers_and_holds_name(name):
/gpfs/loomis/project/dollar/ajj38/conda_envs/py37_torchcuda15/lib/python3.7/site-packages/pandas/core/accessor.py in __get__(self, obj, cls)
185 # we're accessing the attribute of the class, i.e., Dataset.geo
186 return self._accessor
--> 187 accessor_obj = self._accessor(obj)
188 # Replace the property with the accessor object. Inspired by:
189 # http://www.pydanny.com/cached-property.html
/gpfs/loomis/project/dollar/ajj38/conda_envs/py37_torchcuda15/lib/python3.7/site-packages/pandas/core/strings.py in __init__(self, data)
2039
2040 def __init__(self, data):
-> 2041 self._inferred_dtype = self._validate(data)
2042 self._is_categorical = is_categorical_dtype(data)
2043 self._is_string = data.dtype.name == "string"
/gpfs/loomis/project/dollar/ajj38/conda_envs/py37_torchcuda15/lib/python3.7/site-packages/pandas/core/strings.py in _validate(data)
2096
2097 if inferred_dtype not in allowed_types:
-> 2098 raise AttributeError("Can only use .str accessor with string values!")
2099 return inferred_dtype
2100
AttributeError: Can only use .str accessor with string values!
Any idea on what to do? Specifically, how do I convert bytes that are somehow of type float64 into strings?
@auchtopus If you want to convert a series of bytes to str, this is simple enough:
class_df.Filename = class_df.Filename.map(lambda b: b.decode("utf-8"))
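If the column might mix bytes with values that are already str or missing, a slightly more defensive variant of the same idea could look like the sketch below. to_text is a hypothetical helper, and the sample values are made up for illustration; it decodes bytes and passes everything else through unchanged.

```python
def to_text(value, encoding="utf-8"):
    """Decode bytes to str; leave str, None, NaN and other values unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Made-up sample values: bytes, an already-decoded str, and a missing value.
values = [b"a/b.png", "c/d.png", None]
converted = [to_text(v) for v in values]

print(converted)
```

With a Series like the one above, this would presumably be applied as class_df.Filename.map(to_text).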