I got a UnicodeDecodeError while trying to parse a CSV file (from an external data source) in which some columns use inconsistent character encodings. I tried various encodings (see below) and got UnicodeDecodeError no matter which encoding I passed in.
Rather than trying to get the data provider to use a consistent encoding (it's an external data source), I would just like to read that column and discard (swallow) the bad chars.
Notepad++ shows the CSV file as "Windows-1258" when I open it. The text is mostly plain ASCII, except for a few "bad chars".
I tried "ascii", "utf-8", "utf-8-sig", "latin1", "windows-1258", "ISO-8859-1", etc. (basically whatever encoding I could find in SO answers to similar questions), but none of them worked for me. I don't think any single encoding works here.
It seems the file was edited with a non-UTF-8 editor (probably Excel) and contains some characters that are not UTF-8.
While debugging, I found that a line causing the error looks something like this:
40|Malaysia|"Clorox\xe2\x80\x9d (home cleaning products)|NA
I believe the \xe2\x80\x9d is the "bad char" that causes the error. It looks like a special double quote that is supposed to match the leading ASCII double quote; whoever edited the file may have been using a special input method.
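To check that reading of the bytes, here is a minimal sketch using only the standard library. It assumes (this is not stated in the report) that the failing read fell back to cp1252, a common Windows default; that assumption would be consistent with a 'charmap' error on byte 0x9d. The sample bytes are copied from the line above.

```python
# Bytes copied from the failing line in the report.
raw = b'"Clorox\xe2\x80\x9d (home cleaning products)'

# Decoded as UTF-8, the three bytes form a single character:
# U+201D RIGHT DOUBLE QUOTATION MARK.
as_utf8 = raw.decode("utf-8")

# Decoded as cp1252 (assumption: the platform default on the reporter's
# machine), byte 0x9d has no mapping, so decoding raises.
try:
    raw.decode("cp1252")
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = exc

# errors="ignore" drops only the undecodable byte; 0xe2 and 0x80 still
# decode to other characters, so the result is lossy rather than clean.
lossy = raw.decode("cp1252", errors="ignore")

print(repr(as_utf8))
print(cp1252_error)
print(repr(lossy))
```

If this reading is right, the sequence is valid UTF-8, and the error would come from the file being opened with a different codec than the one requested rather than from the data itself.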
I believe this kind of mixed-encoding text file is quite common in the real world. Similar "bad character" examples (and possible solutions) are discussed here and here.
The problem is that, although pandas allows you to specify an encoding, it does not allow you to ignore errors or to automatically replace the offending bytes. However, even though pandas has no provision for special error handling, Python's open function does (assuming Python 3), and pandas.read_csv accepts a file-like object.
For example, this will help me ignore the bad chars:
with open(filepath, encoding='utf8', errors='ignore') as fd:
    pd.read_csv(fd, ...)
However, the current CSVLocalDataSet calls pandas.read_csv but assumes that filepath is only a path string (even though pandas.read_csv accepts a file-like object).
A possible implementation might be to allow the filepath parameter to be a file-like object.
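A minimal sketch of what accepting either a path or a file-like object could look like. The helper name read_csv_lenient and its parameters are hypothetical; this is not Kedro's or pandas' API, only an illustration of the idea, and the sample bytes are made up.

```python
import os
import tempfile

import pandas as pd


def read_csv_lenient(source, encoding="utf-8", errors="ignore", **read_csv_kwargs):
    """Read a CSV from a path or from an already-open text file object.

    Hypothetical helper for illustration; not part of Kedro or pandas.
    """
    if isinstance(source, (str, os.PathLike)):
        # We own the file handle, so we control how decoding errors are handled.
        with open(source, encoding=encoding, errors=errors) as fd:
            return pd.read_csv(fd, **read_csv_kwargs)
    # Otherwise assume a text-mode file-like object that the caller opened
    # with whatever encoding and error handling they need.
    return pd.read_csv(source, **read_csv_kwargs)


# Sample file with a stray 0x9d byte, which is not valid UTF-8 on its own.
with tempfile.NamedTemporaryFile(mode="wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"ID|Name\n40|Clorox\x9d\n")

try:
    from_path = read_csv_lenient(tmp.name, sep="|")
    with open(tmp.name, encoding="utf-8", errors="ignore") as fd:
        from_file = read_csv_lenient(fd, sep="|")
finally:
    os.remove(tmp.name)

print(from_path)
```

Both calls should yield the same frame, with the undecodable byte silently dropped from the Name column.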
Hey @yxw! Thanks for raising this.
I _think_ this is actually fixed with the new-style datasets. If you use pandas.CSVDataSet, we actually pass the file object from fsspec into pandas.read_csv, and you can specify the encoding used in the fsspec.open call (which, on local file systems, is forwarded to the default open) in the fs_args argument.
So you would do something like:
my_dataset:
  type: pandas.CSVDataSet
  fs_args:
    fs_open_args:
      encoding: 'utf-8'
which would do something like:
with self._fs.open("filepath", encoding='utf-8') as f:
    pandas.read_csv(f)
These new kedro.extras.datasets were made available in Kedro 0.15.6, I believe, and are the _only_ option from Kedro 0.16 onwards. Let me know if this is a suitable fix! Although I think fs_args only became available in 0.16, so you may need to upgrade (but it's a breaking change, so have a look at RELEASE.md for the migration guide).
Hi @yxw, any luck following @mzjp2's suggestion above? Extra args like encoding and mode could fix the issue you're facing.
Thanks @mzjp2 for the suggestion. I did try it, and the UnicodeDecodeError still occurs.
Here is my test case to reproduce it:
import os
import tempfile

import pytest
from kedro.io import (
    DataCatalog,
)
from kedro.extras.datasets.pandas import CSVDataSet


def test_load_ds_with_pandas_csvdataset():
    encoding = 'utf8'
    temp = tempfile.NamedTemporaryFile(mode="w", encoding=encoding, delete=False)
    try:
        buffer = """ID|Name|Products
C00001232DA|Clorox (M) Sdn Bhd|"Clorox” (home cleaning products)
"""
        temp.write(buffer)
        temp.close()
        io = DataCatalog(
            {
                "test_ds": CSVDataSet(temp.name,
                                      load_args=dict(
                                          sep='|',
                                          encoding=encoding,
                                          skip_blank_lines=True,
                                          error_bad_lines=False,
                                          warn_bad_lines=True,
                                          quotechar=None,
                                          quoting=3)),
            }
        )
        bs_data = io.load("test_ds")
        assert len(bs_data.index) == 1
    finally:
        os.remove(temp.name)
Note the different double quote characters around Clorox: the closing one is a "special" double quote (\xe2\x80\x9d), which I think breaks the decoding. The error I get is still:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 58: character maps to <undefined>.
I believe the real problem is mixed encoding (e.g. in a multilingual organization, people may accidentally switch to a special input method for their language when typing), and I don't think pandas.read_csv itself has a way to handle (or swallow) encoding errors. There is a similar discussion about this here.
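Note for later pandas versions: read_csv appears to accept an encoding_errors argument from pandas 1.3 onwards, which was not an option in the versions discussed in this thread. A minimal sketch, assuming pandas 1.3 or newer is installed; the sample bytes are made up for illustration.

```python
import io

import pandas as pd

# Sample data with a stray 0x9d byte, which is not valid UTF-8 on its own.
raw = b"ID|Name\n40|Clorox\x9d\n"

# encoding_errors is forwarded to the decoder (assumption: pandas >= 1.3),
# so "ignore" drops the undecodable byte instead of raising.
df = pd.read_csv(
    io.BytesIO(raw),
    sep="|",
    encoding="utf-8",
    encoding_errors="ignore",
)

print(df)
```

On older pandas this call would fail with an unexpected keyword argument, in which case opening the file yourself with errors="ignore" remains the way to go.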
My workaround is something like this:
def _load(self) -> pd.DataFrame:
    load_path = Path(self._get_load_path())
    encoding = self._load_args.get("encoding", "utf8")
    self._load_args["encoding"] = None
    with load_path.open(encoding=encoding, errors="ignore") as buffer:
        return pd.read_csv(buffer, **self._load_args)
Basically, in my case I need errors='ignore' to "swallow" the unexpected "bad" char.
Do you know if there is a way to pass the "errors" parameter to CSVDataSet in catalog.yml?
Hi @yxw, thank you so much for that snippet -- it makes it so much easier!
That said, I don't seem to get the error on my macOS laptop (on Kedro 0.16.1 or 0.15.9), so it might be a Windows-specific thing? In any case, as long as you're on Kedro 0.16.1, this _should_ (theoretically) work:
import os
import tempfile

from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import CSVDataSet


def test_load_ds_with_pandas_csvdataset():
    encoding = "utf8"
    temp = tempfile.NamedTemporaryFile(mode="w", encoding=encoding, delete=False)
    try:
        buffer = """ID|Name|Products
C00001232DA|Clorox (M) Sdn Bhd|"Clorox” (home cleaning products)
"""
        temp.write(buffer)
        temp.close()
        io = DataCatalog(
            {
                "test_ds": CSVDataSet(
                    temp.name,
                    load_args=dict(
                        sep="|",
                        skip_blank_lines=True,
                        error_bad_lines=False,
                        warn_bad_lines=True,
                        quotechar=None,
                        quoting=3,
                    ),
                    fs_args=dict(
                        open_args_load=dict(encoding=encoding, errors="ignore")
                    ),
                )
            }
        )
        bs_data = io.load("test_ds")
        assert len(bs_data.index) == 1
    finally:
        os.remove(temp.name)
Does that help? The workaround you have is already pretty much what we do, see here:
Hi @mzjp2, thanks for the tips on fs_args. I didn't realize that; this is my first Kedro project :).
However, I still get the same UnicodeDecodeError. My Kedro version is 0.15.9 and I'm running on Windows 10. Since you don't get the same error, it might be a Windows-specific thing.
The fs_args will only work on 0.16 onwards :(
Oh, I just realized there were two releases just about a week ago! I checked the releases two weeks ago, so I thought I had the latest version.
And yes, that's probably the reason. I'll upgrade and verify again.
Awesome. Let me know how you get on :)
Hi @mzjp2, after migrating my project from 0.15.9 to 0.16.1 (it took some effort), I managed to get my tests passing with the new fs_args.open_args_load parameters, as shown in your example code above.
Thank you so much!
This issue can be closed.
Perfect. Glad to hear. We're trying to improve our release process so that bugfixes aren't packaged with breaking changes and you can benefit from bug fixes/new features without migrating your project. I'll go ahead and close this now :)