Kedro: Allow CSVLocalDataSet and CSVDataSet to accept file-like object

Created on 8 May 2020 · 10 Comments · Source: quantumblacklabs/kedro

Description

I got a UnicodeDecodeError when trying to parse a CSV file (from an external data source) in which some columns use inconsistent character encodings. I tried various encodings (see below) and got a UnicodeDecodeError no matter which encoding I passed in.

Rather than trying to get the data provider to use a consistent encoding (it's an external data source), I would just like to read that column and discard (swallow) the bad chars.

Context

The CSV file I was trying to read is shown as "Windows-1258" when I open it in Notepad++. The text is mostly plain ASCII, except for a few "bad chars".

I tried "ascii", "utf-8", "utf-8-sig", "latin1", "windows-1258", "ISO-8859-1", etc. (basically whatever encodings I could get from SO answers to similar questions), but none of them worked: I got a UnicodeDecodeError no matter which encoding I passed in. I don't think any single encoding works here.

It seems the file was edited with a non-UTF-8 editor (probably Excel) and contains some characters that are not UTF-8.

While debugging, I found that a bad line that causes the error looks something like this:
40|Malaysia|"Clorox\xe2\x80\x9d (home cleaning products)|NA

I believe the \xe2\x80\x9d is the "bad char" that causes the error. It seems to be a special double quote, which I guess is supposed to match the leading ASCII double quote, but whoever edited the file might have been using a special input method.

I believe this kind of mixed-encoding text file is quite common in the real world. Similar "bad character" examples (and possible solutions) are discussed here and here.

Possible Implementation

The problem is that although pandas lets you specify an encoding, it does not let you ignore errors or automatically replace the offending bytes. However, even though pandas has no provision for special error handling, Python's open function does (assuming Python 3), and pandas.read_csv accepts a file-like object.

For example, this will help me ignore the bad chars:

with open(filepath, encoding='utf8', errors='ignore') as fd:
    pd.read_csv(fd, ...)

However, the current CSVLocalDataSet calls pandas.read_csv but assumes that filepath is only a path string (even though pandas.read_csv accepts a file-like object).

A possible implementation might be to allow the filepath parameter to be a file-like object.
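
The proposal above could be sketched roughly as follows. This is a minimal, stdlib-only illustration and not Kedro's code: `read_rows` is a hypothetical helper, and the `csv` module stands in for pandas.read_csv so the sketch has no third-party dependencies.

```python
import csv
import os
import tempfile


def read_rows(source, sep="|", encoding="utf-8", errors="ignore"):
    """Hypothetical helper: `source` is a path string or an open text file."""
    if hasattr(source, "read"):
        # Already a file-like object: the caller chose encoding and errors.
        return list(csv.reader(source, delimiter=sep, quoting=csv.QUOTE_NONE))
    # A path: open it ourselves and drop any bytes that fail to decode.
    with open(source, encoding=encoding, errors=errors, newline="") as fd:
        return list(csv.reader(fd, delimiter=sep, quoting=csv.QUOTE_NONE))


def demo():
    # Write one row containing a stray 0x9d byte, which is not valid UTF-8.
    with tempfile.NamedTemporaryFile(mode="wb", delete=False) as tmp:
        tmp.write(b"ID|Name\n40|Clorox\x9d\n")
    try:
        return read_rows(tmp.name)
    finally:
        os.remove(tmp.name)
```

Calling `demo()` reads the file through the path branch; passing an object such as `io.StringIO` exercises the file-like branch instead.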

Feature Request

yxw

All 10 comments

Hey @yxw! Thanks for raising this.

I _think_ this is actually fixed with the new-style datasets. If you use pandas.CSVDataSet, we actually pass the file object from fsspec into pandas.read_csv, and you can specify the encoding used in the fsspec.open call (which, on local file systems, is forwarded to the default `open`) in the `fs_args` argument.

So you would do something like:

my_dataset:
  type: pandas.CSVDataSet
  fs_args:
    fs_open_args:
      encoding: 'utf-8'

which would do something like:

with self._fs.open("filepath", encoding='utf-8') as f:
  pandas.read_csv(f)

These new kedro.extras.datasets were made available in Kedro 0.15.6, I believe, and are now the _only_ option from Kedro 0.16 onwards. Let me know if this is a suitable fix! Although I think fs_args only became available in 0.16, so you may need to upgrade (but it's a breaking change, so have a look at RELEASE.md for the migration guide).

mzjp2 on 22 May 2020

Hi @yxw, any luck following @mzjp2's suggestion above? Extra args like encoding and mode could fix the issue you're facing.

lorenabalan on 27 May 2020

Thanks @mzjp2 for the suggestion. I did try it, and the UnicodeDecodeError still occurs.

Here is my test case to reproduce it:

import os
import tempfile
import pytest
from kedro.io import (
    DataCatalog,
)

from kedro.extras.datasets.pandas import CSVDataSet


def test_load_ds_with_pandas_csvdataset():
    encoding = 'utf8'
    temp = tempfile.NamedTemporaryFile(mode="w", encoding=encoding, delete=False)

    try:
        buffer = """ID|Name|Products
C00001232DA|Clorox (M) Sdn Bhd|"Clorox” (home cleaning products)

        """
        temp.write(buffer)
        temp.close()

        io = DataCatalog(
            {
                "test_ds": CSVDataSet(temp.name,
                            load_args=dict(
                                sep='|', 
                                encoding=encoding, 
                                skip_blank_lines=True, 
                                error_bad_lines=False, 
                                warn_bad_lines=True, 
                                quotechar=None, 
                                quoting=3)),
            }
        )

        bs_data = io.load("test_ds")
        assert len(bs_data.index) == 1
    finally:
        os.remove(temp.name)

Do notice the different double quote chars around Clorox: the closing double quote is a "special" double quote (\xe2\x80\x9d), which I think breaks the decoding. The error I get is still
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 58: character maps to <undefined>.
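
One way to check what these bytes are, using only the standard library. The sketch below reflects an assumption about the cause, not a confirmed diagnosis: it shows that \xe2\x80\x9d is the valid UTF-8 encoding of U+201D (right double quotation mark), and that a "charmap" error on byte 0x9d is what Python raises when the same bytes are decoded as cp1252, which is often the locale default on Windows.

```python
raw = b"\xe2\x80\x9d"

# Decoded as UTF-8, the three bytes form a single character.
quote = raw.decode("utf-8")
is_right_double_quote = quote == "\u201d"

# Decoded as cp1252, byte 0x9d has no mapping, so strict decoding fails.
try:
    raw.decode("cp1252")
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = str(exc)

# With errors="ignore" the unmapped byte is dropped; the rest is mojibake.
ignored = raw.decode("cp1252", errors="ignore")
```

If that reading is right, the failure would come from the call that opens the file with a default encoding rather than from the CSV parser itself.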

I believe the real problem is mixed encoding (e.g. in a multilingual organization, people may accidentally switch to their language's special input method when typing), and I don't think pandas.read_csv itself has a way to process (or swallow) encoding errors. There is a similar discussion about this here

What I do to work around it is something like this:

    def _load(self) -> pd.DataFrame:
        load_path = Path(self._get_load_path())
        encoding = self._load_args.get("encoding", "utf8")
        self._load_args["encoding"] = None
        with load_path.open(encoding=encoding, errors="ignore") as buffer:
            return pd.read_csv(buffer, **self._load_args)

Basically in my case I need the errors='ignore' to "swallow" the unexpected "bad" char.
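
For reference, a small stdlib sketch of how the decoding error handlers differ. The sample bytes are made up for illustration. "ignore" silently drops undecodable bytes, while "replace" substitutes U+FFFD so the data loss stays visible.

```python
# Made-up sample with one byte (0x9d) that is not valid UTF-8.
sample = b"Clorox\x9d (home cleaning products)"


def decode_with(errors):
    """Decode the sample as UTF-8 using the given error handler."""
    return sample.decode("utf-8", errors=errors)


try:
    decode_with("strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = decode_with("ignore")    # invalid byte dropped
replaced = decode_with("replace")  # invalid byte becomes U+FFFD
```

The same `errors` values are accepted by the builtin `open`, which is why the workaround above passes `errors="ignore"` there.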

Do you know if there is a way to pass in the "errors" parameter in catalog.yml to CSVDataSet ?

yxw on 28 May 2020

Hi @yxw, thank you so much for that snippet -- it makes it so much easier!

That said, I don't seem to get the error on my macOS laptop (on Kedro 0.16.1 or 0.15.9), so it might be a Windows-specific thing? In any case, as long as you're on Kedro 0.16.1, this _should_ (theoretically) work:

import os
import tempfile
from kedro.io import DataCatalog

from kedro.extras.datasets.pandas import CSVDataSet


def test_load_ds_with_pandas_csvdataset():
    encoding = "utf8"
    temp = tempfile.NamedTemporaryFile(mode="w", encoding=encoding, delete=False)

    try:
        buffer = """ID|Name|Products
C00001232DA|Clorox (M) Sdn Bhd|"Clorox” (home cleaning products)

        """
        temp.write(buffer)
        temp.close()

        io = DataCatalog(
            {
                "test_ds": CSVDataSet(
                    temp.name,
                    load_args=dict(
                        sep="|",
                        skip_blank_lines=True,
                        error_bad_lines=False,
                        warn_bad_lines=True,
                        quotechar=None,
                        quoting=3,
                    ),
                    fs_args=dict(
                        open_args_load=dict(encoding=encoding, errors="ignore")
                    ),
                )
            }
        )

        bs_data = io.load("test_ds")
        assert len(bs_data.index) == 1
    finally:
        os.remove(temp.name)

Does that help? The workaround you have is already pretty much what we do, see here:

https://github.com/quantumblacklabs/kedro/blob/d291a21bee56fdd7da4426e817fab43c9ece2302/kedro/extras/datasets/pandas/csv_dataset.py#L152-L156
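
The pattern being described, forwarding user-supplied open arguments to the call that opens the file, can be sketched without Kedro or fsspec. Everything here is hypothetical: `TinyCSVDataSet` is not a Kedro class, the builtin `open` stands in for the filesystem's open, and the `csv` module stands in for pandas.read_csv. Only the `fs_args` and `open_args_load` names are taken from the example above.

```python
import csv
import os
import tempfile
from copy import deepcopy


class TinyCSVDataSet:
    """Hypothetical stand-in illustrating the fs_args / open_args_load pattern."""

    def __init__(self, filepath, load_args=None, fs_args=None):
        self._filepath = filepath
        self._load_args = deepcopy(load_args) or {}
        fs_args = deepcopy(fs_args) or {}
        # Arguments for the open call are kept apart from parser arguments.
        self._open_args_load = fs_args.pop("open_args_load", {})

    def load(self):
        # Encoding and error handling are decided when the file is opened,
        # so the parser only ever sees already-decoded text.
        with open(self._filepath, newline="", **self._open_args_load) as fd:
            return list(csv.reader(fd, **self._load_args))


def demo():
    with tempfile.NamedTemporaryFile(mode="wb", delete=False) as tmp:
        tmp.write(b"ID|Products\n1|Clorox\x9d\n")
    try:
        dataset = TinyCSVDataSet(
            tmp.name,
            load_args={"delimiter": "|", "quoting": csv.QUOTE_NONE},
            fs_args={"open_args_load": {"encoding": "utf-8", "errors": "ignore"}},
        )
        return dataset.load()
    finally:
        os.remove(tmp.name)
```

Keeping the open arguments separate from the parser arguments means an option like `errors` never has to be understood by the parser.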

mzjp2 on 28 May 2020

Hi @mzjp2, thanks for the tips on fs_args. I didn't know about that; this is my first Kedro project :).

However, I still get the same UnicodeDecodeError. My Kedro version is 0.15.9 and I'm running on Windows 10. Since you don't get the same error, it might be a Windows-specific thing.

yxw on 28 May 2020

The fs_args will only work on 0.16 onwards :(

mzjp2 on 28 May 2020

Oh, I just realized there were two releases about a week ago! I checked the releases two weeks ago and thought I had the latest version.
And yes, that's probably the reason. I'll upgrade and verify again.

yxw on 28 May 2020

Awesome. Let me know how you get on :)

mzjp2 on 28 May 2020

Hi @mzjp2, after migrating my project from 0.15.9 to 0.16.1 (it took some effort), I managed to get my tests to pass with the new fs_args.open_args_load parameters, as shown in your example code above.

Thank you so much!
This issue can be closed.

yxw on 1 Jun 2020

Perfect. Glad to hear. We're trying to improve our release process so that bugfixes aren't packaged with breaking changes and you can benefit from bug fixes/new features without migrating your project. I'll go ahead and close this now :)

mzjp2 on 1 Jun 2020
