Kedro: [KED-1473] `pandas.CSVDataSet` doesn't support `encoding` parameter

Created on 16 Mar 2020 · 14 Comments · Source: quantumblacklabs/kedro

Description

I'm unable to load a non-utf-8-encoded file.

I know it doesn't work because of https://github.com/quantumblacklabs/kedro/blob/f03226e29b8a018a0f6edab6d3f1a0d37c1b1812/kedro/extras/datasets/pandas/csv_dataset.py#L154-L155. For it to work, encoding would have to be passed to open. However, some file systems don't support an encoding parameter... (e.g. gcsfs, I think).
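
To illustrate the failure mode, here is a minimal sketch using only the standard library. The built-in open() stands in for the fsspec open call, and the file name and sample bytes are made up for the example. The assumption is that a handle opened in text mode decodes the bytes itself, so an encoding given to the parser afterwards never gets applied.

```python
import tempfile
from pathlib import Path

# Made-up sample: "café" encoded as cp1252, so é is the single byte 0xE9.
raw = "name\ncaf\u00e9\n".encode("cp1252")

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.csv"
    path.write_bytes(raw)

    # Text mode with UTF-8 decoding fails before any parser sees the data.
    try:
        with open(path, mode="r", encoding="utf-8") as handle:
            handle.read()
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True

    # Passing the right encoding to open() itself is what makes it work.
    with open(path, mode="r", encoding="cp1252") as handle:
        decoded = handle.read()
```

This is why the encoding has to reach the open call (or the file has to be opened in binary mode) rather than only load_args.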

Steps to Reproduce

Try loading https://github.com/beoutbreakprepared/nCoV2019/blob/433628fb828f3b3b3bff7d13195af357fe42e31d/ncov_outside_hubei.csv as a CSVDataSet.

Expected Result

I can load a cp1252-encoded file directly with pandas:

pd.read_csv("data/01_raw/nCoV2019/ncov_outside_hubei/20200304/ncov_outside_hubei.csv", encoding="cp1252")

Actual Result

I'm unable to load a cp1252-encoded file using Kedro:

DataSetError: Failed while loading data from data set CSVDataSet(filepath=data/01_raw/nCoV2019/ncov_outside_hubei/20200304/ncov_outside_hubei.csv, load_args={'encoding': cp1252, 'low_memory': False}, protocol=file, save_args={'index': False}).
'utf-8' codec can't decode byte 0xa0 in position 162456: invalid start byte

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used (pip show kedro or kedro -V): 0.15.8
  • Python version used (python -V): 3.7.6
  • Operating system and version: macOS Mojave Version 10.14.6
Bug Report

All 14 comments

I have the same error here while trying to read from S3 if my csv file is not encoded in utf-8! Any workarounds?

Hi @millengustavo and @deepyaman, thanks for raising this. We're aware of this (and the other related issues from not being able to pass args to the fsspec open call) and it's on our backlog to fix soon!

For now, the only workaround I can think of is creating a custom dataset from the dataset you want to use and overriding the load and save methods to pass the relevant stuff into the fsspec open call 😄

Thanks for the quick reply @mzjp2

I modified the _load function in /kedro/extras/datasets/pandas/csv_dataset.py to pass the encoding for my specific case to the open call.

def _load(self) -> pd.DataFrame:
    load_path = get_filepath_str(self._get_load_path(), self._protocol)

    # Encoding is hard-coded for my files and passed to the fsspec open call.
    with self._fs.open(load_path, encoding="latin_1", mode="r") as fs_file:
        return pd.read_csv(fs_file, **self._load_args)

Since most of my files have the same encoding it worked, despite not being ideal.

Alternatively, you can do something like

def _load(self) -> pd.DataFrame:
    load_path = get_filepath_str(self._get_load_path(), self._protocol)

    # self._fs_open_kwargs is a hypothetical dict you would store in __init__.
    with self._fs.open(load_path, **self._fs_open_kwargs) as fs_file:
        return pd.read_csv(fs_file, **self._load_args)

in a similar vein to the way we do load_args or save_args, and then pass the relevant parameters inside your catalog.yml entry. :)

Or, if you only care about encoding, then you can make encoding one of your __init__ args and pass just encoding=self._encoding to the self._fs.open call. Hope that makes sense!
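
A rough sketch of that last idea, so the shape is clear. This is not Kedro's API: EncodedCSVSketch is a hypothetical stand-in class, the built-in open() replaces self._fs.open, and csv.reader replaces pd.read_csv so that the snippet runs without Kedro or pandas installed. The file contents are made up.

```python
import csv
import tempfile
from pathlib import Path
from typing import List, Optional


class EncodedCSVSketch:
    """Hypothetical stand-in for a custom dataset class; not Kedro's API."""

    def __init__(self, filepath: str, encoding: Optional[str] = None) -> None:
        self._filepath = filepath
        self._encoding = encoding  # would come from the catalog.yml entry

    def _load(self) -> List[List[str]]:
        # Built-in open() stands in for self._fs.open(), and csv.reader for
        # pd.read_csv(); the point is only that encoding reaches the open call.
        with open(
            self._filepath, mode="r", encoding=self._encoding, newline=""
        ) as fs_file:
            return list(csv.reader(fs_file))


def demo() -> List[List[str]]:
    # Write a small latin_1-encoded file and load it through the sketch class.
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.csv"
        path.write_bytes("city\nS\u00e3o Paulo\n".encode("latin_1"))
        return EncodedCSVSketch(str(path), encoding="latin_1")._load()


rows = demo()
```

In a real custom dataset you would subclass the Kedro dataset and keep its path handling; only the encoding plumbing is shown here.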

Hi everyone, this should be fixed by this commit and made available in the next release: https://github.com/quantumblacklabs/kedro/commit/8329f452b96bda4ca3a3cd2b30f71f261e6a8af8
I'll go ahead and close this issue for now, feel free to come back if it's still causing you problems.

I am still facing the issue:

'utf-8' codec can't decode byte

DataSetError: Failed while loading data from data set CSVDataSet(filepath=..., load_args={'encoding': latin_1}, protocol=file, save_args={'index': False}).

while I specified

load_args:
    encoding: 'latin_1'

@bensdm According to https://stackoverflow.com/a/30470630/3858528, the value of encoding should be latin1, not latin_1. Hope this helps :)

Nope, it doesn't change anything. I already tried that, but I get the same issue:
Failed while loading data from data set CSVDataSet(filepath= ..., load_args={'encoding': latin1, 'sep': ;}, protocol=file, save_args={'index': False}). 'utf-8' codec can't decode byte 0xe9 in position 222: invalid continuation byte

@bensdm Are you able to read your data using regular Pandas (not using Kedro CSVDataSet) ?

Yes with both latin1 and latin_1

Should I open a new issue?

Hi @bensdm , apologies for the delay. Could you try adding the following config to your catalog entry?

fs_args:
    open_args_load:
        mode: "rb"

We use fsspec underneath and it needs to open the file in binary mode. Try passing encoding to open_args_load or load_args as well, and that should work. If no combination works please feel free to open a new issue.
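
For anyone wondering why binary mode helps, here is a small standard-library sketch. It is only roughly analogous to what happens inside Kedro and pandas: open() stands in for the fsspec call, and io.TextIOWrapper plus csv.reader stand in for the parser applying the encoding from load_args. The sample data is made up.

```python
import csv
import io
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.csv"
    # Made-up sample using ';' as the separator, as in the report above.
    path.write_bytes("nom;ville\nRen\u00e9;Orl\u00e9ans\n".encode("latin1"))

    # mode="rb": nothing is decoded when the file is opened ...
    with open(path, mode="rb") as raw_file:
        # ... so the reader is free to apply the requested encoding.
        text = io.TextIOWrapper(raw_file, encoding="latin1", newline="")
        rows = list(csv.reader(text, delimiter=";"))
```

With a text-mode handle the bytes would already have been decoded (as UTF-8 in the errors above) before the encoding argument could matter.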

thanks, working as expected !

However, it is not working with a partitioned dataset:
DataSetError: __init__() got an unexpected keyword argument 'fs_args'. DataSet 'absences_raw' must only contain arguments valid for the constructor of kedro.io.partitioned_data_set.PartitionedDataSet.
