I'm unable to load a non-utf-8-encoded file.
I know it doesn't work because of https://github.com/quantumblacklabs/kedro/blob/f03226e29b8a018a0f6edab6d3f1a0d37c1b1812/kedro/extras/datasets/pandas/csv_dataset.py#L154-L155. For it to work, encoding would have to be passed to open. However, some file systems don't support an encoding parameter... (e.g. gcsfs, I think).
Try loading https://github.com/beoutbreakprepared/nCoV2019/blob/433628fb828f3b3b3bff7d13195af357fe42e31d/ncov_outside_hubei.csv as a CSVDataSet.
I can load a cp1252-encoded file directly with pandas:
pd.read_csv("data/01_raw/nCoV2019/ncov_outside_hubei/20200304/ncov_outside_hubei.csv", encoding="cp1252")
I'm unable to load a cp1252-encoded file using Kedro:
DataSetError: Failed while loading data from data set CSVDataSet(filepath=data/01_raw/nCoV2019/ncov_outside_hubei/20200304/ncov_outside_hubei.csv, load_args={'encoding': cp1252, 'low_memory': False}, protocol=file, save_args={'index': False}).
'utf-8' codec can't decode byte 0xa0 in position 162456: invalid start byte
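The decode error above can be reproduced without Kedro. The following is a minimal sketch using only the standard library, assuming the file handle is opened in text mode with UTF-8 decoding so that the `encoding` in `load_args` never comes into play. The file name and sample content are made up for illustration.

```python
import os
import tempfile

# Byte 0xa0 is a non-breaking space in cp1252, but it is not a valid
# UTF-8 start byte, which is what the traceback above complains about.
raw = "city,note\nParis,open\u00a0air\n".encode("cp1252")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")  # hypothetical sample file
    with open(path, "wb") as f:
        f.write(raw)

    # Text mode with UTF-8: decoding fails before any CSV parser runs.
    try:
        with open(path, mode="r", encoding="utf-8") as f:
            f.read()
        utf8_error = None
    except UnicodeDecodeError as exc:
        utf8_error = str(exc)

    # Same file, encoding passed to open(): decoding succeeds.
    with open(path, mode="r", encoding="cp1252") as f:
        decoded = f.read()

print(utf8_error)
```

This is why setting `encoding` only in `load_args` has no effect in this situation: the bytes are decoded when the file is read through the text-mode handle, not by the `encoding` argument handed to the CSV reader.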
Include as many relevant details about the environment in which you experienced the bug:
Kedro version used (pip show kedro or kedro -V): 0.15.8
Python version used (python -V): 3.7.6
Operating system and version: 10.14.6
I have the same error here while trying to read from S3 if my csv file is not encoded in utf-8! Any workarounds?
Hi @millengustavo and @deepyaman, thanks for raising this. We're aware of this (and the other related issues from not being able to pass args to the fsspec open call) and it's on our backlog to fix soon!
For now, the only workaround I can think of is creating a custom dataset from the dataset you want to use and overriding the load and save methods to pass the relevant stuff into the fsspec open call 😄
Thanks for the quick reply @mzjp2
I modified the _load function in /kedro/extras/datasets/pandas/csv_dataset.py to pass the encoding for my specific case to the open call:
def _load(self) -> pd.DataFrame:
    load_path = get_filepath_str(self._get_load_path(), self._protocol)
    with self._fs.open(load_path, encoding="latin_1", mode="r") as fs_file:
        return pd.read_csv(fs_file, **self._load_args)
Since most of my files have the same encoding it worked, despite not being ideal.
Alternatively, you can do something like
def _load(self) -> pd.DataFrame:
    load_path = get_filepath_str(self._get_load_path(), self._protocol)
    with self._fs.open(load_path, **fs_open_kwargs) as fs_file:
        return pd.read_csv(fs_file, **self._load_args)
in a similar vein to the way we handle load_args or save_args, and then pass the relevant parameters inside your catalog.yml entry. :)
Or, if you only care about encoding, then you can make encoding one of your __init__ args and pass just encoding=self._encoding to the self._fs.open call. Hope that makes sense!
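As a rough illustration of the second suggestion, here is a minimal standalone sketch of a loader that takes `encoding` as a constructor argument and forwards it to the open call. `EncodedCSVLoader` is a hypothetical name and is not part of Kedro. To keep the example runnable without Kedro or pandas, the built-in `open` stands in for `self._fs.open` and `csv.DictReader` stands in for `pd.read_csv`; a real custom dataset would subclass the Kedro dataset instead.

```python
import csv
import os
import tempfile
from typing import Any, Dict, List, Optional


class EncodedCSVLoader:
    """Hypothetical stand-in for a custom dataset; not a Kedro class."""

    def __init__(
        self,
        filepath: str,
        encoding: str = "utf-8",
        load_args: Optional[Dict[str, Any]] = None,
    ) -> None:
        self._filepath = filepath
        self._encoding = encoding
        self._load_args = dict(load_args or {})

    def load(self) -> List[Dict[str, str]]:
        # Forward the stored encoding to the open call, then hand the
        # already-decoded text stream to the CSV reader.
        with open(
            self._filepath, mode="r", encoding=self._encoding, newline=""
        ) as f:
            return list(csv.DictReader(f, **self._load_args))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "people.csv")  # made-up sample file
    with open(path, "wb") as f:
        f.write("name;city\nRen\u00e9;Li\u00e8ge\n".encode("latin_1"))

    loader = EncodedCSVLoader(
        path, encoding="latin_1", load_args={"delimiter": ";"}
    )
    rows = loader.load()

print(rows)
```

Keeping `encoding` separate from `load_args` makes it explicit that the value is consumed when the file is opened rather than by the reader.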
Hi everyone, this should be fixed by this commit and made available in the next release: https://github.com/quantumblacklabs/kedro/commit/8329f452b96bda4ca3a3cd2b30f71f261e6a8af8
I'll go ahead and close this issue for now, feel free to come back if it's still causing you problems.
I am still facing the issue:
'utf-8' codec can't decode byte
DataSetError: Failed while loading data from data set CSVDataSet(filepath=..., load_args={'encoding': latin_1}, protocol=file, save_args={'index': False}).
while I specified
load_args:
  encoding: 'latin_1'
@bensdm According to https://stackoverflow.com/a/30470630/3858528, the value of encoding should be latin1, not latin_1. Hope this helps :)
Nope, it doesn't change anything. I already tried that and got the same issue:
Failed while loading data from data set CSVDataSet(filepath= ..., load_args={'encoding': latin1, 'sep': ;}, protocol=file, save_args={'index': False}).
'utf-8' codec can't decode byte 0xe9 in position 222: invalid continuation byte
@bensdm Are you able to read your data using regular pandas (not using the Kedro CSVDataSet)?
Yes, with both latin1 and latin_1.
Should I open a new issue?
Hi @bensdm, apologies for the delay. Could you try adding the following config to your catalog entry?
fs_args:
  open_args_load:
    mode: "rb"
We use fsspec underneath and it needs to open the file in binary mode. Try passing encoding to open_args_load or load_args as well, and that should work. If no combination works please feel free to open a new issue.
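The effect of binary mode can be sketched with the standard library alone. In this illustration `io.BytesIO` stands in for a file opened with `mode: "rb"`, and `io.TextIOWrapper` plus `csv.reader` stand in for the decoding that the CSV reader performs when it receives raw bytes together with an `encoding` argument. The sample data is made up; it contains byte 0xe9, the one named in the traceback earlier in this thread.

```python
import csv
import io

# latin-1 bytes containing 0xe9 ("é"), separated by ";".
raw = "nom;pr\u00e9nom\nDupont;Ren\u00e9\n".encode("latin-1")

# The binary handle delivers undecoded bytes, so the reader side is
# free to apply the encoding it was asked to use.
binary_file = io.BytesIO(raw)
with io.TextIOWrapper(binary_file, encoding="latin-1", newline="") as text:
    rows = list(csv.reader(text, delimiter=";"))

print(rows)
```

With a text-mode handle the bytes would already have been decoded (or have failed to decode) before the reader could apply its own `encoding`.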
Thanks, working as expected!
However, it does not work with a partitioned dataset:
DataSetError:
__init__() got an unexpected keyword argument 'fs_args'.
DataSet 'absences_raw' must only contain arguments valid for the constructor of kedro.io.partitioned_data_set.PartitionedDataSet.