When a CSV file is loaded as a pandas.CSVDataSet, an error is thrown stating that the 'ascii' codec can't decode a byte, even though utf-8 is explicitly set as the encoding for this file and an escape character is defined.
The raw data file is exported from an external system and contains non-ASCII UTF-8 characters. Although the encoding is set to utf-8, the error indicates that the loader expects only ASCII characters in the 0-127 range.
test_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/test_data.csv
  load_args:
    sep: ','
    escapechar: '\'
    encoding: 'utf_8'
from kedro.framework.context import load_context
context = load_context("../")
catalog = context.catalog
test_data = catalog.load("test_data")
test_data should be a pandas DataFrame.
The process stops and throws an error.
DataSetError: Failed while loading data from data set CSVDataSet(filepath=/Users/../data/01_raw/file.csv, load_args={'encoding': utf_8, 'escapechar': \, 'sep': ,}, protocol=file, save_args={'index': False}).
'ascii' codec can't decode byte 0xc3 in position 202371: ordinal not in range(128)
Kedro version (pip show kedro or kedro -V): 0.16.1
Python version (python -V): 3.7.3
So what we do here is that we use fsspec to load the file and then pass the file descriptor to pandas. We provide configuration for the fsspec arguments, so can you try the following:
test_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/test_data.csv
  fs_args:
    open_args_load:
      encoding: 'utf_8'
  load_args:
    sep: ','
    escapechar: '\'
    encoding: 'utf_8'
and different combinations of including encoding in open_args_load and load_args?
@mzjp2 that solved my issue. Thanks for this.
I was not aware of this required configuration, but great tip!
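For anyone wondering why the encoding in load_args alone did not help: going by the explanation in this thread, the file is opened first and the parser only receives the already-open handle, so the decoding is governed by whatever encoding the file was opened with. The sketch below illustrates that mechanism with the standard library only. The built-in open() and the csv module are stand-ins for fsspec and pandas, and the names read_rows and demo are hypothetical; this is not Kedro's actual code path.

```python
import csv
import os
import tempfile


def read_rows(path, encoding):
    """Open `path` in text mode with `encoding` and parse it as CSV.

    Hypothetical helper: the open() call plays the role of the file opener,
    csv.reader plays the role of the parser that receives the open handle.
    """
    # The file object decodes bytes as the parser pulls lines from it,
    # so a wrong encoding here cannot be corrected by the parser.
    with open(path, mode="r", encoding=encoding, newline="") as handle:
        return list(csv.reader(handle))


def demo():
    """Read the same UTF-8 file once as ASCII and once as UTF-8."""
    # "é" is encoded in UTF-8 as the two bytes 0xC3 0xA9.
    with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
        tmp.write("name,city\nRené,Zürich\n".encode("utf-8"))
        path = tmp.name
    try:
        try:
            read_rows(path, encoding="ascii")
            ascii_error = None
        except UnicodeDecodeError as exc:
            # Same kind of message as in the report above.
            ascii_error = str(exc)
        rows = read_rows(path, encoding="utf-8")
    finally:
        os.remove(path)
    return ascii_error, rows


if __name__ == "__main__":
    ascii_error, rows = demo()
    print(ascii_error)
    print(rows)
```

Opening with ascii fails on the first 0xC3 byte, while opening the same file with utf-8 parses cleanly. That matches the fix above, where the encoding is passed to the opener through fs_args and open_args_load.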