Kedro: Pandas not reading UTF-8

Created on 5 Jun 2020 · 2 Comments · Source: quantumblacklabs/kedro

Description

Loading a CSV file as a pandas.CSVDataSet raises an error stating that the ascii codec can't decode a character, even though utf-8 was explicitly set as the encoding for this file and an escape character is defined.

Context

The raw data file is exported from an external system and contains non-ASCII characters encoded as UTF-8. Although the encoding is set to utf-8, the error indicates that the file is being decoded as ASCII, which only covers code points 0 to 127.

Steps to Reproduce

  1. Define a dataset for a CSV file that contains 'ã' or 'ñ' in a string:
test_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/test_data.csv
  load_args:
    sep: ','
    escapechar: '\'
    encoding: 'utf_8'
  2. In a Jupyter notebook:
from kedro.framework.context import load_context
context = load_context("../")
catalog = context.catalog
test_data = catalog.load("test_data")

Expected Result

test_data should be a pandas DataFrame.

Actual Result

The process stops and throws an error.

DataSetError: Failed while loading data from data set CSVDataSet(filepath=/Users/../data/01_raw/file.csv, load_args={'encoding': utf_8, 'escapechar': \, 'sep': ,}, protocol=file, save_args={'index': False}).
'ascii' codec can't decode byte 0xc3 in position 202371: ordinal not in range(128)
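
The byte named in the error message is consistent with the characters from the reproduction steps. The following is a minimal sketch using only the Python standard library (the helper name `decode_error` is hypothetical, not part of Kedro or pandas). It shows that 'ã' and 'ñ' both start with byte 0xc3 when encoded as UTF-8, and that decoding those bytes as ASCII produces the same kind of message:

```python
# "\u00e3" is 'ã' and "\u00f1" is 'ñ', written as escapes so the
# script itself stays pure ASCII.
CHARS = ("\u00e3", "\u00f1")

# Both characters encode to two bytes in UTF-8, and the first byte is 0xc3.
for char in CHARS:
    raw = char.encode("utf-8")
    print(ascii(char), raw, hex(raw[0]))


def decode_error(raw: bytes, codec: str) -> str:
    """Return the decoding error message, or an empty string on success."""
    try:
        raw.decode(codec)
    except UnicodeDecodeError as exc:
        return str(exc)
    return ""


# Decoding UTF-8 bytes with the ascii codec fails on the 0xc3 byte.
ascii_message = decode_error(CHARS[0].encode("utf-8"), "ascii")
# Decoding the same bytes as UTF-8 succeeds.
utf8_message = decode_error(CHARS[0].encode("utf-8"), "utf_8")

print(ascii_message)
```

So the message suggests that the file content is valid UTF-8 and that some step in the load path is decoding it as ASCII.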

Your Environment

  • Kedro version used (pip show kedro or kedro -V): 0.16.1
  • Python version used (python -V): 3.7.3
  • Operating system and version: Mac OSX Catalina
Bug Report

All 2 comments

What happens here is that we use fsspec to open the file and then pass the open file object to pandas, so the encoding used to open the file comes from the fsspec arguments. We expose configuration for those arguments, so can you try the following:

test_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/test_data.csv
  fs_args:
    open_args_load:
      encoding: 'utf_8'
  load_args:
    sep: ','
    escapechar: '\'
    encoding: 'utf_8'

Could you also try different combinations of setting encoding in open_args_load and load_args?
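
The open-then-parse pattern described in this comment can be illustrated with a small standard-library sketch. This is not Kedro's implementation: the built-in `open()` stands in for fsspec, `csv.reader` stands in for `pandas.read_csv`, the function name `load_rows` is hypothetical, and `encoding="ascii"` is passed explicitly to simulate a platform whose default text encoding is ASCII.

```python
import csv
import os
import tempfile


def load_rows(path, open_args=None, load_args=None):
    """Open the file first, then hand the open handle to the parser.

    Hypothetical stand-in for a dataset's load step: open() plays the
    role of fsspec and csv.reader plays the role of pandas.read_csv.
    """
    with open(path, mode="r", newline="", **(open_args or {})) as handle:
        return list(csv.reader(handle, **(load_args or {})))


# Write a small UTF-8 CSV containing 'ã' (\u00e3) and 'ñ' (\u00f1).
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write("name,city\nJo\u00e3o,Logro\u00f1o\n".encode("utf-8"))
    path = tmp.name

# Parser options, analogous to load_args in the catalog entry.
parse_args = {"delimiter": ",", "escapechar": "\\"}

# Opening with the wrong encoding fails no matter what the parser is given,
# because decoding happens while reading from the already open handle.
try:
    load_rows(path, open_args={"encoding": "ascii"}, load_args=parse_args)
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Setting the encoding at the open step, analogous to open_args_load, works.
rows = load_rows(path, open_args={"encoding": "utf_8"}, load_args=parse_args)

os.remove(path)
print(ascii_failed, len(rows))
```

In this sketch the parser never sees raw bytes, only text that the file handle has already decoded, which is why the encoding has to be supplied where the file is opened.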

@mzjp2 that solved my issue. Thanks for this.
I was not aware of this required configuration, but great tip!
