Kedro: Error trying to use "header: None" within catalog.yml

Created on 28 May 2020 · 6 Comments · Source: quantumblacklabs/kedro

Description

Trying to read in a file from the catalog that is a text file using CSVDataSet causes the first line of data to become the header for the columns. When I try to use 'header: None', which is a valid argument, I get an error.

Context

I'm basically trying to tell pandas that there is no header and that there are no names so that all the columns default to just integer labels.

Steps to Reproduce

Place .txt file into the raw layer
Alter catalog to read in data file using the following:
raw_data:
  type: pandas.CSVDataSet
  filepath: data/a_raw/raw_data.txt
  load_args:
    sep: "|"
    skiprows: 1
    header: None
    names: None
    encoding: "ISO-8859-1"
    quoting: 3
Save catalog.yml with changes
Run 'kedro ipython' from directory and within kedro environment
Run following commands:
x = catalog.load('raw_data')
Code produces error

Expected Result

Passing that list of arguments as load_args shouldn't cause an error. What should result is a dataframe with integers as column names and the raw data separated by '|' characters. I have done all this before using a local jupyter notebook.

Actual Result

I get an error with the 'header: None' parameter specifically.

ValueError: header must be integer or list of integers

DataSetError: Failed while loading data from data set CSVDataSet(filepath=C:\Users\<username>\Desktop\kedro_tutorial\data\a_raw\raw_data.txt, load_args={'encoding': ISO-8859-1, 'header': None, 'names': None, 'quoting': 3, 'sep': |, 'skiprows': 1}, protocol=file, save_args={'index': False}).
header must be integer or list of integers

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

Kedro version used (pip show kedro or kedro -V): Version: 0.16.1
Python version used (python -V): Python 3.7.7
Operating system and version: Windows 10 Pro Version 10.0.17763 Build 17763

Bug Report

Source

Burn1n9m4n

👍1

All 6 comments

Can you provide a minimal working example with an example a.txt file? I (at least) can't reproduce this.

From what I can see, kedro is just bubbling up pandas's error. Do you get the same thing if you run pd.read_csv("data/a_raw/a.txt", header=None, names=None, ...)?

Note that the dataset uses fsspec under the hood and passes an open file object to pd.read_csv rather than the path to the file itself. So it _may_ make sense to pass the encoding parameter to fs_args under open_args_load (I don't know if the encoding is causing the data to look malformed, so this may or may not be worth a try), like so:

raw_data:
  type: pandas.CSVDataSet
  filepath: data/a_raw/raw_data.txt
  load_args:
    sep: "|"
    skiprows: 1
    header: None
    names: None
    quoting: 3
  fs_args:
    open_args_load:
      encoding: "ISO-8859-1"
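
The idea of decoding the file when it is opened, and then handing the open handle to pandas, can be sketched outside kedro. This is only an illustration under stated assumptions: Python's built-in open stands in for fsspec, and the file contents are made up rather than taken from the reporter's data.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file: one leading row to skip, then one pipe-separated row.
    path = os.path.join(tmp_dir, "raw_data.txt")
    with open(path, "w", encoding="ISO-8859-1") as f:
        f.write("pulled on some date\n")
        f.write("CAFÉ|123|2020-05-03\n")

    # The encoding is applied when the file is opened (built-in open is an
    # assumed stand-in for fsspec here), so read_csv gets already-decoded text
    # and needs no encoding argument of its own.
    with open(path, "r", encoding="ISO-8859-1", newline="") as handle:
        df = pd.read_csv(handle, sep="|", skiprows=1, header=None, quoting=3)

print(df)
```

Because the handle is opened in text mode, the encoding choice lives with the open call rather than with load_args, which is the split the fs_args suggestion above is aiming for.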

mzjp2 on 28 May 2020

I have had no problems running this bit of code locally on my machine in a Jupyter Notebook without kedro:
a_raw = pd.read_csv(
    "C:\\Users\\filepath\\raw_data.txt",
    sep="|",
    skiprows=1,
    header=None,
    names=None,
    encoding="ISO-8859-1",
    quoting=3,
)

For context, the underlying data from the text file looks something like this:
DATA|123|2020-05-03|

As you can see, it's pipe separated. The skiprows: 1 above is used to skip past a leading row in the file (I suspect that this is some sort of logging entry about when the data was pulled, which isn't germane to the analysis I'm trying to do).

Furthermore, passing 'header=None' into pandas is a valid argument. Yet, every time I try to pass that argument from the catalog via kedro, I get an error. In fact, it's the only way to specify that you have no header (according to the documentation). Without 'header=None', the first row of data ends up becoming the names of the columns which isn't right.

EDIT: I tried the fs_args bit of the code above and it didn't fix the problem.

Burn1n9m4n on 29 May 2020

@Burn1n9m4n

It seems YAML does not treat "None" as Python's None.
Instead, use "null", "Null", "NULL", or simply leave it empty like this:

raw_data:
  type: pandas.CSVDataSet
  filepath: data/a_raw/raw_data.txt
  load_args:
    sep: "|"
    skiprows: 1
    header:
    names:
    encoding: "ISO-8859-1"
    quoting: 3

Reference: https://yaml.org/type/null.html
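
The difference can be checked outside kedro. The following is a minimal sketch, assuming PyYAML and pandas are installed and that the catalog is parsed with standard YAML rules, as PyYAML's safe_load does. The two YAML snippets and the single data row are made up for illustration and are not the reporter's actual file.

```python
import io

import pandas as pd
import yaml  # PyYAML

# Hypothetical variant 1: "None" written out, as in the original catalog entry.
with_none = yaml.safe_load("""
load_args:
  sep: "|"
  header: None
""")["load_args"]

# Hypothetical variant 2: value left empty, or spelled null.
with_null = yaml.safe_load("""
load_args:
  sep: "|"
  header:
  names: null
""")["load_args"]

print(repr(with_none["header"]))  # a string, not Python's None
print(repr(with_null["header"]))  # Python's None

# One made-up pipe-separated row modelled on the sample in this thread.
sample = "DATA|123|2020-05-03\n"

# pandas rejects the string because header must be None, an int, or a list of ints.
try:
    pd.read_csv(io.StringIO(sample), **with_none)
    string_none_failed = False
except ValueError as err:
    string_none_failed = True
    print(f"ValueError: {err}")

# A real None tells pandas there is no header row, so columns get integer labels.
df = pd.read_csv(io.StringIO(sample), **with_null)
print(list(df.columns))
```

The same check applies to any load_args key whose intended value is Python's None: printing the parsed mapping with repr shows at once whether YAML produced None or the string 'None'.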

Minyus on 29 May 2020

👍4

@Minyus That fixed it! I just left them blank and that resolved the issue. The output now looks as it should:

    0         1            2
0   DATA     123   2020-05-03

Thanks!

Burn1n9m4n on 29 May 2020

Good catch Minyus, that completely slipped by me! I'll go ahead and close this issue. Thank you for raising it @Burn1n9m4n :)

mzjp2 on 29 May 2020

Great catch @Minyus! I tend to get confused by YAML as well. Once you remember that YAML is a superset of JSON, where the empty value is spelled null rather than None, this issue makes a lot of sense, for me at least.
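
The JSON side of that comparison can be shown with the standard library alone. This is a small sketch with made-up input strings: JSON only knows null, and a bare None is not valid JSON at all. YAML is more lenient here, since it accepts an unquoted None but reads it as an ordinary string.

```python
import json

# JSON's null maps to Python's None.
parsed = json.loads('{"header": null, "names": null}')
print(parsed)

# A bare None token is rejected by the JSON parser.
try:
    json.loads('{"header": None}')
    bare_none_rejected = False
except json.JSONDecodeError as err:
    bare_none_rejected = True
    print(f"JSONDecodeError: {err}")
```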