Reading a text file from the catalog with CSVDataSet causes the first line of data to become the column header. When I try to use 'header: None', which is a valid argument, I get an error.
I'm basically trying to tell pandas that there is no header and that there are no names so that all the columns default to just integer labels.
Passing that list of arguments as load_args shouldn't cause an error. What should result is a dataframe with integers as column names and the raw data separated by '|' characters. I have done all this before using a local jupyter notebook.
I get an error with the 'header: None' parameter specifically.
ValueError: header must be integer or list of integers
DataSetError: Failed while loading data from data set CSVDataSet(filepath=C:\Users\<username>\Desktop\kedro_tutorial\data\a_raw\raw_data.txt, load_args={'encoding': ISO-8859-1, 'header': None, 'names': None, 'quoting': 3, 'sep': |, 'skiprows': 1}, protocol=file, save_args={'index': False}).
header must be integer or list of integers
Include as many relevant details about the environment in which you experienced the bug:
Kedro version (pip show kedro or kedro -V): 0.16.1
Python version (python -V): Python 3.7.7

Can you provide a minimal working example with an example a.txt file? I (at least) can't reproduce this.
From what I can see, kedro is just bubbling up pandas's error. Do you get the same thing if you run pd.read_csv("data/a_raw/a.txt", header=None, names=None, ...)?
Note that the dataset uses fsspec under the hood and passes a file descriptor to pd.read_csv rather than the path to the file itself. So it _may_ make sense to pass the encoding parameter through fs_args, under open_args_load (I don't know whether the encoding is causing the data to look malformed, so this may or may not be worth a try), like so:
raw_data:
  type: pandas.CSVDataSet
  filepath: data/a_raw/raw_data.txt
  load_args:
    sep: "|"
    skiprows: 1
    header: None
    names: None
    quoting: 3
  fs_args:
    open_args_load:
      encoding: "ISO-8859-1"
I have had no problems running this bit of code locally on my machine in a Jupyter Notebook without kedro:
a_raw = pd.read_csv(
    "C:\\Users\\filepath\\raw_data.txt",
    sep="|",
    skiprows=1,
    header=None,
    names=None,
    encoding="ISO-8859-1",
    quoting=3,
)
For context, the underlying data from the text file looks something like this:
DATA|123|2020-05-03|
As you can see, it's pipe separated. The skiprows: 1 above skips a leading row in the file (I suspect it is some sort of logging entry recording when the data was pulled, which isn't germane to the analysis I'm trying to do).
Furthermore, 'header=None' is a valid argument to pandas. In fact, according to the documentation, it's the only way to specify that the file has no header. Yet every time I try to pass that argument from the catalog via kedro, I get an error. Without 'header=None', the first row of data ends up as the column names, which isn't right.
EDIT: I tried the fs_args bit of the code above and it didn't fix the problem.
@Burn1n9m4n
It seems YAML does not treat "None" as Python's None.
Instead, use "null", "Null", "NULL", or simply leave it empty like this:
raw_data:
  type: pandas.CSVDataSet
  filepath: data/a_raw/raw_data.txt
  load_args:
    sep: "|"
    skiprows: 1
    header:
    names:
    encoding: "ISO-8859-1"
    quoting: 3
Reference: https://yaml.org/type/null.html
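To illustrate the difference, here is a minimal sketch that parses two catalog fragments directly with PyYAML. It assumes PyYAML is installed and that it parses these scalars the same way kedro's config loading does; the variable names are made up for the example.

```python
import yaml  # PyYAML, assumed to be installed

# Fragment using the Python-style spelling, as in the original catalog entry.
catalog_with_none = """
load_args:
  header: None
  names: None
"""

# Fragment using YAML's own null spellings: an empty value and the word null.
catalog_with_null = """
load_args:
  header:
  names: null
"""

parsed_none = yaml.safe_load(catalog_with_none)["load_args"]
parsed_null = yaml.safe_load(catalog_with_null)["load_args"]

# "None" is not a YAML null, so it arrives as the plain string "None".
print(repr(parsed_none["header"]))
# An empty value or null becomes Python's None.
print(repr(parsed_null["header"]), repr(parsed_null["names"]))
```

With the first fragment, pandas receives the string "None" for header, which matches the "header must be integer or list of integers" error reported above.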
@Minyus That fixed it! I just left them blank and that resolved the issue. The output now looks as it should:
      0    1           2
0  DATA  123  2020-05-03
Thanks!
Good catch Minyus, that completely slipped by me! I'll go ahead and close this issue. Thank you for raising it @Burn1n9m4n :)
Great catch @Minyus! I tend to get confused with yaml as well. When you take a minute to remember that yaml is a superset of JSON, for me at least, this issue makes a lot of sense.
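One way to read the JSON comparison, sketched with only the standard library: JSON has no None literal either, and its null is what maps to Python's None. The load_args strings below are illustrative and not taken from a real catalog.

```python
import json

# JSON spells "no value" as null, which Python's json module maps to None.
load_args = json.loads('{"header": null, "names": null, "skiprows": 1}')
print(load_args["header"] is None)

# A quoted "None" is just a four-character string in JSON, as it is in YAML.
quoted = json.loads('{"header": "None"}')
print(repr(quoted["header"]))
```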