Kedro: Versioning datasets in catalog.yml with flexible naming

Created on 14 May 2020 · 2 Comments · Source: quantumblacklabs/kedro

Description

I recently read this page about versioning my data sets in the data catalog yaml file: https://kedro.readthedocs.io/en/latest/04_user_guide/04_data_catalog.html#versioning-datasets-and-ml-models.

Consider the following versioned dataset defined in the catalog.yml:

cars.csv:
  type: pandas.CSVDataSet
  filepath: data/01_raw/company/cars.csv
  versioned: True

The DataCatalog will create a versioned CSVDataSet called cars.csv. The actual CSV file location will look like data/01_raw/company/cars.csv/<version>/cars.csv, where <version> corresponds to a global save version string formatted as YYYY-MM-DDThh.mm.ss.sssZ.
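
To make the documented layout concrete, here is a minimal sketch in plain Python that reproduces the path pattern described above. It does not call Kedro; the function name `default_versioned_path` and the sample version string are illustrative assumptions.

```python
from pathlib import PurePosixPath


def default_versioned_path(filepath: str, version: str) -> str:
    """Mimic the documented layout: <filepath>/<version>/<filename>.

    Hypothetical helper for illustration only, not part of Kedro.
    """
    path = PurePosixPath(filepath)
    # The catalog filepath becomes a folder, and the file name is repeated
    # once more underneath the version folder.
    return str(path / version / path.name)


# Illustrative version string in the YYYY-MM-DDThh.mm.ss.sssZ format.
version = "2020-05-14T10.30.00.000Z"
print(default_versioned_path("data/01_raw/company/cars.csv", version))
```

This shows why "cars.csv" appears twice in the final location: once as the folder derived from the catalog filepath and once as the actual file name.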

It seems that the name of the folder before the <version> folder must match the file name, which in this case is "cars.csv". Is it possible to get rid of the "cars.csv" folder that precedes the <version> folder through the catalog YAML file?

Context

Let's say I want to upload my pandas data as a CSV file to an S3 bucket:

my_data_1:
  type: pandas.CSVDataSet
  filepath: s3://mypath/my_data_1.csv
  versioned: True

And I want the actual CSV file location to be s3://mypath/<version>/my_data_1.csv instead of s3://mypath/my_data_1.csv/<version>/my_data_1.csv.
Basically, I want to group my data sets by version.
For example, when I have another data set that I want to upload to the same S3 bucket:

my_data_2:
  type: pandas.CSVDataSet
  filepath: s3://mypath/my_data_2.csv
  versioned: True

I want my two data sets, my_data_1 and my_data_2, to be located under the same version directory: s3://mypath/<version>/.
In general, a versioned file path like data/01_raw/company/cars.csv/<version>/cars.csv is not very friendly to read, because most of my teammates prefer the version to be the highest grouping level when we run the Kedro pipeline.
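
The layout requested here can be sketched in plain Python as follows. This only illustrates the desired paths and is not something Kedro produces; the helper name `version_first_path` and the sample version string are assumptions made for the example.

```python
def version_first_path(filepath: str, version: str) -> str:
    """Build <parent>/<version>/<filename> from a catalog filepath.

    Hypothetical helper for illustration only, not part of Kedro.
    Plain string handling is used so the "s3://" prefix stays intact.
    """
    parent, _, filename = filepath.rpartition("/")
    if not parent:
        # No directory component: the version folder becomes the top level.
        return f"{version}/{filename}"
    return f"{parent}/{version}/{filename}"


# Illustrative version string shared by both data sets in one run.
version = "2020-05-14T10.30.00.000Z"
for filepath in ("s3://mypath/my_data_1.csv", "s3://mypath/my_data_2.csv"):
    print(version_first_path(filepath, version))
```

With a shared version string, both files end up under the same s3://mypath/<version>/ directory, which is the grouping described above.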

Questions

Is there a way to remove the `.csv` folder before the <version> folder? Or can we define a flexible file path in the catalog YAML?

Thank you so much!
Really enjoy using Kedro!

All 2 comments

Hi @zhangchi1 , it's great to hear you're using Kedro!
The current versioning behaviour actually follows the Spark notation: it models exactly what Spark does under the hood when writing a file to multiple partitions, which is why we prefer the current implementation.
In fairness, you could create a custom dataset that overrides a number of methods in AbstractVersionedDataSet, starting with _get_versioned_path(). However, we don't recommend it: it would require quite a bit of rewriting and would likely increase complexity.
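
As a rough illustration of the override idea, the sketch below uses a stand-in base class rather than Kedro itself. `StandInVersionedDataSet` and `VersionFirstDataSet` are hypothetical names, and the stand-in only assumes that the base class exposes a `_get_versioned_path()` hook producing the layout described in the question. A real custom dataset would subclass Kedro's own classes and, as noted above, would probably need further methods overridden to stay consistent.

```python
from pathlib import PurePosixPath


class StandInVersionedDataSet:
    """Hypothetical stand-in, NOT Kedro's AbstractVersionedDataSet."""

    def __init__(self, filepath: str) -> None:
        self._filepath = PurePosixPath(filepath)

    def _get_versioned_path(self, version: str) -> PurePosixPath:
        # Layout described in the question: <filepath>/<version>/<filename>
        return self._filepath / version / self._filepath.name


class VersionFirstDataSet(StandInVersionedDataSet):
    """Hypothetical subclass that puts the version folder above the file."""

    def _get_versioned_path(self, version: str) -> PurePosixPath:
        # Alternative layout: <parent>/<version>/<filename>
        return self._filepath.parent / version / self._filepath.name


# Illustrative version string; local path taken from the question.
version = "2020-05-14T10.30.00.000Z"
dataset = VersionFirstDataSet("data/01_raw/company/cars.csv")
print(dataset._get_versioned_path(version))
```

Only the path construction changes here. Anything in the real class that relies on the original layout would have to be adapted in the same way, which is the extra effort the answer warns about.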

I am closing this as answered, but please feel free to re-open if there are further concerns, or ask a question on Stack Overflow, which might be better suited for this purpose. 😊
