I recently read this page about versioning my data sets in the data catalog yaml file: https://kedro.readthedocs.io/en/latest/04_user_guide/04_data_catalog.html#versioning-datasets-and-ml-models.
Consider the following versioned dataset defined in the catalog.yml:
cars.csv:
  type: pandas.CSVDataSet
  filepath: data/01_raw/company/cars.csv
  versioned: True
The DataCatalog will create a versioned CSVDataSet called cars.csv. The actual csv file location will look like data/01_raw/company/cars.csv/<version>/cars.csv, where <version> corresponds to a global save version string formatted as YYYY-MM-DDThh.mm.ss.sssZ.
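To make the quoted layout concrete, here is a minimal sketch in plain Python that reproduces the path shape described above. It does not call Kedro itself; the helper names `make_version` and `default_versioned_path` are hypothetical, and the timestamp formatting is only an assumption based on the YYYY-MM-DDThh.mm.ss.sssZ pattern quoted from the docs.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def make_version(now: datetime) -> str:
    """Format a timestamp as YYYY-MM-DDThh.mm.ss.sssZ (hypothetical helper)."""
    stamp = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H.%M.%S.%f")
    # %f yields microseconds (6 digits); drop the last 3 to keep milliseconds.
    return stamp[:-3] + "Z"


def default_versioned_path(filepath: str, version: str) -> PurePosixPath:
    """Mimic the documented layout: <filepath>/<version>/<filename>."""
    path = PurePosixPath(filepath)
    return path / version / path.name


version = make_version(datetime(2020, 1, 2, 3, 4, 5, 678000, tzinfo=timezone.utc))
print(default_versioned_path("data/01_raw/company/cars.csv", version))
```

Note that the file name appears twice in the result: once as a directory and once as the actual file inside the version folder.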
It seems like the folder name before the
Let's say I want to upload my pandas data as a CSV file to an S3 bucket:
my_data_1:
  type: pandas.CSVDataSet
  filepath: s3://mypath/my_data_1.csv
  versioned: True
And I want the actual CSV file location to be s3://mypath/<version>/my_data_1.csv instead of s3://mypath/my_data_1.csv/<version>/my_data_1.csv.
Basically, I want to group my data sets by versions.
For example, say I have another data set that I want to upload to the same S3 bucket:
my_data_2:
  type: pandas.CSVDataSet
  filepath: s3://mypath/my_data_2.csv
  versioned: True
I want my two data sets, my_data_1 and my_data_2, to be located under the same version directory: s3://mypath/<version>/.
In general, a versioned file path like data/01_raw/company/cars.csv/<version>/cars.csv is not very easy to read, and most of my teammates prefer the version to be the highest grouping level when we run the Kedro pipeline.
Is there a way to remove the `.csv` folder before the <version> folder?
Thank you so much!
Really enjoy using Kedro!
Hi @zhangchi1, it's great to hear you're using Kedro!
The current versioning behaviour actually follows the Spark notation - it's modelling exactly what Spark does under the hood when writing a file to multiple partitions, which is why we prefer the current implementation.
In fairness, you could create a custom dataset that overrides several methods in AbstractVersionedDataSet, starting with _get_versioned_path(). However, we don't recommend it: it would require quite a bit of rewriting effort and would likely increase complexity.
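For anyone weighing that option, here is a rough sketch of the path logic such an override would need to produce. It is plain Python rather than a working Kedro dataset; `grouped_versioned_path` is a hypothetical helper, and splitting on `://` is an assumption made so that the `s3://` prefix survives the path manipulation. A real custom dataset would also have to change how existing versions are discovered on load, which is not covered here.

```python
from pathlib import PurePosixPath


def grouped_versioned_path(filepath: str, version: str) -> str:
    """Place the version directory above the file: <parent>/<version>/<filename>.

    Hypothetical helper illustrating the layout requested in this issue;
    it is not part of Kedro.
    """
    # Keep any protocol prefix (e.g. "s3") apart, because PurePosixPath
    # would collapse the double slash in "s3://".
    protocol, sep, path = filepath.rpartition("://")
    p = PurePosixPath(path)
    return f"{protocol}{sep}{p.parent / version / p.name}"


version = "2020-01-02T03.04.05.678Z"
print(grouped_versioned_path("s3://mypath/my_data_1.csv", version))
print(grouped_versioned_path("s3://mypath/my_data_2.csv", version))
```

With the same version string, both data sets end up under the same s3://mypath/<version>/ directory, which is the grouping asked for above.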
I am closing this as answered, but please feel free to re-open if there are further concerns, or ask a question on Stack Overflow, which might be better suited for this purpose. 😊