Kedro: [KED-1619] catalog.yaml configuration reference

Created on 2 May 2020 · 10 Comments · Source: quantumblacklabs/kedro

I'd like to understand which entries in catalog.yaml files are valid. Searching the kedro docs got me to The Data Catalog. However, that page is an example, not really a reference. I was looking for something like GitLab's .gitlab-ci.yml configuration reference.

Hi @fkromer, thank you for your question.
Here is the list of datasets you can register in catalog.yml under the type key: https://kedro.readthedocs.io/en/latest/kedro.extras.datasets.html

If you have any specific questions, feel free to ask us here or on Stack Overflow with the kedro tag.

I can see the information is a bit scattered in the docs. We will have a look at the GitHub docs and improve the structure of ours.

Hi @921kiyo, thanks for the hint. I was also interested in which keys (such as type) are valid for the different file_formats. E.g. https://github.com/quantumblacklabs/kedro-examples/blob/master/kedro-exercises/spaceflight/conf/base/catalog.yml :

weather:
  type: spark.SparkDataSet
  filepath: s3a://your_bucket/data/01_raw/weather*
  file_format: csv
  credentials: dev_s3
  load_args:
    header: True
    inferSchema: True
  save_args:
    sep: '|'
    header: True

scooters:
  type: pandas.SQLTableDataSet
  credentials: scooters_credentials
  table_name: scooters
  load_args:
    index_col: ['name']
    columns: ['name', 'gear']
  save_args:
    if_exists: 'replace'
    # if_exists: 'fail'
    # if_exists: 'append'

I totally agree with you that we can improve the docs. I've logged a ticket and we will address it :)
While some keys (type, filepath, version) are common to all datasets, some of the arguments (load_args, file_format) are dataset-specific. The easiest way to find out which keys are valid is to look up the dataset you are interested in at https://kedro.readthedocs.io/en/latest/kedro.extras.datasets.html and check its constructor arguments (e.g. https://kedro.readthedocs.io/en/latest/kedro.extras.datasets.pandas.HDFDataSet.html#kedro.extras.datasets.pandas.HDFDataSet).
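
As a minimal sketch of that approach, the constructor arguments of a dataset class can also be listed with Python's standard inspect module. The ExampleDataSet class below is a hypothetical stand-in (it is not a real kedro class) so that the snippet runs without kedro installed; with kedro available you would presumably pass the real class instead, for instance the HDFDataSet linked above. The helper name constructor_keys is also made up for illustration.

```python
import inspect


class ExampleDataSet:
    """Hypothetical stand-in for a kedro dataset class (not part of kedro)."""

    def __init__(self, filepath, key, load_args=None, save_args=None):
        self.filepath = filepath
        self.key = key
        self.load_args = load_args or {}
        self.save_args = save_args or {}


def constructor_keys(dataset_class):
    """Split the constructor parameters into required and optional names."""
    required, optional = [], []
    params = inspect.signature(dataset_class.__init__).parameters
    for name, param in params.items():
        # Skip `self` and catch-all *args / **kwargs parameters.
        if name == "self" or param.kind in (
            inspect.Parameter.VAR_POSITIONAL,
            inspect.Parameter.VAR_KEYWORD,
        ):
            continue
        if param.default is inspect.Parameter.empty:
            required.append(name)
        else:
            optional.append(name)
    return required, optional


required, optional = constructor_keys(ExampleDataSet)
print("required:", required)
print("optional:", optional)
```

The parameter names reported this way are candidates for keys in the catalog entry, next to type, which selects the class itself.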

@921kiyo Your hints are totally sufficient for me. Thx. No need to extend the docs from my side. :)

Hi @fkromer, what would you think about a jsonschema that helped you validate your config file and autocomplete it (showing you what the required and optional fields are) in your IDE (that hopefully has support for jsonschema)?

@mzjp2 I thought about using strictyaml to validate correctness of yaml files. I already used it a couple of times and it works very well.

I believe this issue should be resolved with the jsonschema now provided within the kedro repository here: https://github.com/quantumblacklabs/kedro/tree/develop/static/jsonschema

The schemas can be read standalone (they are autogenerated, so they can be a little tedious to read) or integrated into your editor of choice. The latest version of the docs explains how to integrate them with VSCode and PyCharm; the instructions for setting them up with any other editor should be similar.
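
To illustrate the kind of check such a schema performs (required and optional fields per dataset type), here is a hand-rolled sketch in plain Python. It is not the jsonschema library and it does not use the schema shipped with kedro. The ENTRY_RULES table and the check_entry helper are assumptions made up for illustration; the key names are borrowed from the scooters entry earlier in this thread, and which of them count as required is likewise assumed.

```python
# Hypothetical rules for one dataset type: which keys are required or optional.
# These are illustrative assumptions, not taken from kedro's real schema.
ENTRY_RULES = {
    "pandas.SQLTableDataSet": {
        "required": {"type", "table_name", "credentials"},
        "optional": {"load_args", "save_args"},
    },
}


def check_entry(name, entry):
    """Return a list of human-readable problems for one catalog entry."""
    rules = ENTRY_RULES.get(entry.get("type"))
    if rules is None:
        return [f"{name}: unknown or missing type {entry.get('type')!r}"]
    keys = set(entry)
    allowed = rules["required"] | rules["optional"]
    problems = [
        f"{name}: missing required key {key!r}"
        for key in sorted(rules["required"] - keys)
    ]
    problems += [
        f"{name}: unexpected key {key!r}" for key in sorted(keys - allowed)
    ]
    return problems


# The scooters entry from the catalog above, written as a Python dict.
scooters = {
    "type": "pandas.SQLTableDataSet",
    "credentials": "scooters_credentials",
    "table_name": "scooters",
    "save_args": {"if_exists": "replace"},
}
print(check_entry("scooters", scooters))
print(check_entry("broken", {"type": "pandas.SQLTableDataSet", "filepath": "x.csv"}))
```

An editor with JSON Schema support does the equivalent while you type, which is what gives you autocompletion and inline warnings.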

Setup in VSCode was super easy and they work so good, thanks for this feature!

Glad to hear! 🎉 Just a quick note that the link in the VSCode settings.json snippet is wrong: it has an extra -schema. You can get the right link by opening the static/jsonschema folder itself on GitHub. I have a PR open to fix this, just waiting on an approval, and it should be merged in a couple of hours! 🙌

Edit: fixed!

Closing this based on the above comments. Please feel free to re-open or open a new issue if something crops up :)
