Kedro: Question: how to access credentials inside a node?

Created on 20 Oct 2020  ·  7 Comments  ·  Source: quantumblacklabs/kedro

What are you trying to do?

I am trying to perform ETL operations inside a node. To do this, I need access to database credentials.

Workaround

As a workaround, I can dynamically add a SQLAlchemy engine to the DataCatalog in the register_data_catalog hook, but I don't think the DataCatalog should be used in this way.

        catalog.add_feed_dict(
            {"mssql_engine": MemoryDataSet(self._mssql_engine, copy_mode='assign')}
        )
Question

Most helpful comment

I have a similar use case to the one mentioned by @bensdm. I also could not find a built-in way of getting this data from credentials.yml. For now I simply wrap my credential loading in a node and pass the resulting dict as an input wherever it is needed. Something along the lines of:

```yaml
# credentials.yml
# (...)
app:
    client_id: abc
    client_secret: xyz
```

```python
# nodes.py
import yaml

# (...)


def get_app_credentials():
    # Read the "app" section straight from the local credentials file.
    with open("./conf/local/credentials.yml") as cred:
        cred_dict = yaml.safe_load(cred).get("app")
    return cred_dict


def authenticate_user(credentials):
    # (...)
    return something
```

```python
# pipeline.py
from kedro.pipeline import Pipeline, node

# Assumes nodes.py sits next to this file in the same package.
from .nodes import authenticate_user, get_app_credentials

# (...)
def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                func=get_app_credentials,
                inputs=None,
                outputs="credentials",
            ),
            node(
                func=authenticate_user,
                inputs="credentials",
                outputs="something",
            ),
            # (...)
        ]
    )
```

Any thoughts or comments on whether this is an appropriate work-around or not are very much welcome!

All 7 comments

Hi @mnowotnik, thank you for your question. Could you explain a bit more why your ETL operations, which interface with a DB, can't use a dataset?

I'm interested in the answer too. What if I need credentials to use, say, a Google API inside a node? How can I pass them as a param?

> Hi @mnowotnik, thank you for your question. Could you explain a bit more why your ETL operations, which interface with a DB, can't use a dataset?

Thanks for taking an interest in my concern, @limdauto.
I need, among other things, to run a procedure in a MySQL database to prepare data before it can be loaded as a dataset. Since Kedro does not shy away from both read and write operations on external sources, I assume my use case is not entirely outside Kedro's intended use.

Moreover, I want to execute this operation specifically within a node, rather than in, e.g., the _load method of a custom dataset implementation, so that I can leverage the hook mechanism for easy task tracking.

I believe the proper way to do this is to implement or extend a custom DataSet, such as APIDataSet, but I agree with the OP that there are other use cases, like the one @bensdm mentioned above, or managing a session. Not to mention, it is just more convenient.

You could also load the credentials inside the ProjectHooks and then set them as environment variables.
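A minimal sketch of the environment-variable idea, assuming the credentials have already been loaded into a nested dict shaped like the credentials.yml example in this thread. The helper names `export_credentials_to_env` and `read_app_credentials` and the `KEDRO_CRED` prefix are hypothetical, not part of Kedro; in a real project the first helper would be called from a hook.

```python
import os
from typing import Dict, Mapping


def export_credentials_to_env(
    credentials: Mapping[str, Mapping[str, str]],
    prefix: str = "KEDRO_CRED",  # hypothetical prefix, chosen for this sketch
) -> Dict[str, str]:
    """Flatten {"app": {"client_id": ...}} into PREFIX_APP_CLIENT_ID variables."""
    exported = {}
    for group, values in credentials.items():
        for key, value in values.items():
            name = f"{prefix}_{group}_{key}".upper()
            os.environ[name] = str(value)  # visible to this process and its children
            exported[name] = str(value)
    return exported


def read_app_credentials(prefix: str = "KEDRO_CRED") -> Dict[str, str]:
    """Node-side helper: rebuild the "app" credentials from the environment."""
    return {
        "client_id": os.environ[f"{prefix}_APP_CLIENT_ID"],
        "client_secret": os.environ[f"{prefix}_APP_CLIENT_SECRET"],
    }


# In a real project the dict would come from the config loader inside a hook;
# it is hard-coded here only to keep the sketch runnable.
exported = export_credentials_to_env(
    {"app": {"client_id": "abc", "client_secret": "xyz"}}
)
app_credentials = read_app_credentials()
```

Keep in mind that environment variables are inherited by child processes, so this trades the hard-coded file path for a wider exposure of the secret.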

@bensdm

> To run a procedure in a MySQL database to prepare data before it can be loaded as a dataset

I think the idiomatic way is to have a node called prepare_data and execute the procedure in MySQL through a dataset. The node can return a status code as its output for task tracking purposes, or some custom string, such as the output table name, that you want to use in the next node.
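One possible reading of that suggestion, as a rough sketch only: `ProcedureDataSet` is a hypothetical stand-in written in plain Python (a real one would subclass Kedro's dataset base class and take its credentials from the catalog), and SQLite is used in place of MySQL so the example runs without a server. The table name `prepared_orders` is made up for illustration.

```python
import sqlite3
from typing import Callable


class ProcedureDataSet:
    """Hypothetical stand-in for a custom dataset (not Kedro's base class)."""

    def __init__(self, connection: sqlite3.Connection, statement: str) -> None:
        self._connection = connection
        self._statement = statement

    def load(self) -> Callable[[], None]:
        # Return a callable so the node decides when the procedure runs.
        def run_procedure() -> None:
            self._connection.execute(self._statement)
            self._connection.commit()

        return run_procedure


def prepare_data(run_procedure: Callable[[], None]) -> str:
    """Node: run the preparation step, then report the table it produced."""
    run_procedure()
    return "prepared_orders"


# SQLite has no stored procedures, so a CREATE TABLE statement stands in for one.
connection = sqlite3.connect(":memory:")
dataset = ProcedureDataSet(connection, "CREATE TABLE prepared_orders AS SELECT 1 AS id")
table_name = prepare_data(dataset.load())
row_count = connection.execute(f"SELECT COUNT(*) FROM {table_name}").fetchone()[0]
```

Because the work happens inside `prepare_data`, node-level hooks still see it, and the returned table name can feed the next node as an ordinary input.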

Having said that, if you still want to access credentials from credentials.yml in a node, you would generally need a ConfigLoader instance, unless you are happy to hard-code the path to the credentials file as in @atgmello's workaround. In Kedro 0.17, you can retrieve the current Kedro session with get_current_session and get the config_loader instance from its context:

    from kedro.framework.session import get_current_session

    session = get_current_session()
    context = session.load_context()
    credentials = context._get_config_credentials()
    # or: credentials = context.config_loader.get("credentials*", "credentials*/**", "**/credentials*")

But with great power comes great responsibility: coupling your node to the global session is something you should do sparingly.

The last workaround is to use parameters instead of credentials. For example:

$ kedro run --params api_token=<my-api-token>

You can then access it in the node through the params:api_token input. Alternatively, you can inject this value into parameters.yml through an environment variable. I have written a tutorial on injecting environment variables into your configuration: https://kedrozerotohero.com/programming-patterns/how-to-inject-secrets-into-your-kedro-configuration
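As a small illustration of the node side, here is a sketch of a function that receives the token as a plain argument. The function name `build_auth_headers`, the output name `auth_headers`, and the Bearer scheme are assumptions made for this example; only the params:api_token input name comes from the comment above.

```python
from typing import Dict


def build_auth_headers(api_token: str) -> Dict[str, str]:
    """Hypothetical node function: turn the runtime parameter into HTTP headers."""
    if not api_token:
        raise ValueError("api_token parameter is empty")
    # The Bearer scheme is an assumption; use whatever your API expects.
    return {"Authorization": f"Bearer {api_token}"}


# In pipeline.py this would be wired roughly as:
#   node(func=build_auth_headers, inputs="params:api_token", outputs="auth_headers")
# Called directly here so the sketch runs without Kedro installed.
headers = build_auth_headers("my-api-token")
```

The node stays free of any knowledge about where the token came from, which keeps it easy to test.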

Hope this helps!

Since there are a number of alternatives to accomplish what you are after, I will close this issue but please feel free to re-open it if you need more support.
