Drake: Database connection as dependency of a target?

Created on 4 Nov 2019  路  2Comments  路  Source: ropensci/drake

Hi there, is it possible to define a table through a database connection as dependency within a target? I could not find information on that an the following did not work for me (maybe obviously):

... preprocess = my_preprocessing(file_in(DBI::dbReadTable(con, "my_table"))) ...

The goal would be to track changes in the database.
Thank you for any information on that.

question

Most helpful comment

As dependencies, connection objects are brittle because they are transient and session-specific (related: #345 and #1038). file_in() is not designed to work here because it [requires you to insert the literal file path]((https://ropenscilabs.github.io/drake-manual/plans.html#limitations-of-file-tracking). for the general case. I recommend trigger(condition = TRUE) to make sure you always get the latest data.

plan <- drake_plan(
  data = target(
     get_data_from_db("my_table"), # define yourself
     trigger = trigger(condition = TRUE) # always runs
  ),
  preprocess = my_preprocessing(data) # runs when the data change
)

To avoid duplicating data, you may be able to track a file hash or a modification time instead. It depends on your particular use case. If you mention a connection object in a command, be sure to ignore() it because we expect it to change from session to session anyway.

library(DBI)
con <-  dbConnect(...)

plan <- drake_plan(
  data = target(
     dbReadTable(ignore(con), "my_table"), # Do not rebuild data if con changes.
     trigger = trigger(change = somehow_get_db_timestamp()) # Probably varies case by case.
  ),
  preprocess = my_preprocessing(data) # runs when the data change
)

I am not sure if is a universal and inexpensive way to tell when a database changes. (@edgararuiz and @krlmlr, do you know?) If it exists, I will think about storing and tracking connection objects properly. Until then, I think these workarounds are the best we can do.

All 2 comments

As dependencies, connection objects are brittle because they are transient and session-specific (related: #345 and #1038). file_in() is not designed to work here because it [requires you to insert the literal file path]((https://ropenscilabs.github.io/drake-manual/plans.html#limitations-of-file-tracking). for the general case. I recommend trigger(condition = TRUE) to make sure you always get the latest data.

plan <- drake_plan(
  data = target(
     get_data_from_db("my_table"), # define yourself
     trigger = trigger(condition = TRUE) # always runs
  ),
  preprocess = my_preprocessing(data) # runs when the data change
)

To avoid duplicating data, you may be able to track a file hash or a modification time instead. It depends on your particular use case. If you mention a connection object in a command, be sure to ignore() it because we expect it to change from session to session anyway.

library(DBI)
con <-  dbConnect(...)

plan <- drake_plan(
  data = target(
     dbReadTable(ignore(con), "my_table"), # Do not rebuild data if con changes.
     trigger = trigger(change = somehow_get_db_timestamp()) # Probably varies case by case.
  ),
  preprocess = my_preprocessing(data) # runs when the data change
)

I am not sure if is a universal and inexpensive way to tell when a database changes. (@edgararuiz and @krlmlr, do you know?) If it exists, I will think about storing and tracking connection objects properly. Until then, I think these workarounds are the best we can do.

Was this page helpful?
0 / 5 - 0 ratings