Hi there, is it possible to define a table through a database connection as dependency within a target? I could not find information on that an the following did not work for me (maybe obviously):
... preprocess = my_preprocessing(file_in(DBI::dbReadTable(con, "my_table"))) ...
The goal would be to track changes in the database.
Thank you for any information on that.
As dependencies, connection objects are brittle because they are transient and session-specific (related: #345 and #1038). file_in() is not designed to work here because it [requires you to insert the literal file path]((https://ropenscilabs.github.io/drake-manual/plans.html#limitations-of-file-tracking). for the general case. I recommend trigger(condition = TRUE) to make sure you always get the latest data.
plan <- drake_plan(
data = target(
get_data_from_db("my_table"), # define yourself
trigger = trigger(condition = TRUE) # always runs
),
preprocess = my_preprocessing(data) # runs when the data change
)
To avoid duplicating data, you may be able to track a file hash or a modification time instead. It depends on your particular use case. If you mention a connection object in a command, be sure to ignore() it because we expect it to change from session to session anyway.
library(DBI)
con <- dbConnect(...)
plan <- drake_plan(
data = target(
dbReadTable(ignore(con), "my_table"), # Do not rebuild data if con changes.
trigger = trigger(change = somehow_get_db_timestamp()) # Probably varies case by case.
),
preprocess = my_preprocessing(data) # runs when the data change
)
I am not sure if is a universal and inexpensive way to tell when a database changes. (@edgararuiz and @krlmlr, do you know?) If it exists, I will think about storing and tracking connection objects properly. Until then, I think these workarounds are the best we can do.
Most helpful comment
As dependencies, connection objects are brittle because they are transient and session-specific (related: #345 and #1038).
file_in()is not designed to work here because it [requires you to insert the literal file path]((https://ropenscilabs.github.io/drake-manual/plans.html#limitations-of-file-tracking). for the general case. I recommendtrigger(condition = TRUE)to make sure you always get the latest data.To avoid duplicating data, you may be able to track a file hash or a modification time instead. It depends on your particular use case. If you mention a connection object in a command, be sure to
ignore()it because we expect it to change from session to session anyway.I am not sure if is a universal and inexpensive way to tell when a database changes. (@edgararuiz and @krlmlr, do you know?) If it exists, I will think about storing and tracking connection objects properly. Until then, I think these workarounds are the best we can do.