Drake: Best way to include SQL code in drake?

Created on 20 Sep 2018  路  6Comments  路  Source: ropensci/drake

Hi there,

I am looked through the existing issues and haven't found anything that directly relates to my question. Apologies if I have missed something obvious.

I am trying to determine what is the best way to include sql code in a drake workplan and interact with a database. I keep getting the "the use of single-quotes to declare files is deprecated" warning and I am wondering if there is a better approach to what I am currently doing. In this instance I need the single quotes because I have already wrapped the entirety of the sql in double quotes.

Thanks for making such an awesome package! Reprex below:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(drake)
library(nycflights13)


## Make database
con <- DBI::dbConnect(RSQLite::SQLite(), path = ":dbname:")


copy_to(con, nycflights13::flights, "flights",
        temporary = FALSE, 
        indexes = list(
          c("year", "month", "day"), 
          "carrier", 
          "tailnum",
          "dest"
        )
)

## Just sql
plan1 <- drake_plan(strings_in_dots = "literals", 
  flights = tbl(con, sql("SELECT carrier, sched_dep_time FROM flights WHERE (carrier IN ('UA', 'AA'))")) %>% 
    collect(),

  summary = flights %>% 
    group_by(carrier) %>% 
    summarise(mean_dep_time = mean(sched_dep_time))

)

config <- drake_config(plan1)
#> Warning: Files in a command declared with single-quotes:
#>   UA
#>   AA
#> The use of single-quotes to declare files is deprecated. Use file_in(), file_out(), and knitr_in() in your commands. See `?drake_plan` for examples.

## Trying to shoe horn into file_in
plan2 <- drake_plan(strings_in_dots = "literals", 
                    flights = tbl(con, file_in(sql("SELECT carrier, sched_dep_time FROM flights WHERE (carrier IN ('UA', 'AA'))"))) %>% 
                      collect(),

                    summary = flights %>% 
                      group_by(carrier) %>% 
                      summarise(mean_dep_time = mean(sched_dep_time))

)

config <- drake_config(plan2)
#> Warning: Files in a command declared with single-quotes:
#>   UA
#>   AA
#> The use of single-quotes to declare files is deprecated. Use file_in(), file_out(), and knitr_in() in your commands. See `?drake_plan` for examples.

Created on 2018-09-19 by the reprex package (v0.2.1)

tl;dr Make "The use of single-quotes to declare files is deprecated." go away when using sql code directly in R.

use case

All 6 comments

Thanks for the question. I am out of town right now, and I will respond after my return on Oct 2. In the meantime, @AlexAxthelm has done several projects using drake + SQL.

Hi, I've done a few projects along these lines over the past year or two, and I'm working on getting something together (when I eventually get free time) for the drake book

For the short term workaround, the quick solutions that I have used in the past are to either escape those single quotes \', or (my preference) write the query in a separate *.SQL file, and use read_lines() or a similar function to bring in the query as a string object. That way drake doesn't actually see the single quotes, since they aren't part of the plan definition. Using the separate files also allows you to run the query from outside R, which is often useful if you're working with other non-R users.

I hope this helps you. If you want, I can add some more detail, but that probably won't be able to happen until next week.

I really like this idea both here in drake but also as general means of working with SQL and R code at the same time. No hurry but I am curious where in a drake plan you'd add these SQL/string objects as well as where you add the dB connection. So yes if you have more details it would be much apppreciated. 馃檹

Right, I'll make a little repo later as an example, but the dbConnection was definitely the trickiest part to figure out for me. The most success that I've had was defining my connection as a function, so that each parallel drake instance can call the function and set up a new connection, and tear it down when the object is finished building. Also, I use dbplyr a lot rather than building queries by hand. However, my main database is still on SQL Server 2008, so I can't use lag or anything else that dbplyr translates to a window function, so I still write some queries as .sql files.

I don't have a good example right now, but if usually goes something like this:

library("drake")
library("tidyverse")
library("DBI")
library("dbplyr")

dbcon <- function(){
  DBI::dbConnect(<<CONNECTION INFO>>)
}

dbplyr_query <- function(con){
  tbl(con, "mtcars") %>%
    group_by(cyl) %>%
    summarize(mpg = mean(mpg))
}

my_plan <- drake_plan(
  mpgs = dbplyr_query(dbcon()),
  file_text = readLines(file_in("query.SQL")),
  query_set = dbGetQuery(dbcon(), file_text)
  )

Sorry if there are syntax errors in here. I'm not at a machine where I can actually run this against a DB right now.

I also suspect that warn_single_quoted_files() is called incorrectly. I will have to check when I return.

To make sure single quotes do not cause erroneous file detection, you can call pkgconfig::set_config("drake::strings_in_dots" = "literals") before make(). The following (slightly modified) version of your code appears to work just fine.

library(dplyr)
library(drake)
library(nycflights13)

# So you can use single quotes in SQL calls:
pkgconfig::set_config("drake::strings_in_dots" = "literals")

con <- DBI::dbConnect(RSQLite::SQLite(), path = ":dbname:")

copy_to(
  con,
  nycflights13::flights,
  "flights",
  temporary = FALSE,
  indexes = list(
    c("year", "month", "day"),
    "carrier",
    "tailnum",
    "dest"
  )
)

# No need to set strings_in_dots if you used pkgconfig::set_config() above.
plan1 <- drake_plan(
  flights = tbl(
    con,
    sql("SELECT carrier, sched_dep_time FROM flights WHERE (carrier IN ('UA', 'AA'))")
  ) %>%
    collect(),
  summary = flights %>%
    group_by(carrier) %>%
    summarise(mean_dep_time = mean(sched_dep_time))
)

make(plan1)
#> target flights
#> target summary
make(plan1)
#> All targets are already up to date.
readd(summary)
#> # A tibble: 2 x 2
#>   carrier mean_dep_time
#>   <chr>           <dbl>
#> 1 AA              1290.
#> 2 UA              1313.
Was this page helpful?
0 / 5 - 0 ratings

Related issues

AlexAxthelm picture AlexAxthelm  路  8Comments

wlandau picture wlandau  路  9Comments

tiernanmartin picture tiernanmartin  路  3Comments

wlandau picture wlandau  路  4Comments

rsangole picture rsangole  路  7Comments