Drake: Uptake of drake in the community

Created on 23 Mar 2018 · 7 Comments · Source: ropensci/drake

I would like to learn more about who is using drake, including the number of people and the kinds of projects involved. GitHub metrics and CRAN logs help, but I am looking for a more comprehensive picture. Bonus points for publicly-available real-world projects, especially papers that cite or use drake (re: https://github.com/ropensci/onboarding/issues/156#issuecomment-350419112).

help or input

All 7 comments

I can start!

I’m using one of my GIS projects to test-drive drake. The project’s primary dataset is the 2017 tax assessor data for all parcels in King County, Washington (US). The data are used in a suitability analysis that will identify publicly-owned properties where affordable housing and pre-schools can be developed.

I discovered drake while I was investigating other make alternatives and the repo’s README is what convinced me to give it a try. I knew that many steps in my project would be computationally expensive (e.g., multi-step spatial operations run over ~600k polygons) so the “what’s done stays done” benefit appealed to me.

As I have implemented the data development pipeline, I have found that the architecture of a drake project (plan, target, command, etc.) has encouraged me to think more rigorously about each step in my project. I've also noticed that it incentivizes workflows that are more _modular_; for instance, if I want to filter the dataset using a few different parameters, I will record those parameters as a target rather than burying them in a filter() call deep within a pipeline. The combination of rigor and modularity has made it easier to explain the choices I’ve made and to make adjustments based on the feedback I get from the project stakeholders.
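A minimal sketch of the "parameter as a target" idea, for illustration only. The data frame, column names, and target names (`parcels`, `owner_type`, `area_sqft`, `min_area`, `public_parcels`, `suitable`) are hypothetical stand-ins, not the actual project's data or plan; it assumes the drake and dplyr packages are installed.

```r
library(drake)
library(dplyr)

# Hypothetical stand-in for the parcel data; columns are illustrative only.
parcels <- data.frame(
  id = 1:5,
  owner_type = c("public", "private", "public", "public", "private"),
  area_sqft = c(12000, 8000, 4000, 20000, 15000)
)

plan <- drake_plan(
  # The filter parameter is its own target instead of a literal
  # hidden inside the filter() call below.
  min_area = 10000,
  public_parcels = filter(parcels, owner_type == "public"),
  suitable = filter(public_parcels, area_sqft >= min_area)
)

make(plan)       # builds min_area, public_parcels, and suitable
readd(suitable)  # loads the filtered result from the cache
```

With this layout, editing `min_area` in the plan and calling `make(plan)` again should rebuild only `min_area` and `suitable`, while `public_parcels` is left as is because nothing it depends on has changed.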

Having a drake plan also gives me more confidence that my project will be able to handle a major update to the underlying data or the addition of new suitability criteria. I am always extolling the virtues of reproducibility and transparency to my clients, so this added confidence is a meaningful improvement to my ability to “walk the walk”.

I have enjoyed testing the package and watching it mature. I plan to develop future projects with drake (especially spatial ones) and will keep sending feedback along the way!

I have been using drake to manage a large text analysis project that is not currently public but will be. It involves classifying a large corpus of documents using keras models, along with a lot of heavy text downloading and preprocessing using command-line utilities in various languages. Many of the inputs, particularly a large corpus of documents I used to generate training vectors, are big enough that they are slow to download and, on many machines, can't all be held in RAM at the same time. For this reason, my intermediate steps are primarily files. Most of the computationally intense work is already parallelized, so drake parallelism isn't used much. I'm using drake as I build up the steps of the project so I can avoid re-running slow parts and preprocessing as I develop later parts of the pipeline, and then I will want to publish the project as a fully reproducible analysis.

I am starting to use drake to manage one of my PhD projects, which consists of testing and comparing several different models for species distribution modelling. Each model has to run on different datasets and with different sets of parameters, which means a lot of models and several days to run them.

I started the project without drake, but drake really helps me get a clearer picture of what is getting done and write more concise and modular code. Having used make a bit before to manage an R project, I really appreciate the way everything is done within R!

The code is not publicly available now but will be when it is finished.

I have just started using drake for a chapter of my PhD thesis where I want to give an overview of a couple of statistical methods. It's very much a work in progress, as I only started using it a few hours ago. But my goal is to integrate it into my org-mode/R workflow.

I'm currently using drake on a still-private project that simulates species to assess the efficiency of using species distribution modelling to estimate species richness. I am using it mostly for reproducibility and parallelism (you get a lot of simulations when you have many species, many model types and other parameters to vary). drake handles massive parallelism very nicely. Whenever the manuscript is ready, I'm going to make my drake repo available ;)

I had been using remake for a couple of projects before realizing drake was more actively updated and about to take over from remake.

I'm struggling with deploying drake on a cluster and/or setting up a remote cluster from a local computer when using drake, and I can't say if it's because of the cluster architecture, the difficulty of setting drake up, or simply me not understanding how things are supposed to work.

That's great to hear, @Rekyt. Please feel free to post a new issue about getting set up with your cluster. I would be happy to help.

Usage is spreading, and it's super exciting to see! I no longer think this needs to be an open issue, but I am continuing to learn where and how people are using drake.
