data.table: Resuming work on dtplyr

Created on 13 Jun 2019 · 10 comments · Source: Rdatatable/data.table

(Copied from email to Matt Dowle, as requested)

I just wanted to give you a heads up that I'm going to be working on https://github.com/hadley/dtplyr over the next couple of weeks. The goal is to make data.table a full dplyr backend, in much the same way as dbplyr, and commit to maintaining it in the long run.

Please let me know if there's anything you'd like me to keep in mind when presenting or describing the relationship between dtplyr and data.table.

I have a brief summary of my translation efforts so far at http://dtplyr.tidyverse.org/articles/translation.html — I would love any feedback you might have, particularly if there's places where I could generate more efficient or more idiomatic data.table code.

Still on the todo list:

  • translation for joins
  • figuring out how to translate quosures
  • writing wrappers for eager evaluation (removing existing code)
  • translation for scoped verbs

I think dtplyr is now feature complete. This isn't to say that it's finished (I'm sure there are plenty of small bugs remaining!), but it should now be able to translate a wide range of real-world dplyr code into (mostly) idiomatic data.table code.

Please try it out and let me know what you think!

@hadley I gave it a read and here are some comments:

dt %>% arrange(a, b, c) %>% show_query()
#> `_DT1`[order(a, b, c), ]

I'd say it's more idiomatic to drop the empty j argument; not sure what everyone else thinks.
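
That is, the translation would simply be:

`_DT1`[order(a, b, c)]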

dt %>% rename(x = a, y = b) %>% show_query()
#> setnames(copy(`_DT1`), c("a", "b"), c("x", "y"))

Pandas uses inplace (default False) to _allow_ in-place operations. Any plans to have optional by-reference behavior in dtplyr? I know you prefer copies to be made but of course there are many use cases where doing so can be performance-catastrophic. The option seems like a good compromise. I see immutable = FALSE mentioned in the article but not in the documentation.
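
For reference, a minimal sketch of what opting out of copies looks like via lazy_dt()'s immutable argument as the article describes it (the data and pipeline here are made up for illustration):

library(dplyr)
library(dtplyr)

dt <- data.table::data.table(a = 1:3, b = 4:6)

# with immutable = FALSE, dtplyr is allowed to modify dt by reference
# instead of working on a copy
lazy_dt(dt, immutable = FALSE) %>%
  mutate(c = a + b) %>%
  as.data.frame()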

dt %>% transmute(a2 = a * 2, b2 = b * 2, a4 = a2 * 2) %>% show_query()
#> copy(`_DT1`)[, `:=`(a2 = a * 2, b2 = b * 2)][, .(a2, b2, a4 = a2 * 2)]

A bit surprising to see a copy made for some transmute calls but not others. More idiomatic is to have an extended j expression:

`_DT1`[ , {
  a2 = a*2
  b2 = b*2
  a4 = a2*2
  .(a2 = a2, b2 = b2, a4 = a4)
}]

The primary exception is grouped filter(), which requires the use of .SD ()

Not sure what's supposed to be in ()
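
Whatever was meant to go in the parentheses, a grouped filter does translate naturally via .SD. A hypothetical sketch (columns a and b assumed), keeping each group's rows that exceed the group mean:

# .SD is the Subset of Data for the current group; subsetting it row-wise
# implements a per-group filter
`_DT1`[, .SD[b > mean(b)], by = a]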

dt %>% key_by(a) %>% summarise(b = mean(b)) %>% show_query()
#> setkeyv(copy(`_DT1`), cols = "a")[, .(b = mean(b)), by = .(a)]

The implementation of key_by is strange. keyby in [ sorts only _after_ aggregation; keying before could have a huge performance cost.

Because this does one upfront sort, it should generate more efficient code when performing repeated actions on the same groups.

Upfront sorting should be done with setkey/setindex directly in this case. Perhaps there could be a new argument to arrange?
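
A sketch of the distinction (generic data.table DT, grouping column a):

# keyby sorts (and keys) the aggregated result, not the input:
DT[, .(b = mean(b)), keyby = a]

# versus paying for one upfront physical sort that later grouped
# operations on DT can reuse:
setkey(DT, a)
DT[, .(b = mean(b)), by = a]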

dt %>% distinct(c = a + b, .keep_all = TRUE) %>% show_query()
#> unique(copy(`_DT1`)[, `:=`(c = a + b)], by = "c")

Another case where copy behavior can change depending on the input... it can be avoided like so (with improvement to come in the future from https://github.com/Rdatatable/data.table/issues/1269):

`_DT1`[ , TRUE, by = .(c = a+b)][ , V1 := NULL]

Note that filter() and mutate() can’t be combined because dt[a == 1, .(b2 := b * 2)] modifies the selected rows in place.

However, dtplyr does strive to avoid needless copies, so it won’t explicitly copy if there’s already an implicit copy produced by [, head() or similar:

This is a bit confusing. Reads to me as "you can't combine filter+mutate. You can combine filter+mutate". And mutate above already shows a copy being created... so I'm not sure what the first sentence is getting at.

dt %>% semi_join(dt2, by = "a") %>% show_query()
#> `_DT1`[unique(`_DT1`[`_DT2`, which = TRUE, nomatch = 0L, on = .(a)])]

Please switch to nomatch = NULL. We're in the process of migrating the old nomatch behavior to allow for a more generalized nomatch: https://github.com/Rdatatable/data.table/issues/857
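
With that change, the generated semi-join would read:

`_DT1`[unique(`_DT1`[`_DT2`, which = TRUE, nomatch = NULL, on = .(a)])]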


I also think it would be helpful to point to data.table's own vignettes as a further reference, e.g. for .SD or setkey.

Thanks @MichaelChirico — that's very useful! A few questions below:

Pandas uses inplace (default False) to allow in-place operations. Any plans to have optional by-reference behavior in dtplyr?

Do you mean for individual verbs? My sense was no, because using immutable = FALSE should give you an adequate level of control. Where else did you expect to see it mentioned in the documentation?

A bit surprising to see a copy made for some transmute calls but not others. More idiomatic is to have an extended j expression:

Ah ok, it should be straightforward to generate that instead. Would you also do the same for a mutate?

`_DT1`[ , {
  a2 = a*2
  b2 = b*2
  a4 = a2*2
  .(a2 := a2, b2 := b2, a4 := a4)
}]

The implementation of key_by is strange. keyby in [ sorts only after aggregation; keying before could have a huge performance cost.

This is because I find it hard to understand exactly how you're supposed to do grouping. I'll change it to just use keyby instead of by at the first aggregation.

Another case where copy behavior can change depending on the input... it can be avoided like so

`_DT1`[ , TRUE, by = .(c = a+b)][ , V1 := NULL]

I don't see how you do this safely in general. What if there's already an existing V1 variable?

I also think it would be helpful to point to data.table's own vignettes as a further reference, e.g. for .SD or setkey.

Will do.

Where else did you expect to see it mentioned in the documentation?

At a glance I didn't see it anywhere in the documentation besides the reference in the article (I checked the website for mutate and transmute).

Would you also do the same for a mutate?

Not quite:

`_DT1`[ , c('a2', 'b2', 'a4') := {
  a2 = a*2
  b2 = b*2
  a4 = a2*2
  .(a2, b2, a4)
}]

I'll change it to just use keyby instead of by at the first aggregation.

Sounds correct. I'm not sure where you're getting tripped up on grouping; do you have an example? When by is present, each j is scoped within a value of the by vector(s) (available as .BY, e.g. .BY$c in the unique example). keyby is essentially just by followed by setkey.
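
A tiny sketch of that scoping (hypothetical table DT):

DT <- data.table::data.table(g = c("a", "a", "b"), x = 1:3)

# j is evaluated once per group; the current group's value is available as .BY
DT[, .(grp = .BY$g, total = sum(x)), by = g]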

I don't see how you do this safely in general

You can make it safer:

`_DT1`[ , .('__temp_var__' = TRUE), by = .(c = a+b)][ , '__temp_var__' := NULL]

But much uglier. If you want to file an issue on the dtplyr tracker, we can reference it for follow-up in #1269

Actually, I just noticed you're using .keep_all = TRUE in that example, so the above is actually wrong; this is much nicer (a and b will be lost; they could be retained with .SDcols = names(`_DT1`)):

`_DT1`[ , .SD, by = .(c = a + b)]

And a prettier alternative to the .keep_all = FALSE case would be:

`_DT1`[ , .(c = unique(a + b))]

  • Ok, I'll work on the suggested translations for compound transmute() + mutate() today.

  • I basically don't get when I'm supposed to use setkey() vs by vs keyby.

  • I'll explore your proposed translations for distinct() — thanks!

We have the keys vignette (https://cran.r-project.org/web/packages/data.table/vignettes/datatable-keys-fast-subset.html), which, given your PR, I assume you've read; it would be helpful if you could expand a bit on what's got you tripped up so we can improve it. There's also the secondary indices vignette (https://cran.r-project.org/web/packages/data.table/vignettes/datatable-secondary-indices-and-auto-indexing.html).

setkey and by/keyby are best thought of as two distinct things.

setkey physically reorders your table. As e.g. @BrodieG has been exploring lately on his blog, known-sorted data offers all sorts of potential for efficient computation: joins can be done with binary search almost instantly, finding indices for grouping is nearly instantaneous, etc. In general, if your data has a natural ordering (e.g. by individual ID or time period), it's a good idea to set it as a key.

by is what signals grouped computation (generally that means aggregation, though having group-level scope can be very handy in more general contexts).

keyby vs by is just about whether you care if the output is _physically_ sorted (for the same reasons as setkey; often, keyby is used when creating a lookup table to join back, e.g. that's what's done in CJ by default to get "balanced panels")
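
A compact illustration of all three (hypothetical table DT):

DT <- data.table::data.table(id = c(3, 1, 2, 1), x = 1:4)

DT[, sum(x), by = id]     # groups appear in order of first occurrence
DT[, sum(x), keyby = id]  # same groups, but the result is sorted and keyed by id
setkey(DT, id)            # physically reorders DT itself
DT[.(2)]                  # keyed subset: binary search for rows where id == 2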

@MichaelChirico I will try and write something about why I don't find those vignettes particularly illuminating, but I'm not sure if it's just something personal with the way my brain works.

Having read your explanation, maybe it would make sense to eliminate key_by() and replace it with an optional key_by argument to lazy_dt() that would setkey() the input data without affecting grouping. To me, that feels like the most natural place to put it because you're making an assertion about some property of the input data.
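
A hypothetical sketch of that interface (key_by as used here is the proposed argument, not an existing one):

# key_by would setkey() the input once, up front; grouping still comes
# from group_by() as usual
lazy_dt(dt, key_by = "a") %>%
  group_by(a) %>%
  summarise(b = mean(b))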

@MichaelChirico maybe my problem is made clearer by your explanation: you explained keys exclusively in terms of performance, whereas I'm more used to thinking about grouping, which is a property of the desired analysis.

Closing this now since my initial work on dtplyr is done and I've advertised it on Twitter — comments in the form of dtplyr issues are still greatly appreciated.
