Dplyr: Consider adding support for `CROSS JOIN`

Created on 28 Jun 2017 · 6Comments · Source: tidyverse/dplyr

CROSS JOIN is unsupported under full_join.

memdb

d1_mem <- memdb_frame(a = 1:5, .name = "d1")
d2_mem <- memdb_frame(b = 1:5, .name = "d2")
d1_mem %>% full_join(d2_mem)

Triggers:

Error: `by` required, because the data sources have no common variables

However,

tbl(src_memdb(), sql("SELECT * FROM d1 CROSS JOIN d2"))

# Source:   SQL [?? x 2]
# Database: sqlite 3.11.1 [:memory:]
       a `a:1`
   <int> <int>
 1     1     1
 2     1     2
 3     1     3
 4     1     4
 5     1     5
 6     2     1
 7     2     2
 8     2     3
 9     2     4
10     2     5
# ... with more rows

sparklyr

library(sparklyr)
library(dplyr)

spark_install(version = "2.1.0")
sc <- spark_connect(master = "local", version = "2.1.0")

d1_tbl <- copy_to(sc, data.frame(a = 1:5), name = "d1", overwrite = TRUE)
d2_tbl <- copy_to(sc, data.frame(b = 1:5), name = "d2", overwrite = TRUE)

d1_tbl %>% full_join(d2_tbl)

Triggers,

Error: `by` required, because the data sources have no common variables

See https://github.com/rstudio/sparklyr/issues/771

feature wip

Source

javierluraschi

Most helpful comment

@ianmcook Yeah, I agree, implementing cross_join() would be a better approach. We could also add a better error in full_join() for this case:

"No overlapping variables to perform full_join(), use cross_join() instead."

However, if we go this route, I think we also need to implement cross_join for DataFrames; would be inconsistent to add a new function that is supported only in database backends.

Is someone willing/able/interested in implementing cross_join for DataFrames already? If not, I might try to give it a shot in a week or two.

javierluraschi on 29 Jun 2017

👍2

All 6 comments

Wouldn't it be better to add a new cross_join() two-table verb instead of implementing this through full_join()? An accidental Cartesian join from failure to properly specify join conditions can return huge amounts of data.

ianmcook on 28 Jun 2017

@ianmcook Yeah, I agree, implementing cross_join() would be a better approach. We could also add a better error in full_join() for this case:

"No overlapping variables to perform full_join(), use cross_join() instead."

However, if we go this route, I think we also need to implement cross_join for DataFrames; would be inconsistent to add a new function that is supported only in database backends.

Is someone willing/able/interested in implementing cross_join for DataFrames already? If not, I might try to give it a shot in a week or two.

javierluraschi on 29 Jun 2017

👍2

I like the idea of a cross_join() verb in addition to the strict full_join verb (it documents intent). Also couldn't dplyr have a nice "in terms of other verbs" default implementation such as the following? Then only backends with additional issues have to override the implementation, and we have the implementation in terms of already trusted verbs.

suppressPackageStartupMessages(library(dplyr))
a <- data.frame(x = 1:3)
b <- data.frame(y= 1:2)

full_join(mutate(a, FAKECOL=1), mutate(b, FAKECOL=1), by='FAKECOL') %>% 
  select(-FAKECOL)
#>   x y
#> 1 1 1
#> 2 1 2
#> 3 2 1
#> 4 2 2
#> 5 3 1
#> 6 3 2

Obviously there are a few more steps to pick a fake join condition column at run time. The method I have been using to generate names is:

setdiff(paste("FC", 
              seq_len(1 + length(c(colnames(a), colnames(b)))), 
              sep = '_'), 
        c(colnames(a), colnames(b)))[[1]]