Dplyr: Consider adding support for `CROSS JOIN`

Created on 28 Jun 2017  路  6Comments  路  Source: tidyverse/dplyr

CROSS JOIN is unsupported under full_join.

memdb

d1_mem <- memdb_frame(a = 1:5, .name = "d1")
d2_mem <- memdb_frame(b = 1:5, .name = "d2")
d1_mem %>% full_join(d2_mem)

Triggers:

Error: `by` required, because the data sources have no common variables

However,

tbl(src_memdb(), sql("SELECT * FROM d1 CROSS JOIN d2"))
# Source:   SQL [?? x 2]
# Database: sqlite 3.11.1 [:memory:]
       a `a:1`
   <int> <int>
 1     1     1
 2     1     2
 3     1     3
 4     1     4
 5     1     5
 6     2     1
 7     2     2
 8     2     3
 9     2     4
10     2     5
# ... with more rows

sparklyr

library(sparklyr)
library(dplyr)

spark_install(version = "2.1.0")
sc <- spark_connect(master = "local", version = "2.1.0")

d1_tbl <- copy_to(sc, data.frame(a = 1:5), name = "d1", overwrite = TRUE)
d2_tbl <- copy_to(sc, data.frame(b = 1:5), name = "d2", overwrite = TRUE)

d1_tbl %>% full_join(d2_tbl)

Triggers,

Error: `by` required, because the data sources have no common variables

See https://github.com/rstudio/sparklyr/issues/771

feature wip

Most helpful comment

@ianmcook Yeah, I agree, implementing cross_join() would be a better approach. We could also add a better error in full_join() for this case:

"No overlapping variables to perform full_join(), use cross_join() instead."

However, if we go this route, I think we also need to implement cross_join for DataFrames; would be inconsistent to add a new function that is supported only in database backends.

Is someone willing/able/interested in implementing cross_join for DataFrames already? If not, I might try to give it a shot in a week or two.

All 6 comments

Wouldn't it be better to add a new cross_join() two-table verb instead of implementing this through full_join()? An accidental Cartesian join from failure to properly specify join conditions can return huge amounts of data.

@ianmcook Yeah, I agree, implementing cross_join() would be a better approach. We could also add a better error in full_join() for this case:

"No overlapping variables to perform full_join(), use cross_join() instead."

However, if we go this route, I think we also need to implement cross_join for DataFrames; would be inconsistent to add a new function that is supported only in database backends.

Is someone willing/able/interested in implementing cross_join for DataFrames already? If not, I might try to give it a shot in a week or two.

I like the idea of a cross_join() verb in addition to the strict full_join verb (it documents intent). Also couldn't dplyr have a nice "in terms of other verbs" default implementation such as the following? Then only backends with additional issues have to override the implementation, and we have the implementation in terms of already trusted verbs.

suppressPackageStartupMessages(library(dplyr))
a <- data.frame(x = 1:3)
b <- data.frame(y= 1:2)

full_join(mutate(a, FAKECOL=1), mutate(b, FAKECOL=1), by='FAKECOL') %>% 
  select(-FAKECOL)
#>   x y
#> 1 1 1
#> 2 1 2
#> 3 2 1
#> 4 2 2
#> 5 3 1
#> 6 3 2

Obviously there are a few more steps to pick a fake join condition column at run time. The method I have been using to generate names is:

setdiff(paste("FC", 
              seq_len(1 + length(c(colnames(a), colnames(b)))), 
              sep = '_'), 
        c(colnames(a), colnames(b)))[[1]]

Part of more general #2240

Will this also work in dplyr or only in dbplyr?

This is just a translation of dplyr's full_join() for SQL queries as CROSS JOIN. You can see the changes here: https://github.com/tidyverse/dbplyr/commit/331646383d7e1ae574b161b6a16a1d1d0582af54

Was this page helpful?
0 / 5 - 0 ratings