dplyr 0.6.0 join problem with CRAN version of sparklyr 0.5.5

Created on 29 May 2017 · 4 Comments · Source: tidyverse/dplyr

The current (5-28-2017) dev version of dplyr, 0.6.0, does not appear to allow joins between tables that share column names when used with the current CRAN version of sparklyr, 0.5.5. If this version of dplyr reaches CRAN before sparklyr also updates on CRAN, production user code will break on a bulk update (such as update.packages()). As a sparklyr user, I suggest treating this as an important dependent package (sparklyr) breaking under the proposed dplyr CRAN update, regardless of the automatic check status of sparklyr 0.5.5.

The problem appears to go away if we move up to the dev version of sparklyr 0.5.5.9000.
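Until both packages are updated together, production code could guard against the affected combination before running joins. The following is a minimal sketch in base R; `join_combo_known_broken` is a hypothetical helper, not part of dplyr or sparklyr, and its thresholds are assumptions taken only from the combinations tested in this issue (dplyr 0.6.0 or later with a sparklyr older than 0.5.5.9000).

```r
# Hypothetical helper (not part of dplyr or sparklyr).
# Assumption: the join failure affects dplyr >= 0.6.0 combined with any
# sparklyr older than 0.5.5.9000; only the versions shown in this issue
# were actually tested.
join_combo_known_broken <- function(dplyr_version, sparklyr_version) {
  dplyr_version <- package_version(as.character(dplyr_version))
  sparklyr_version <- package_version(as.character(sparklyr_version))
  dplyr_version >= "0.6.0" && sparklyr_version < "0.5.5.9000"
}

join_combo_known_broken("0.6.0", "0.5.5")       # TRUE: the failing reprex
join_combo_known_broken("0.6.0", "0.5.5.9000")  # FALSE: the succeeding reprex
```

In a real session the arguments would come from `packageVersion("dplyr")` and `packageVersion("sparklyr")`.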

I am re-filing the issue as I have improved the reprexes, and tested and documented more combinations of package versions. I am re-filing it here as this issue seems relevant to dplyr itself (especially as sparklyr appears to already have a fix that just needs to percolate up to CRAN).

Failing and succeeding reprexes below.

# Failing reprex: dev dplyr 0.6.0 with CRAN sparklyr 0.5.5
# devtools::install_github("tidyverse/dplyr")
# devtools::install_github('tidyverse/dbplyr')
suppressPackageStartupMessages(library('dplyr'))
library('sparklyr')

sc <- spark_connect(version='2.0.2',
                    master = "local")
d1 <- copy_to(sc, data.frame(x=1:3, y=4:6), 'd1')
d2 <- copy_to(sc, data.frame(x=1:3, y=7:9), 'd2')

left_join(d1, d2, by='x')
#> Error: Column `y` must have a unique name

# print versions
packageVersion("dplyr")
#> [1] '0.6.0'
packageVersion("sparklyr")
#> [1] '0.5.5'
if(requireNamespace("dbplyr", quietly = TRUE)) {
  packageVersion("dbplyr")
}
#> [1] '0.0.0.9001'
R.Version()$version.string
#> [1] "R version 3.4.0 (2017-04-21)"

# cleanup 
spark_disconnect(sc)

# Succeeding reprex: same code with the dev version of sparklyr (0.5.5.9000)
# devtools::install_github("tidyverse/dplyr")
# devtools::install_github('tidyverse/dbplyr')
suppressPackageStartupMessages(library('dplyr'))
library('sparklyr')

sc <- spark_connect(version='2.0.2',
                    master = "local")
d1 <- copy_to(sc, data.frame(x=1:3, y=4:6), 'd1')
d2 <- copy_to(sc, data.frame(x=1:3, y=7:9), 'd2')

left_join(d1, d2, by='x')
#> # Source:   lazy query [?? x 3]
#> # Database: spark_connection
#>       x   y.x   y.y
#>   <int> <int> <int>
#> 1     1     4     7
#> 2     2     5     8
#> 3     3     6     9

# print versions
packageVersion("dplyr")
#> [1] '0.6.0'
packageVersion("sparklyr")
#> [1] '0.5.5.9000'
if(requireNamespace("dbplyr", quietly = TRUE)) {
  packageVersion("dbplyr")
}
#> [1] '0.0.0.9001'
R.Version()$version.string
#> [1] "R version 3.4.0 (2017-04-21)"

# cleanup 
spark_disconnect(sc)
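
As a Spark-free point of comparison, the same join can be run on local data frames. This sketch uses base R `merge()` rather than dplyr or sparklyr, so it says nothing about the bug itself; it only shows the suffixed column names (`y.x`, `y.y`) that the succeeding reprex produces for the shared column `y`.

```r
# Local data frames with the same contents as the Spark tables d1 and d2.
d1 <- data.frame(x = 1:3, y = 4:6)
d2 <- data.frame(x = 1:3, y = 7:9)

# all.x = TRUE makes this a left join; the shared non-key column `y`
# is disambiguated with merge()'s default suffixes ".x" and ".y".
joined <- merge(d1, d2, by = "x", all.x = TRUE)

names(joined)
#> [1] "x"   "y.x" "y.y"
```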

All 4 comments

Thanks for reporting this @JohnMount, really appreciated.

The problem here is that, to support joins in the previous release, sparklyr had to override dplyr internals. The fix to avoid using USING is now supported in dplyr itself; however, sparklyr is still overriding those internals, and the internals of dplyr have changed significantly, which causes this problem.

I think the best path here is to push a patch for sparklyr together with the release of dplyr 0.6. Here is the change https://github.com/rstudio/sparklyr/commit/0c39d2e403a66b6ae3ed563a09003cdbb23ffbeb and the CRAN patch: https://github.com/rstudio/sparklyr/releases/tag/v0.5.6

@JohnMount if you could try out this v0.5.6 patch, this would be very helpful to the community and much appreciated! The fix affects JOINS only, but is not scoped to only LEFT JOINS.

@hadley could you ping me on Slack when you submit dplyr 0.6 to CRAN, so that I can submit the sparklyr 0.5.6 patch with it?

Thanks @javierluraschi,

It looks like dplyr 0.7.0 is already up on CRAN and (as expected) doesn't work with the CRAN 0.5.5 version of sparklyr:

suppressPackageStartupMessages(library('dplyr'))
library('sparklyr')

sc <- spark_connect(version='2.0.2', 
                    master = "local")
d1 <- copy_to(sc, data.frame(x=1:3, y=4:6), 'd1')
d2 <- copy_to(sc, data.frame(x=1:3, y=7:9), 'd2')

left_join(d1, d2, by='x')
#> Error: Column `y` must have a unique name


# print versions
packageVersion("dplyr")
#> [1] '0.7.0'

packageVersion("sparklyr")
#> [1] '0.5.5'

if(requireNamespace("dbplyr", quietly = TRUE)) {
  packageVersion("dbplyr")
}
#> [1] '1.0.0'

R.Version()$version.string
#> [1] "R version 3.4.0 (2017-04-21)"

# cleanup 
spark_disconnect(sc)

devtools::install_github("rstudio/sparklyr") appears to work well (but notice that this pulled the version of dbplyr down, from 1.0.0 to 0.0.0.9001):

suppressPackageStartupMessages(library('dplyr'))
library('sparklyr')

sc <- spark_connect(version='2.0.2', 
                    master = "local")
d1 <- copy_to(sc, data.frame(x=1:3, y=4:6), 'd1')
d2 <- copy_to(sc, data.frame(x=1:3, y=7:9), 'd2')

left_join(d1, d2, by='x')
#> # Source:   lazy query [?? x 3]
#> # Database: spark_connection
#>       x   y.x   y.y
#>   <int> <int> <int>
#> 1     1     4     7
#> 2     2     5     8
#> 3     3     6     9


# print versions
packageVersion("dplyr")
#> [1] '0.7.0'

packageVersion("sparklyr")
#> [1] '0.5.5.9002'

if(requireNamespace("dbplyr", quietly = TRUE)) {
  packageVersion("dbplyr")
}
#> [1] '0.0.0.9001'

R.Version()$version.string
#> [1] "R version 3.4.0 (2017-04-21)"

# cleanup 
spark_disconnect(sc)

We can probably ask people to "go to the dev version of sparklyr", but for confidence it would be good to have some assurance that a given tag or branch is stable, and to know exactly which versions of everything are needed. Hopefully CRAN will let you push a sparklyr patch quickly (they do that on occasion if you ask).
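
To record exactly which versions are in play, something like the following could be run alongside each reprex. It is a minimal sketch: `report_versions` is a hypothetical helper, not an existing function, and the commented-out install line assumes the v0.5.6 tag linked above can be installed by reference with devtools.

```r
# Hypothetical helper: returns a named character vector of installed
# versions, with NA for packages that are not installed.
# Note: requireNamespace() loads the namespace of each package it finds.
report_versions <- function(pkgs) {
  vapply(pkgs, function(p) {
    if (requireNamespace(p, quietly = TRUE)) {
      as.character(packageVersion(p))
    } else {
      NA_character_
    }
  }, character(1))
}

report_versions(c("dplyr", "dbplyr", "sparklyr"))

# Assumption: pinning to the patch tag rather than the moving dev branch.
# devtools::install_github("rstudio/sparklyr@v0.5.6")
```

Pinning to a tag keeps the install reproducible, whereas installing the default branch can also change the versions of dependencies such as dbplyr.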

The sparklyr fix is being submitted to CRAN; waiting for a response...

@JohnMount on CRAN now.
