The current (5-28-2017) dev version of dplyr 0.6.0 appears not to allow joins with common column names when used with the current CRAN version of sparklyr 0.5.5. This means that if this version of dplyr becomes current on CRAN before sparklyr also updates on CRAN, production user code will break on a bulk update (such as update.packages()). As a sparklyr user, I would suggest this be treated as an important dependent package (sparklyr) breaking on a proposed dplyr CRAN update (regardless of the automatic check status of sparklyr 0.5.5).
The problem appears to go away if we move up to the dev version of sparklyr, 0.5.5.9000.
I am re-filing the issue as I have improved the reprexes, and tested and documented more combinations of package versions. I am re-filing it here as this issue seems relevant to dplyr itself (especially as sparklyr appears to already have a fix that just needs to percolate up to CRAN).
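Until the fix reaches CRAN, a guard along the following lines could flag the risky combination before a bulk update. This is only a minimal sketch: `join_combo_is_risky` is a hypothetical helper (not part of dplyr or sparklyr), and the version bounds are assumptions taken solely from the versions tested in this issue.

```r
# Hypothetical helper (not part of dplyr or sparklyr): flags the version
# combination reported as failing in this issue.
# Assumption: the breakage starts at dplyr 0.6.0 and affects sparklyr up to
# and including 0.5.5; these bounds come only from the versions tested here.
join_combo_is_risky <- function(dplyr_ver, sparklyr_ver) {
  package_version(dplyr_ver) >= package_version("0.6.0") &&
    package_version(sparklyr_ver) <= package_version("0.5.5")
}

# Usage against the installed packages (commented out so the sketch runs
# without dplyr or sparklyr installed):
# if (join_combo_is_risky(as.character(packageVersion("dplyr")),
#                         as.character(packageVersion("sparklyr")))) {
#   warning("dplyr/sparklyr combination may fail on joins with common column names")
# }
```

The helper takes version strings rather than reading the installed packages directly, so the same check can be run against versions listed in a deployment manifest.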
Failing and succeeding reprexes below.
# devtools::install_github("tidyverse/dplyr")
# devtools::install_github('tidyverse/dbplyr')
suppressPackageStartupMessages(library('dplyr'))
library('sparklyr')
sc <- spark_connect(version='2.0.2',
                    master = "local")
d1 <- copy_to(sc, data.frame(x=1:3, y=4:6), 'd1')
d2 <- copy_to(sc, data.frame(x=1:3, y=7:9), 'd2')
left_join(d1, d2, by='x')
#> Error: Column `y` must have a unique name
# print versions
packageVersion("dplyr")
#> [1] '0.6.0'
packageVersion("sparklyr")
#> [1] '0.5.5'
if(requireNamespace("dbplyr", quietly = TRUE)) {
  packageVersion("dbplyr")
}
#> [1] '0.0.0.9001'
R.Version()$version.string
#> [1] "R version 3.4.0 (2017-04-21)"
# cleanup
spark_disconnect(sc)
# devtools::install_github("tidyverse/dplyr")
# devtools::install_github('tidyverse/dbplyr')
suppressPackageStartupMessages(library('dplyr'))
library('sparklyr')
sc <- spark_connect(version='2.0.2',
                    master = "local")
d1 <- copy_to(sc, data.frame(x=1:3, y=4:6), 'd1')
d2 <- copy_to(sc, data.frame(x=1:3, y=7:9), 'd2')
left_join(d1, d2, by='x')
#> # Source:   lazy query [?? x 3]
#> # Database: spark_connection
#>       x   y.x   y.y
#>   <int> <int> <int>
#> 1     1     4     7
#> 2     2     5     8
#> 3     3     6     9
# print versions
packageVersion("dplyr")
#> [1] '0.6.0'
packageVersion("sparklyr")
#> [1] '0.5.5.9000'
if(requireNamespace("dbplyr", quietly = TRUE)) {
  packageVersion("dbplyr")
}
#> [1] '0.0.0.9001'
R.Version()$version.string
#> [1] "R version 3.4.0 (2017-04-21)"
# cleanup
spark_disconnect(sc)
Thanks for reporting this @JohnMount, really appreciated.
The problem here is that, in order to support joins, sparklyr had to override dplyr internals in the previous release. The fix to avoid using USING is now supported in dplyr; however, sparklyr is still overriding the internals, and the internals of dplyr have changed significantly, causing this problem.
I think the best path here is to push a patch for sparklyr together with the release of dplyr 0.6. Here is the change: https://github.com/rstudio/sparklyr/commit/0c39d2e403a66b6ae3ed563a09003cdbb23ffbeb and the CRAN patch: https://github.com/rstudio/sparklyr/releases/tag/v0.5.6
@JohnMount if you could try out this v0.5.6 patch, this would be very helpful to the community and much appreciated! The fix affects JOINS only, but it is not limited to LEFT JOINS.
@hadley could you ping me on Slack when you submit dplyr 0.6 to CRAN, so that I can submit the sparklyr 0.5.6 patch with it?
Thanks @javierluraschi,
It looks like dplyr 0.7.0 is already up on CRAN and (as expected) doesn't work with the CRAN 0.5.5 version of sparklyr:
suppressPackageStartupMessages(library('dplyr'))
library('sparklyr')
sc <- spark_connect(version='2.0.2',
                    master = "local")
d1 <- copy_to(sc, data.frame(x=1:3, y=4:6), 'd1')
d2 <- copy_to(sc, data.frame(x=1:3, y=7:9), 'd2')
left_join(d1, d2, by='x')
#> Error: Column `y` must have a unique name
# print versions
packageVersion("dplyr")
#> [1] '0.7.0'
packageVersion("sparklyr")
#> [1] '0.5.5'
if(requireNamespace("dbplyr", quietly = TRUE)) {
  packageVersion("dbplyr")
}
#> [1] '1.0.0'
R.Version()$version.string
#> [1] "R version 3.4.0 (2017-04-21)"
# cleanup
spark_disconnect(sc)
Installing the dev version with devtools::install_github("rstudio/sparklyr") appears to work well (but notice this pulled the version of dbplyr down):
suppressPackageStartupMessages(library('dplyr'))
library('sparklyr')
sc <- spark_connect(version='2.0.2',
                    master = "local")
d1 <- copy_to(sc, data.frame(x=1:3, y=4:6), 'd1')
d2 <- copy_to(sc, data.frame(x=1:3, y=7:9), 'd2')
left_join(d1, d2, by='x')
#> # Source:   lazy query [?? x 3]
#> # Database: spark_connection
#>       x   y.x   y.y
#>   <int> <int> <int>
#> 1     1     4     7
#> 2     2     5     8
#> 3     3     6     9
# print versions
packageVersion("dplyr")
#> [1] '0.7.0'
packageVersion("sparklyr")
#> [1] '0.5.5.9002'
if(requireNamespace("dbplyr", quietly = TRUE)) {
  packageVersion("dbplyr")
}
#> [1] '0.0.0.9001'
R.Version()$version.string
#> [1] "R version 3.4.0 (2017-04-21)"
# cleanup
spark_disconnect(sc)
We can probably ask people to "go to the dev version of sparklyr", but for confidence it would be good to have some assurance that a given tag or branch is stable, and to know exactly what versions of everything are needed. Hopefully CRAN will let you push a sparklyr patch quickly (they do that on occasion if you ask).
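To make "exactly what versions" easier to record alongside a reprex, something like the following could be run after each install. It is a rough sketch only; `report_versions` is a hypothetical helper name, and it uses nothing beyond base R.

```r
# Hypothetical helper: returns the installed version of each package as a
# character string, or NA when the package is not installed.
report_versions <- function(pkgs) {
  vapply(pkgs, function(pkg) {
    if (requireNamespace(pkg, quietly = TRUE)) {
      as.character(packageVersion(pkg))
    } else {
      NA_character_
    }
  }, character(1))
}

# Example call for the packages discussed in this issue (commented out so
# the sketch does not require them to be installed):
# report_versions(c("dplyr", "dbplyr", "sparklyr"))
```

The result is a named character vector, so a known-good combination can be saved and compared with a later install using identical().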
The sparklyr fix is being submitted to CRAN; waiting for a response...
@JohnMount on CRAN now.