sdf_bind_rows is filling in NaNs in the following example. My guess is that part of the cause is that year is of type int in one table and of type dbl in the other. R users would expect type promotion at this point. But notice that the count column is also NaN-ed out (the input legitimately did have one NaN, but notice how we lost all the other values).
Reprex below.
Also I am seeing a lot of warnings of the form:
Warning message:
Translator is missing window functions:
count, n_distinct
Is there some way to force sparklyr to re-install its local Spark versions? I am running on MacOS.
suppressPackageStartupMessages(library("dplyr"))
library("sparklyr")
packageVersion("dplyr")
#> [1] '0.7.1.9000'
packageVersion("sparklyr")
#> [1] '0.5.6.9008'
my_db <- sparklyr::spark_connect(version='2.0.2',
master = "local")
a <- sparklyr::spark_read_parquet(my_db, 'a', '~/data/a')
b <- sparklyr::spark_read_parquet(my_db, 'b', '~/data/b')
print(a)
#> # Source: table<a> [?? x 3]
#> # Database: spark_connection
#> year count name
#> <dbl> <dbl> <chr>
#> 1 2005 6 a
#> 2 2007 1 b
#> 3 2010 NaN c
print(b, n=100)
#> # Source: table<b> [?? x 3]
#> # Database: spark_connection
#> year name count
#> <int> <chr> <dbl>
#> 1 2006 a 0
#> 2 2007 a 0
#> 3 2008 a 0
#> 4 2009 a 0
#> 5 2010 a 0
#> 6 2005 b 0
#> 7 2006 b 0
#> 8 2008 b 0
#> 9 2009 b 0
#> 10 2010 b 0
#> 11 2005 c 0
#> 12 2006 c 0
#> 13 2007 c 0
#> 14 2008 c 0
#> 15 2009 c 0
#> 16 2005 d 0
#> 17 2006 d 0
#> 18 2007 d 0
#> 19 2008 d 0
#> 20 2009 d 0
#> 21 2010 d 0
v <- sparklyr::sdf_bind_rows(list(a,b))
print(v, n=1000)
#> # Source: table<sparklyr_tmp_be3445305142> [?? x 3]
#> # Database: spark_connection
#> year count name
#> <dbl> <dbl> <chr>
#> 1 NaN NaN a
#> 2 NaN NaN b
#> 3 NaN NaN c
#> 4 NaN NaN a
#> 5 NaN NaN a
#> 6 NaN NaN a
#> 7 NaN NaN a
#> 8 NaN NaN a
#> 9 NaN NaN b
#> 10 NaN NaN b
#> 11 NaN NaN b
#> 12 NaN NaN b
#> 13 NaN NaN b
#> 14 NaN NaN c
#> 15 NaN NaN c
#> 16 NaN NaN c
#> 17 NaN NaN c
#> 18 NaN NaN c
#> 19 NaN NaN d
#> 20 NaN NaN d
#> 21 NaN NaN d
#> 22 NaN NaN d
#> 23 NaN NaN d
#> 24 NaN NaN d
Attaching zip file of data.
data.zip
Looks like a bug. I'll look into it. For the time being, if your tables have the same column names, rbind() seems to be OK.
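Until a fix lands, one possible workaround is to give both tables the same column order and column types before binding. The sketch below is only an illustration: align_schema is a hypothetical helper (not part of sparklyr), and it is demonstrated on small local data frames that stand in for the Spark tables a and b from the report. Whether the same approach avoids the NaNs on the Spark side has not been verified here.

```r
library(dplyr)

# Hypothetical helper (not part of sparklyr): cast `year` to double and
# put the columns in a fixed order so both inputs share one schema.
align_schema <- function(tbl) {
  tbl %>%
    mutate(year = as.numeric(year)) %>%
    select(year, count, name)
}

# Local stand-ins for the Spark tables `a` and `b` from the report.
a_local <- data.frame(year = c(2005, 2007, 2010),
                      count = c(6, 1, NaN),
                      name = c("a", "b", "c"),
                      stringsAsFactors = FALSE)
b_local <- data.frame(year = c(2006L, 2007L),
                      name = c("a", "a"),
                      count = c(0, 0),
                      stringsAsFactors = FALSE)

# After alignment both inputs have columns (year, count, name),
# with `year` stored as double in each.
combined <- bind_rows(align_schema(a_local), align_schema(b_local))
print(combined)

# With the Spark tables, the analogous (untested) call would be:
# v <- sparklyr::sdf_bind_rows(align_schema(a), align_schema(b))
```

Because the helper uses only mutate and select, it should apply to Spark tables as well as local data frames, but that is an assumption rather than something confirmed in this thread.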
BTW the latest devel should have the dbplyr warnings fixed
Fix looks good to me. Thank you very much!