sdf_bind_rows is filling in NaNs in the following example. My guess is that part of the problem is that year is of type int in one table and of type dbl in the other; R users would expect type promotion at this point. But notice that the count column is also NaN-ed out (the input legitimately did have one NaN, but notice how we lost all values). Reprex below.
Also I am seeing a lot of warnings of the form:
Warning message:
Translator is missing window functions:
count, n_distinct
Is there some way to force sparklyr to re-install its local Spark installations? I am running on macOS.
suppressPackageStartupMessages(library("dplyr"))
library("sparklyr")
packageVersion("dplyr")
#> [1] '0.7.1.9000'
packageVersion("sparklyr")
#> [1] '0.5.6.9008'
my_db <- sparklyr::spark_connect(version = '2.0.2',
                                 master = "local")
a <- sparklyr::spark_read_parquet(my_db, 'a', '~/data/a')
b <- sparklyr::spark_read_parquet(my_db, 'b', '~/data/b')
print(a)
#> # Source: table<a> [?? x 3]
#> # Database: spark_connection
#> year count name
#> <dbl> <dbl> <chr>
#> 1 2005 6 a
#> 2 2007 1 b
#> 3 2010 NaN c
print(b, n=100)
#> # Source: table<b> [?? x 3]
#> # Database: spark_connection
#> year name count
#> <int> <chr> <dbl>
#> 1 2006 a 0
#> 2 2007 a 0
#> 3 2008 a 0
#> 4 2009 a 0
#> 5 2010 a 0
#> 6 2005 b 0
#> 7 2006 b 0
#> 8 2008 b 0
#> 9 2009 b 0
#> 10 2010 b 0
#> 11 2005 c 0
#> 12 2006 c 0
#> 13 2007 c 0
#> 14 2008 c 0
#> 15 2009 c 0
#> 16 2005 d 0
#> 17 2006 d 0
#> 18 2007 d 0
#> 19 2008 d 0
#> 20 2009 d 0
#> 21 2010 d 0
v <- sparklyr::sdf_bind_rows(list(a,b))
print(v, n=1000)
#> # Source: table<sparklyr_tmp_be3445305142> [?? x 3]
#> # Database: spark_connection
#> year count name
#> <dbl> <dbl> <chr>
#> 1 NaN NaN a
#> 2 NaN NaN b
#> 3 NaN NaN c
#> 4 NaN NaN a
#> 5 NaN NaN a
#> 6 NaN NaN a
#> 7 NaN NaN a
#> 8 NaN NaN a
#> 9 NaN NaN b
#> 10 NaN NaN b
#> 11 NaN NaN b
#> 12 NaN NaN b
#> 13 NaN NaN b
#> 14 NaN NaN c
#> 15 NaN NaN c
#> 16 NaN NaN c
#> 17 NaN NaN c
#> 18 NaN NaN c
#> 19 NaN NaN d
#> 20 NaN NaN d
#> 21 NaN NaN d
#> 22 NaN NaN d
#> 23 NaN NaN d
#> 24 NaN NaN d
Attaching zip file of data.
data.zip
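For reference, the type promotion mentioned above can be seen with plain local data frames. This is only a sketch of the expected result: a_local and b_local are hypothetical stand-ins for the Spark tables a and b, and dplyr::bind_rows() is used here as the local analogue of sdf_bind_rows(), not as a claim about how sparklyr implements it.

```r
library(dplyr)

# Hypothetical local stand-ins for the Spark tables `a` and `b`:
# `year` is double in one and integer in the other, and the column
# order differs between the two.
a_local <- data.frame(
  year  = c(2005, 2007, 2010),
  count = c(6, 1, NaN),
  name  = c("a", "b", "c"),
  stringsAsFactors = FALSE
)
b_local <- data.frame(
  year  = c(2006L, 2007L),
  name  = c("a", "a"),
  count = c(0, 0),
  stringsAsFactors = FALSE
)

# bind_rows() matches columns by name and promotes integer to double.
v_local <- bind_rows(a_local, b_local)

# Inspect the resulting column types and values.
str(v_local)
```

With local data the year values survive and only the one NaN that was already in count remains, which is the behavior I would expect from the Spark version as well.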
Looks like a bug. I'll look into it. For the time being, if your tables have the same column names, rbind() seems to be OK.
BTW, the latest devel should have the dbplyr warnings fixed.
Fix looks good to me. Thank you very much!