Sparklyr: sdf_bind_rows filling in NaNs

Created on 29 Jun 2017 · 4 Comments · Source: sparklyr/sparklyr

sdf_bind_rows is filling in NaNs in the following example. My guess is that part of the problem is that year is of type int in one table and of type dbl in the other; R users would expect type promotion at this point. But notice that the count column is also NaN-ed out (the input legitimately did have one NaN, but notice how we lost all the values).

reprex below.

Also I am seeing a lot of warnings of the form:

Warning message:
Translator is missing window functions:
count, n_distinct

Is there some way to force sparklyr to re-install its local Spark installations? I am running on macOS.

suppressPackageStartupMessages(library("dplyr"))
library("sparklyr")
packageVersion("dplyr")
#> [1] '0.7.1.9000'
packageVersion("sparklyr")
#> [1] '0.5.6.9008'

my_db <- sparklyr::spark_connect(version='2.0.2', 
                                 master = "local")

a <- sparklyr::spark_read_parquet(my_db, 'a', '~/data/a')
b <- sparklyr::spark_read_parquet(my_db, 'b', '~/data/b')

print(a)
#> # Source:   table<a> [?? x 3]
#> # Database: spark_connection
#>    year count  name
#>   <dbl> <dbl> <chr>
#> 1  2005     6     a
#> 2  2007     1     b
#> 3  2010   NaN     c

print(b, n=100)
#> # Source:   table<b> [?? x 3]
#> # Database: spark_connection
#>     year  name count
#>    <int> <chr> <dbl>
#>  1  2006     a     0
#>  2  2007     a     0
#>  3  2008     a     0
#>  4  2009     a     0
#>  5  2010     a     0
#>  6  2005     b     0
#>  7  2006     b     0
#>  8  2008     b     0
#>  9  2009     b     0
#> 10  2010     b     0
#> 11  2005     c     0
#> 12  2006     c     0
#> 13  2007     c     0
#> 14  2008     c     0
#> 15  2009     c     0
#> 16  2005     d     0
#> 17  2006     d     0
#> 18  2007     d     0
#> 19  2008     d     0
#> 20  2009     d     0
#> 21  2010     d     0

v <- sparklyr::sdf_bind_rows(list(a,b))

print(v, n=1000)
#> # Source:   table<sparklyr_tmp_be3445305142> [?? x 3]
#> # Database: spark_connection
#>     year count  name
#>    <dbl> <dbl> <chr>
#>  1   NaN   NaN     a
#>  2   NaN   NaN     b
#>  3   NaN   NaN     c
#>  4   NaN   NaN     a
#>  5   NaN   NaN     a
#>  6   NaN   NaN     a
#>  7   NaN   NaN     a
#>  8   NaN   NaN     a
#>  9   NaN   NaN     b
#> 10   NaN   NaN     b
#> 11   NaN   NaN     b
#> 12   NaN   NaN     b
#> 13   NaN   NaN     b
#> 14   NaN   NaN     c
#> 15   NaN   NaN     c
#> 16   NaN   NaN     c
#> 17   NaN   NaN     c
#> 18   NaN   NaN     c
#> 19   NaN   NaN     d
#> 20   NaN   NaN     d
#> 21   NaN   NaN     d
#> 22   NaN   NaN     d
#> 23   NaN   NaN     d
#> 24   NaN   NaN     d

All 4 comments

Attaching zip file of data.
data.zip

Looks like a bug. I'll look into it. For the time being, if your tables have the same column names, rbind() seems to be OK.
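
In the meantime, a possible workaround is to give both tables the same column order and column types before binding. The sketch below assumes the NaNs come from the mismatched schemas (year is int in one table and dbl in the other, and the column order differs), which has not been confirmed in this thread. `align_for_bind` is a hypothetical helper, not part of sparklyr, and it has not been tested against the affected sparklyr version. The local tibbles stand in for the Spark tables so the snippet runs without a Spark connection.

```r
library(dplyr)

# Hypothetical helper (not part of sparklyr): cast the numeric columns to
# double and put the columns in one fixed order, so both inputs share a schema.
align_for_bind <- function(tbl) {
  tbl %>%
    mutate(year = as.numeric(year), count = as.numeric(count)) %>%
    select(year, count, name)
}

# With the Spark tables from the reprex, the call would be:
#   v <- sparklyr::sdf_bind_rows(align_for_bind(a), align_for_bind(b))

# Local stand-ins for a and b (assumed shapes, copied from the printed output).
a_local <- tibble(year = c(2005, 2007, 2010),
                  count = c(6, 1, NaN),
                  name = c("a", "b", "c"))
b_local <- tibble(year = c(2006L, 2007L),
                  name = c("a", "a"),
                  count = c(0, 0))

# Both inputs now have columns year <dbl>, count <dbl>, name <chr>.
v_local <- bind_rows(align_for_bind(a_local), align_for_bind(b_local))
print(v_local)
```

If the aligned tables still come back as NaN from sdf_bind_rows, the schema mismatch is not the cause, and updating to a build that contains the fix is the better option.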

BTW the latest devel should have the dbplyr warnings fixed

Fix looks good to me. Thank you very much!
