Dplyr: Row combinations are missing parentheses

Created on 28 Nov 2016  路  7Comments  路  Source: tidyverse/dplyr

I am seeing a lot of different (execution path dependent) behaviors of dplyr::union_all and dplyr::union using various database backends (SQLite is just an easy example). In one case called union_all on two single row remote frames and got 3 rows of results (second frame duplicated). This happened with or without calling dplyr::compute() to clean up the frames (they were results of dplyr calculations including head()) before the union_all call. I can reproduce this, but only as a complicated calculation. An easier to reproduce issue is the following (which does go away if you call dplyr::compute, but I am hoping it points to the problem or related problems): an exception from dplyr::union_all.

library('dplyr')
 #  
 #  Attaching package: 'dplyr'
 #  The following objects are masked from 'package:stats':
 #  
 #      filter, lag
 #  The following objects are masked from 'package:base':
 #  
 #      intersect, setdiff, setequal, union
packageVersion('dplyr')
 #  [1] '0.5.0'
my_db <- dplyr::src_sqlite("replyr_sqliteEx.sqlite3", create = TRUE)
dr <- dplyr::copy_to(my_db,
                     data.frame(x=c(1,2),y=c('a','b'),stringsAsFactors = FALSE),'dr',
                     overwrite=TRUE)
dr <- head(dr,1)
# dr <- compute(dr)
print(dr)
 #  Source:   query [?? x 2]
 #  Database: sqlite 3.8.6 [replyr_sqliteEx.sqlite3]
 #  
 #        x     y
 #    <dbl> <chr>
 #  1     1     a
print(dplyr::union_all(dr,dr))
 #  Source:   query [?? x 2]
 #  Database: sqlite 3.8.6 [replyr_sqliteEx.sqlite3]
 #  Error in sqliteSendQuery(conn, statement): error in statement: LIMIT clause should come after UNION ALL not before

I am having a lot of issues with union_all and Spark2.0.0 also.

More details here.

bug

All 7 comments

Thanks. The behavior you show is faulty, it would be great if you could provide a reprex for the row duplication you're seeing, too -- perhaps in a new issue.

I did not keep enough notes and can not reproduce the row duplication issue (and I am wondering if it was on PostgreSQL or Spark2.0.0 as I was testing both). I think the row-duplication may have been related to "unknown column types" killing my join condition and therefore actually correct behavior (when you insert a string constant without saying "as.character()"). So it may not be evidence of a separate bug, just some ugly data that caused my join condition to fail without me noticing. If it pops back up I'll log it.

Simpler reprex:

library('dplyr', warn.conflicts = FALSE)
mf <- memdb_frame(x = 1:2)
m1 <- head(mf, 1)

# Both should return 2 rows
mf1 <- mf %>% union(m1)
mf1
#> Source:   query [?? x 1]
#> Database: sqlite 3.11.1 [:memory:]
#> 
#>       x
#>   <int>
#> 1     1
mf1 %>% show_query()
#> <SQL>
#> SELECT *
#> FROM `idtfllmtan`
#> UNION
#> SELECT *
#> FROM (SELECT *
#> FROM `idtfllmtan`)
#> LIMIT 1

m1f <- m1 %>% union(mf)
m1f
#> Source:   query [?? x 1]
#> Database: sqlite 3.11.1 [:memory:]
#> Error in rsqlite_send_query(conn@ptr, statement): LIMIT clause should come after UNION not before
m1f %>% show_query()
#> <SQL>
#> SELECT *
#> FROM (SELECT *
#> FROM `idtfllmtan`)
#> LIMIT 1
#> UNION
#> SELECT *
#> FROM `idtfllmtan`

The problem is that the limit statement is put in the wrong place.

It looks like with SQLite that limit can only be applied to the compound select statement, not the individual select statements. I think the best we can do here is give a good error message.

But it looks like this is ok in postgresql and mysql providing that you add the missing parens. Might be ok here to simply add the extra needed parens and fall back to the database if the form isn't supported (like for SQLite)

I can fix PostgreSQL and MySQL by adding parentheses, but this will break SQLite for the simple unions that currently work.

I don't think it's possible to make SQLite correctly and elegantly error with the current design. sql_set_op() dispatches on the the connection, but only receives two SQL strings as input. sql_build.op_set_op() receives richer data structures, but doesn't dispatch on the connection.

For now, I'll add a manual check in sql_build.op_set_op(). I won't consider a more general design until we have more connection-specific sql generation/checking needs.

Put it in add_op_set_op() for now, since that catches it during creation rather than sql generation.

@JohnMount I'm reasonably certain this should fix the weirdness you've been seeing. Please open a new issue if you find other problems.

Was this page helpful?
0 / 5 - 0 ratings