Dplyr: collect by default loads only 100,000 rows

Created on 27 Jun 2016 · 12 Comments · Source: tidyverse/dplyr

Hi, version 0.5 loads only 100,000 rows when you call collect() on a tbl_sql. It is quite annoying, as it affects lots of old code. Is this intentional?

All 12 comments

I think this was always the intent - so that you had to deliberately opt in to download a huge amount of data, not do it by accident. You can work around it with n = Inf or as.data.frame(). I'll probably reconsider this behaviour.
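The two workarounds mentioned above might look like the following. This is only a sketch: it assumes a recent dplyr with dbplyr, DBI and RSQLite installed, and the in-memory table `big` and its contents are made up purely for illustration.

```r
library(dplyr)
library(DBI)

# Hypothetical example data: an in-memory SQLite table with more than 100,000 rows.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "big", data.frame(id = seq_len(150000)))

remote <- tbl(con, "big")  # lazy reference, nothing is downloaded yet

# Workaround 1: ask collect() for every row explicitly.
all_rows <- collect(remote, n = Inf)

# Workaround 2: as.data.frame() on the remote table also fetches all rows.
also_all <- as.data.frame(remote)

n_collected <- nrow(all_rows)
n_converted <- nrow(also_all)

dbDisconnect(con)
```

Comparing `n_collected` and `n_converted` against a row count computed in the database is a cheap way to confirm that nothing was truncated.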

100,000 rows is not very many these days. We were on 0.5 for 5 minutes and reverted since we were unprepared for this change and would have to change a lot of collects everywhere. I think this has the potential to silently break things, even with the awareness that there is a default limit, because you always have to remember to unlimit. Slowness gets noticed.

Hadley, thank you for reconsidering :)

Yes, it breaks some old unmaintained packages (e.g. the rplexos package). I could modify these but would prefer a different default behaviour.

The argument n = Inf is also not documented in the help page of collect. Documenting it would be a useful addition, since it would make users aware of the default restriction.

@hadley I have a ton of collect() calls in a codebase. Any ideas or guidance for an easy global adjustment to the new style?
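One possible approach for a large codebase, sketched here rather than taken from the thread, is a small project-level wrapper that always requests every row unless told otherwise. `collect_all` is a hypothetical helper name, not a dplyr function, and the example again assumes dbplyr, DBI and RSQLite are installed; the table `events` is invented for the demonstration.

```r
library(dplyr)
library(DBI)

# Hypothetical helper (not part of dplyr): defaults to fetching every row,
# while still letting callers pass a smaller n when they want a cap.
collect_all <- function(x, ..., n = Inf) {
  dplyr::collect(x, ..., n = n)
}

# Made-up demonstration data in an in-memory SQLite database.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "events", data.frame(id = seq_len(120000)))

result <- collect_all(tbl(con, "events"))
n_rows <- nrow(result)

capped <- collect_all(tbl(con, "events"), n = 10)
n_capped <- nrow(capped)

dbDisconnect(con)
```

Existing `collect(` calls could then be switched to `collect_all(` with a search and replace, which keeps the change in one visible place instead of masking `collect` itself.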

+1 here for changing the default for collect back to Inf. This is breaking a _lot_ of old code. It's also quite dangerous since it silently caps the number of rows returned.

@cwbishop, well it returns a warning... but otherwise your point is sound. :)

@drknexus, very strange. I do not see a warning of any kind. So, in my case at least, it is completely silent. I just happened to do a row count and saw the suspicious number.

Just my very humble opinion on this issue, after ending up here because of an unrelated pool-dplyr bug. @hadley, it seems like a lot of people rely on the old behaviour of collect, and this change breaks that code. Would it be reasonable to make n = Inf the default, but print a warning or message indicating that n should be set explicitly to avoid downloading all the data when it is not necessary? I feel the case for having to deliberately opt in to downloading a huge amount of data is not as strong for a collect call as it is for other dplyr calls, because calling collect already forces the computation of an otherwise lazy table. You are already saying you don't want to be lazy, which could intuitively be taken to also mean "get the whole table" (especially if this used to be the case).

Yes, we'll definitely fix this.

Thanks!
