User.order(:updated_at).find_in_batches(start: 2000, finish: 5000, batch_size: 500) { |batch| do_something }
Expected behavior: the records are returned ordered by the `updated_at` column.
Expected Queries:
SELECT `users`.* FROM `users` ORDER BY `users`.`updated_at` ASC LIMIT 500 OFFSET 2000
SELECT `users`.* FROM `users` ORDER BY `users`.`updated_at` ASC LIMIT 500 OFFSET 2500
SELECT `users`.* FROM `users` ORDER BY `users`.`updated_at` ASC LIMIT 500 OFFSET 3000
SELECT `users`.* FROM `users` ORDER BY `users`.`updated_at` ASC LIMIT 500 OFFSET 3500
SELECT `users`.* FROM `users` ORDER BY `users`.`updated_at` ASC LIMIT 500 OFFSET 4000
SELECT `users`.* FROM `users` ORDER BY `users`.`updated_at` ASC LIMIT 500 OFFSET 4500
Actual behavior: the records are always ordered by the primary key (`id` by default), no matter how the user wants them ordered.
Actual Queries:
SELECT `users`.* FROM `users` WHERE (`users`.`id` >= 2000) AND (`users`.`id` <= 5000) ORDER BY `users`.`id` ASC LIMIT 500
SELECT `users`.* FROM `users` WHERE (`users`.`id` >= 2000) AND (`users`.`id` <= 5000) AND (`users`.`id` > 2499) ORDER BY `users`.`id` ASC LIMIT 500
SELECT `users`.* FROM `users` WHERE (`users`.`id` >= 2000) AND (`users`.`id` <= 5000) AND (`users`.`id` > 2999) ORDER BY `users`.`id` ASC LIMIT 500
SELECT `users`.* FROM `users` WHERE (`users`.`id` >= 2000) AND (`users`.`id` <= 5000) AND (`users`.`id` > 3499) ORDER BY `users`.`id` ASC LIMIT 500
SELECT `users`.* FROM `users` WHERE (`users`.`id` >= 2000) AND (`users`.`id` <= 5000) AND (`users`.`id` > 3999) ORDER BY `users`.`id` ASC LIMIT 500
SELECT `users`.* FROM `users` WHERE (`users`.`id` >= 2000) AND (`users`.`id` <= 5000) AND (`users`.`id` > 4499) ORDER BY `users`.`id` ASC LIMIT 500
SELECT `users`.* FROM `users` WHERE (`users`.`id` >= 2000) AND (`users`.`id` <= 5000) AND (`users`.`id` > 4999) ORDER BY `users`.`id` ASC LIMIT 500
Rails Version: Rails 5.1.0.alpha
Ruby Version: 2.2.3
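The actual queries above follow a keyset (seek) scheme rather than OFFSET pagination. A minimal plain-Ruby sketch of that loop, where the `ids` array stands in for the `users` table and the `select` filter mirrors the SQL `WHERE` clause (names are illustrative, not Rails internals):

```ruby
# Simulate: WHERE id >= start AND id <= finish AND id > cursor
#           ORDER BY id ASC LIMIT batch_size
ids = (1..6000).to_a
start_pk, finish_pk, batch_size = 2000, 5000, 500

batches = []
cursor  = start_pk - 1   # nothing fetched yet

loop do
  batch = ids.select { |id| id >= start_pk && id <= finish_pk && id > cursor }
             .sort
             .first(batch_size)
  break if batch.empty?
  batches << batch
  cursor = batch.last    # the last id seen becomes the next lower bound
end
```

This reproduces the seven queries shown above: six full batches of 500 covering ids 2000..4999, plus a final batch containing just id 5000.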
For now I've worked around the problem by customizing the method `in_batches`, which is called from both `find_in_batches` and `find_each`.
I've also attached a PR with a permanent fix for this issue. Please review it and let me know if any changes are required.
As far as I know, this is a consequence of a performance improvement from #20933. I ran into this behaviour in the past and was surprised too, but it is understandable. I think implementing two different methods is a good idea: leave the default `in_batches` as is and create something like `in_ordered_batches`.
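A hypothetical `in_ordered_batches` (not a real Rails API) could snapshot the primary keys in the requested order up front and then fetch each slice by id. A plain-Ruby sketch of that strategy, with `rows` standing in for the `users` table:

```ruby
Row  = Struct.new(:id, :updated_at)
rows = [Row.new(1, 30), Row.new(2, 10), Row.new(3, 20)]

# "SELECT id FROM users ORDER BY updated_at" -- a stable snapshot of the order
ordered_ids = rows.sort_by(&:updated_at).map(&:id)   # [2, 3, 1]

batches = ordered_ids.each_slice(2).map do |slice|
  # "SELECT * FROM users WHERE id IN (...)", then restore the snapshot order,
  # since an IN query does not guarantee result order
  rows.select { |r| slice.include?(r.id) }
      .sort_by { |r| slice.index(r.id) }
end
```

The trade-off is that the full ordered id list is materialized in memory, which is exactly the cost the keyset approach avoids.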
With `User.order(:updated_at)`, would it always fetch the same records? 1000 records get updated -> their timestamps change -> the next batch fetches those same 1000 recently updated records -> oops. So we get an infinite loop.
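The infinite-loop hazard above can be simulated in plain Ruby (assumption: `Record` stands in for a `users` row, and "processing" touches `updated_at`): with a keyset cursor over a mutable column, touched rows jump past the cursor and get fetched again.

```ruby
Record  = Struct.new(:id, :updated_at)
records = (1..4).map { |i| Record.new(i, i) }

clock   = 10   # monotonically increasing "now", always past the old stamps
cursor  = 0    # keyset cursor over updated_at
fetched = []

10.times do
  # Simulate: WHERE updated_at > cursor ORDER BY updated_at LIMIT 2
  batch = records.select { |r| r.updated_at > cursor }
                 .sort_by(&:updated_at)
                 .first(2)
  break if batch.empty?            # never reached: touched rows reappear
  cursor = batch.last.updated_at
  batch.each do |r|
    fetched << r.id
    clock += 1
    r.updated_at = clock           # "processing" bumps the timestamp
  end
end
```

After ten iterations every record has been processed repeatedly and the loop would never terminate on its own, which is exactly why batching keys on a mutable column is unsafe.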
I don't think batches can support something complex like an order clause on an arbitrary column, because it can lead to weird results. However, I think an option letting the user do `ORDER BY id DESC` instead of the default `ASC` should exist.
The `desc` ordering will be part of Rails 6.1, IIUC. As for the "something weird can happen" argument: if it is purely technical, so be it, but from a consumer perspective we can always write code that does weird stuff anyway. An ordered `find_each` would be great. I suspect more bugs stem from `find_each` not ordering than from non-transactionally-locked ordered `find_each` calls.
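If I understand the keyset scheme correctly, a descending pass only flips the comparison and the sort direction (the SQL would be `WHERE id < cursor ORDER BY id DESC LIMIT n`). A plain-Ruby sketch, with the `ids` array standing in for the table:

```ruby
ids     = (1..10).to_a
batches = []
cursor  = Float::INFINITY   # no upper bound on the first query

loop do
  batch = ids.select { |id| id < cursor }.sort.reverse.first(4)
  break if batch.empty?
  batches << batch
  cursor = batch.last       # lowest id seen becomes the next upper bound
end

batches  # [[10, 9, 8, 7], [6, 5, 4, 3], [2, 1]]
```

Since the primary key is immutable, this stays safe for the same reason the ascending version is: the cursor column never moves under the iteration.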
Another strategy: `find_each` could raise an exception if the relation includes an order different from the PK, with an opt-out guard along the lines of `Data.order(magnitude: :desc).find_each(do_not_raise_on_order: true)`.