User.order(:updated_at).find_in_batches(start: 2000, finish: 5000, batch_size: 500) { |batch| do_something }
The records should be ordered by the 'updated_at' column.
Expected Queries:
SELECT `users`.* FROM `users` ORDER BY `users`.`updated_at` ASC LIMIT 500 OFFSET 2000
SELECT `users`.* FROM `users` ORDER BY `users`.`updated_at` ASC LIMIT 500 OFFSET 2500
SELECT `users`.* FROM `users` ORDER BY `users`.`updated_at` ASC LIMIT 500 OFFSET 3000
SELECT `users`.* FROM `users` ORDER BY `users`.`updated_at` ASC LIMIT 500 OFFSET 3500
SELECT `users`.* FROM `users` ORDER BY `users`.`updated_at` ASC LIMIT 500 OFFSET 4000
SELECT `users`.* FROM `users` ORDER BY `users`.`updated_at` ASC LIMIT 500 OFFSET 4500
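The expected queries above amount to OFFSET pagination over the relation's own order, with start/finish treated as row offsets. A plain-Ruby sketch of that behaviour (not the actual Rails implementation; the record data and the helper name in_ordered_batches are made up for illustration):

```ruby
# Simulates the reporter's expectation: walk the relation in its own order,
# paging with LIMIT/OFFSET between start and finish.
def in_ordered_batches(records, order_by:, start:, finish:, batch_size:)
  sorted = records.sort_by { |r| r[order_by] }
  batches = []
  offset = start
  while offset < finish
    batch = sorted[offset, batch_size] || []
    break if batch.empty?

    batches << batch
    offset += batch_size  # next page: same ORDER BY, bigger OFFSET
  end
  batches
end

# 12 fake rows whose updated_at order is the reverse of their id order.
records = (1..12).map { |i| { id: i, updated_at: 100 - i } }
batches = in_ordered_batches(records, order_by: :updated_at,
                             start: 4, finish: 10, batch_size: 3)
```

With these small numbers the helper issues two "pages" (offsets 4 and 7), mirroring how the six expected queries step the OFFSET from 2000 to 4500 in batch_size increments.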
Actual behaviour: the records are always ordered by the primary key (id by default), no matter how the user wants them ordered.
Actual Queries:
SELECT `users`.* FROM `users` WHERE (`users`.`id` >= 2000) AND (`users`.`id` <= 5000) ORDER BY `users`.`id` ASC LIMIT 500
SELECT `users`.* FROM `users` WHERE (`users`.`id` >= 2000) AND (`users`.`id` <= 5000) AND (`users`.`id` > 2499) ORDER BY `users`.`id` ASC LIMIT 500
SELECT `users`.* FROM `users` WHERE (`users`.`id` >= 2000) AND (`users`.`id` <= 5000) AND (`users`.`id` > 2999) ORDER BY `users`.`id` ASC LIMIT 500
SELECT `users`.* FROM `users` WHERE (`users`.`id` >= 2000) AND (`users`.`id` <= 5000) AND (`users`.`id` > 3499) ORDER BY `users`.`id` ASC LIMIT 500
SELECT `users`.* FROM `users` WHERE (`users`.`id` >= 2000) AND (`users`.`id` <= 5000) AND (`users`.`id` > 3999) ORDER BY `users`.`id` ASC LIMIT 500
SELECT `users`.* FROM `users` WHERE (`users`.`id` >= 2000) AND (`users`.`id` <= 5000) AND (`users`.`id` > 4499) ORDER BY `users`.`id` ASC LIMIT 500
SELECT `users`.* FROM `users` WHERE (`users`.`id` >= 2000) AND (`users`.`id` <= 5000) AND (`users`.`id` > 4999) ORDER BY `users`.`id` ASC LIMIT 500
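The actual queries above are keyset pagination: instead of a growing OFFSET, each batch filters on id greater than the last id seen, which stays fast on large tables. A plain-Ruby simulation of that loop (the record data is made up; this is a sketch of the pattern, not Rails' code):

```ruby
# Simulates the keyset pagination that in_batches performs: always walk the
# primary key, carrying the last seen id into the next query's WHERE clause.
def in_batches_by_id(records, start:, finish:, batch_size:)
  batches = []
  last_id = nil
  loop do
    batch = records
            .select { |r| r[:id] >= start && r[:id] <= finish }
            .select { |r| last_id.nil? || r[:id] > last_id }  # the `id > 2499`-style cursor
            .sort_by { |r| r[:id] }
            .first(batch_size)
    break if batch.empty?

    batches << batch
    last_id = batch.last[:id]  # cursor for the next batch
  end
  batches
end

users = (1..30).map { |i| { id: i } }
batches = in_batches_by_id(users, start: 10, finish: 25, batch_size: 5)
```

Because the cursor is the primary key, the WHERE clause can use the index directly, but any user-supplied ORDER BY has to be discarded, which is exactly the surprise reported here.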
Rails Version: Rails 5.1.0.alpha
Ruby Version: 2.2.3
For now, I've solved my problem by customizing the in_batches method, which is called from both find_in_batches and find_each.
I've also attached a PR with a permanent fix for this issue. Please review it and let me know if any changes are required.
As far as I know, this is a consequence of a performance improvement from #20933. I ran into this behaviour in the past and was surprised too, but it is understandable. I think implementing two different methods is a good idea: leave the default in_batches as is and create something like in_ordered_batches.
With User.order(:updated_at), will it always fetch the same records?
1000 records updated -> timestamps updated -> fetch the last 1000 updated records -> oops, they are the same ones.
So we get an infinite loop.
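The pitfall described here can be simulated in a few lines of plain Ruby (made-up data, not Rails API): batching ordered by updated_at while the processing itself touches updated_at means each pass re-selects the rows the previous pass just processed.

```ruby
# Four fake rows; "processing" a row bumps its updated_at, as a touch would.
records = (1..4).map { |i| { id: i, updated_at: i } }
clock = 100
seen = []

3.times do
  # Fetch the most recently updated rows first (ORDER BY updated_at DESC).
  batch = records.sort_by { |r| -r[:updated_at] }.first(2)
  seen << batch.map { |r| r[:id] }
  # Processing updates the timestamp, pushing these rows back to the front.
  batch.each { |r| r[:updated_at] = (clock += 1) }
end
# Every pass fetches the same two rows; ids 1 and 2 are never reached.
```

The loop never makes progress: the batch cursor chases the timestamps it is itself advancing, which is why a naive ordered find_each over a mutable column cannot terminate.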
I don't think batches can support something complex like an order clause on an arbitrary column, because it can lead to something weird. However, I think the option to ORDER BY id DESC instead of the default ASC should exist.
The desc ordering will be part of Rails 6.1, if I understand correctly. As for the "something weird can happen" argument: if this is only a technical concern, so be it, but from a consumer's perspective we can always write code that does weird stuff anyway. An ordered find_each would be great. I suspect more bugs stem from find_each not ordering than from non-transactionally-locked ordered find_each calls.
One strategy could also be for find_each to throw an exception if the relation includes an order different from the PK, together with an anti-guard (along the lines of: Data.order(magnitude: :desc).find_each(do_not_raise_on_order: true)).
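The guard proposed here could look roughly like the sketch below. Everything in it is hypothetical, not Rails API: the OrderedBatchError class, the check_batch_order! helper, and the do_not_raise_on_order option (spelled do_not_rais_on_order in the comment above) are invented names for illustration.

```ruby
# Hypothetical guard: raise when a batch iteration is asked to honour a
# non-primary-key order, unless the caller explicitly opts out.
class OrderedBatchError < StandardError; end

def check_batch_order!(order_values, primary_key: :id, do_not_raise_on_order: false)
  return if do_not_raise_on_order  # the proposed anti-guard

  offending = order_values.reject { |column, _direction| column == primary_key }
  return if offending.empty?

  raise OrderedBatchError,
        "batch iteration ignores order on #{offending.map(&:first).join(', ')}; " \
        "pass do_not_raise_on_order: true to proceed anyway"
end

check_batch_order!([[:id, :asc]])                                  # fine: PK order
check_batch_order!([[:magnitude, :desc]], do_not_raise_on_order: true)  # opted out
# check_batch_order!([[:magnitude, :desc]]) would raise OrderedBatchError
```

A real patch would hook a check like this into in_batches and read the order off the relation itself; the point of the sketch is only that the failure becomes loud instead of silent.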