Rails: Reordering issue in find_in_batches and find_each

Created on 31 Jul 2016  路  3Comments  路  Source: rails/rails

Steps to reproduce

User.order(:updated_at).find_in_batches(start: 2000, finish: 5000, batch_size: 500) { |batch| do_something }

Expected behavior

The records must have been ordered by column 'updated_at'.

Expected Queries:

SELECT  `users`.* FROM `users` ORDER BY `users`.`updated_at` ASC LIMIT 500 OFFSET 2000
SELECT  `users`.* FROM `users` ORDER BY `users`.`updated_at` ASC LIMIT 500 OFFSET 2500
SELECT  `users`.* FROM `users` ORDER BY `users`.`updated_at` ASC LIMIT 500 OFFSET 3000
SELECT  `users`.* FROM `users` ORDER BY `users`.`updated_at` ASC LIMIT 500 OFFSET 3500
SELECT  `users`.* FROM `users` ORDER BY `users`.`updated_at` ASC LIMIT 500 OFFSET 4000
SELECT  `users`.* FROM `users` ORDER BY `users`.`updated_at` ASC LIMIT 500 OFFSET 4500

Actual behavior

The records are always order by the primary key(id by default) no matter how user wants them to be ordered.

Queries:

SELECT  `users`.* FROM `users` WHERE (`users`.`id` >= 2000) AND (`users`.`id` <= 5000) ORDER BY `users`.`id` ASC LIMIT 500
SELECT  `users`.* FROM `users` WHERE (`users`.`id` >= 2000) AND (`users`.`id` <= 5000) AND (`users`.`id` > 2499) ORDER BY `users`.`id` ASC LIMIT 500
SELECT  `users`.* FROM `users` WHERE (`users`.`id` >= 2000) AND (`users`.`id` <= 5000) AND (`users`.`id` > 2999) ORDER BY `users`.`id` ASC LIMIT 500
SELECT  `users`.* FROM `users` WHERE (`users`.`id` >= 2000) AND (`users`.`id` <= 5000) AND (`users`.`id` > 3499) ORDER BY `users`.`id` ASC LIMIT 500
SELECT  `users`.* FROM `users` WHERE (`users`.`id` >= 2000) AND (`users`.`id` <= 5000) AND (`users`.`id` > 3999) ORDER BY `users`.`id` ASC LIMIT 500
SELECT  `users`.* FROM `users` WHERE (`users`.`id` >= 2000) AND (`users`.`id` <= 5000) AND (`users`.`id` > 4499) ORDER BY `users`.`id` ASC LIMIT 500
SELECT  `users`.* FROM `users` WHERE (`users`.`id` >= 2000) AND (`users`.`id` <= 5000) AND (`users`.`id` > 4999) ORDER BY `users`.`id` ASC LIMIT 500

Rails Version: Rails 5.1.0.alpha
Ruby Version: 2.2.3

I've customized the method in_batches which is called from both find_in_batches and find_each for now to solve my problem.

Also I've attached a PR having the permanent fix for this issue. Please review it and let me know if any change is required.

activerecord attached PR

All 3 comments

As I know, this is an consequence of a performance improvement from #20933. I found this behaviour in the past and was surprised, too. But this behaviour is understandable. I think, implementation of two different methods is a good idea. Leave default in_batches as is and create something like in_ordered_batches.

User.order(:updated_at) so will it always fetch same records?
1000 records updated -> timestamp updated -> fetch last 1000 records updated -> ooops
So we get infinite loop.

I don't think we can use something complex in batches like order clause by any column, cause it can lead to something weird. However I think allowing user to ORDER BY id DESC instead of default ASC must exist.

I don't think we can use something complex in batches like order clause by any column, cause it can lead to something weird. However I think allowing user to ORDER BY id DESC instead of default ASC must exist.

The desc ordering will be part of Rails 6.1 iiuc. And about the "something weird can happen argument": if this is only technical so be it, but from a consumer perspective we can always write code that does weird stuff anyway. Ordered find_each would be great. I would suspect that more bugs stem from find_each not ordering then from non-transactional-locked ordered find_eachs.

One strategy could also be that find_each throws an exception if the Relation includes an order different to the PK. And has an anti-guard (along the lines of: Data.order(magnitude: :desc).find_each(do_not_rais_on_order: true)).

Was this page helpful?
0 / 5 - 0 ratings