Silverstripe-framework: [GRIDFIELD] BUG: pagination in GridField does not work if DataObject default_sort field values are not unique

Created on 7 Aug 2017  Â·  54Comments  Â·  Source: silverstripe/silverstripe-framework

SUMMARY:

Pagination does not work well when the data is provided by MySQL 5.6 / 5.7 (php 7.1) where the data-set is sorted by a field (or fields) that are not enforced to be unique on a database level. The reason for this is that MySQL may return datasets in a random order (beyond what it has been asked to sort by).

You can possibly replicate this bug by yourself by using the tools listed below:

This means, for example, that:

  • batch processing may not work correctly (you will process some records twice and other ones not at all)
  • grid field pagination is broken (some items listed twice or more, on separate pages, others not at all)
  • pagination on the front-end is broken (some items listed twice or more, on separate pages, others not at all)

There are still a lot of questions about this bug, but I have been able to consistently demonstrate this error on several machines with clean installs of 3.6.1

Details

the issue: when the default_sort value for a dataobject contains non-unique values for one or more rows then the GridField ends up NEVER showing some records while showing other records more than once:

How to replicate:

  1. install SS 3.6.1 using standard installer

  2. add the following dataobject (MyDataObject):

<?php


class MyDataObject extends DataObject
{

    private static $db = array(
        'Title' => 'Varchar(20)',
        'SameSame' => 'Varchar(20)'
    );


    private static $default_sort = array(
        'SameSame' => 'ASC'
    );

    function requireDefaultRecords()
    {
        DB::query('DELETE FROM MyDataObject');
        for($i = 1; $i < 47; $i++) {
            $filter = array('Title' => "item #".$i);
            if(! MyDataObject::get()->filter($filter)->first()) {
                $obj = MyDataObject::create($filter);
                $obj->SameSame = 'same';
                $obj->write();
            }
        }
    }

}

then add modeladmin (MyModelAdmin.php):

<?php

class MyModelAdminTest extends ModelAdmin {

    private static $managed_models = array('MyDataObject');

    private static $url_segment = 'test';

    private static $menu_title = 'Test';
}

You get the following results:
image

image

For my client, I was sorting a GridField by EndDate and StartDate. Some of the items has the same EndDate and StartDate and thus were missing. Making it impossible for them to be found by the client.

affectv4 impaclow typbug

All 54 comments

Have you tried adding ID as the last item in the list?
This does look like a genuine bug, but perhaps adding ID would be a workaround for it

@flamerohr - i fixed my code as soon as I noticed the bug by adding ID as a sort key and that should also be the solution for this bug. Data Objects can be sorted in any way, but sort by ID ASC/DESC should always be added as the last sort criteria, no matter what.

Check this out:

I have now added to Page_Controller:

    public function PaginatedObjects()
    {
        $list = MyDataObject::get();

        return new PaginatedList($list, $this->getRequest());
    }

and I have added to /themes/simple/templates/Layout/Page.ss:

    <ul>
        <% loop $PaginatedObjects %>
            <li>$Title</li>
        <% end_loop %>
    </ul>

    <% if $PaginatedObjects.MoreThanOnePage %>
        <% loop $PaginatedObjects.Pages %>
            <% if $CurrentBool %>
                $PageNum
            <% else %>
                <% if $Link %>
                    <a href="$Link">$PageNum</a>
                <% else %>
                    ...
                <% end_if %>
            <% end_if %>
            <% end_loop %>
    <% end_if %>

PAGE 1:
image

PAGE 2:

image

PAGE 3:
image

AMAZING!

I agree with you that this is a critical bug and I am very surprised to find it.

It is so serious because anytime you do not sort by a field that is enforced to be unique on a database level, i.e. in basically all cases, e.g. sorting by Title, Created, LastEdited, SortNumber (SortableGridField), Sort (SiteTree), by Date, and so on.... ) you will potentially end up with missing / double-up records in a GridField / Paginated List.

unfortunately, this is a potential problem in any given relational database, many queries do not explicitly provide a sort (or not a sort with at least a unique column) and has the "implied" sorting by ID, which is what happened here.

I've marked as critical because of data disappearing, but I encourage always adding explicit ID sort to the end of any sort calls.

I can't replicate this on 3.6.1. Can you test it on a plain install?

I run this on a plain 3.6.1

I wonder if the auto increment on your ID is isn't working properly. Can you try deleting the entire DB and starting over? I'm curious if you're getting IDs that are non sequential with your Item numbers due to all the deletions. Try adding ID as a summary_field to help debug.

screenshot 2017-08-08 10 53 58

No problems with IDs

Title | ID
-- | --
item no. 1 | 334
item no. 2 | 335
item no. 3 | 336
item no. 4 | 337
item no. 5 | 338
item no. 6 | 339
item no. 7 | 340
item no. 8 | 341
item no. 9 | 342
item no. 10 | 343
item no. 11 | 344
item no. 12 | 345
item no. 13 | 346
item no. 14 | 347
item no. 15 | 348
item no. 16 | 349
item no. 17 | 350
item no. 18 | 351
item no. 19 | 352
item no. 20 | 353
item no. 21 | 354
item no. 22 | 355
item no. 23 | 356
item no. 24 | 357
item no. 25 | 358
item no. 26 | 359
item no. 27 | 360
item no. 28 | 361
item no. 29 | 362
item no. 30 | 363
item no. 31 | 364
item no. 32 | 365
item no. 33 | 366
item no. 34 | 367
item no. 35 | 368
item no. 36 | 369
item no. 37 | 370
item no. 38 | 371
item no. 39 | 372
item no. 40 | 373
item no. 41 | 374
item no. 42 | 375
item no. 43 | 376
item no. 44 | 377
item no. 45 | 378
item no. 46 | 379

This error actually goes back to mysql. We are running

SELECT * FROM MyDataObject ORDER BY SameSame LIMIT 0,10
SELECT * FROM MyDataObject ORDER BY SameSame LIMIT 10,10
SELECT * FROM MyDataObject ORDER BY SameSame LIMIT 20,10

And Mysql returns the records in a random order - although the first page of pagination seems to refute this theory.

but I encourage always adding explicit ID sort to the end of any sort calls.

@flamerohr - I have not come across many (if any) modules that explicitly sort by ID. For example SiteTree does not sort by ID and there will definitely be situations where the Sort values are not unique in SiteTree from what I know as this is not enforced on a Database level.

image

What kind of output do you get when you just run SELECT Title FROM MyDataObject ORDER BY SameSame ASC in the database?

I run the following code:

    public function MyDataObjectRaw()
    {
        $this->html .= '<br />-------------------------------------';
        $rows = DB::query('SELECT Title, ID FROM MyDataObject ORDER BY SameSame ASC;');
        foreach($rows as $row) {
            $this->html .= '<li>'.$row['Title'].' = '.$row['ID'];
        }
        $this->html .= '<br />-------------------------------------';
        $rows = DB::query('SELECT Title, ID FROM MyDataObject ORDER BY SameSame ASC LIMIT 0, 10;');
        foreach($rows as $row) {
            $this->html .= '<li>'.$row['Title'].' = '.$row['ID'];
        }
        $this->html .= '<br />-------------------------------------';
        $rows = DB::query('SELECT Title, ID FROM MyDataObject ORDER BY SameSame ASC LIMIT 10, 10;');
        foreach($rows as $row) {
            $this->html .= '<li>'.$row['Title'].' = '.$row['ID'];
        }
        $this->html .= '<br />-------------------------------------';
        $rows = DB::query('SELECT Title, ID FROM MyDataObject ORDER BY SameSame ASC LIMIT 20, 10;');
        foreach($rows as $row) {
            $this->html .= '<li>'.$row['Title'].' = '.$row['ID'];
        }
    }

and the output that I get is as follows:

item No.36 = 369
item No.25 = 358
item No.26 = 359
item No.27 = 360
item No.28 = 361
item No.29 = 362
item No.30 = 363
item No.31 = 364
item No.32 = 365
item No.33 = 366
item No.34 = 367
item No.35 = 368
item No.24 = 357
item No.37 = 370
item No.38 = 371
item No.39 = 372
item No.40 = 373
item No.41 = 374
item No.42 = 375
item No.43 = 376
item No.44 = 377
item No.45 = 378
item No.46 = 379
item No.13 = 346
item No.2 = 335
item No.3 = 336
item No.4 = 337
item No.5 = 338
item No.6 = 339
item No.7 = 340
item No.8 = 341
item No.9 = 342
item No.10 = 343
item No.11 = 344
item No.12 = 345
item No.1 = 334
item No.14 = 347
item No.15 = 348
item No.16 = 349
item No.17 = 350
item No.18 = 351
item No.19 = 352
item No.20 = 353
item No.21 = 354
item No.22 = 355
item No.23 = 356

limit stuff 0 -> 9

item No.1 = 334
item No.2 = 335
item No.3 = 336
item No.4 = 337
item No.5 = 338
item No.6 = 339
item No.7 = 340
item No.8 = 341
item No.9 = 342
item No.10 = 343

limit stuff 10 -> 19

item No.12 = 345
item No.1 = 334
item No.10 = 343
item No.9 = 342
item No.8 = 341
item No.7 = 340
item No.6 = 339
item No.5 = 338
item No.4 = 337
item No.3 = 336

limit stuff 19 -> 29

item No.12 = 345
item No.11 = 344
item No.10 = 343
item No.9 = 342
item No.8 = 341
item No.7 = 340
item No.6 = 339
item No.5 = 338
item No.4 = 337
item No.3 = 336

I'm not convinced this is a SilverStripe issue. The way that MySQL resolves an ambiguous sort is highly unpredictable. See (https://dba.stackexchange.com/questions/6051/what-is-the-default-order-of-records-for-a-select-statement-in-mysql).

MySQL sorts the rows any way it wants to, without any guarantee of consistency.

That I can't replicate it should be an indication that we're dealing with some kind of unknown. It could be as simple as differing storage engines, MySQL versions, or both. In short, don't use ambiguous sorts.

Hi Aaron,

I'm not convinced this is a SilverStripe issue.

It is definitely a MYSQL issue - but we can't change MYSQL so we will have to deal with the weirdness in Silverstripe.

In short, don't use ambiguous sorts.

That is true in theory, but I don't think that is true practice, most of the time you are sorting by an ambiguous sort field: e.g.

  • Sort / SortNumber (which are usually not database enforced uique),
  • Title
  • Name,
  • Calendar Start Date,
  • Created,
  • LastEdited,

etc...

Anytime they are not unique, you could end up with this problem. So, I reckon we should fix this, for the following reasons:

(1)
I am sure that most default_sort entries, such as SiteTree do not sort by a database enforced unique key (as per above). If any of them are displayed in a paginated list (e.g. a gridfield, search results, filtered list, etc... etc... ) there is a real chance that some will show up twice on different pages and that other items will never show.

(2) There is an easy fix - add an ORDER BY BaseTable.ID to the end of all SQL Sort statements

(3) The danger with this bug is that no one will notice until it has real life consequences. There is no error, these lists seem to operate fine, etc...

(4) A perfect developer should remember to add the Sort By ID to the end of all sort statements that do not include an unambigious field - but if
(a) you always need to add this no matter what, AND
(b) few developers actually do this in practice -
then it appears to me that the Silverstripe Core should take care of this.

I could replicate this behaviour on my lappy:

| Variable_name           | Value                   |
+-------------------------+-------------------------+
| innodb_version          | 5.7.19                  |
| protocol_version        | 10                      |
| slave_type_conversions  |                         |
| tls_version             | TLSv1,TLSv1.1           |
| version                 | 5.7.19-0ubuntu0.16.04.1 |
| version_comment         | (Ubuntu)                |
| version_compile_machine | x86_64                  |
| version_compile_os      | Linux                   |
+-------------------------+-------------------------+

and on our server

| Variable_name           | Value                                                |
+-------------------------+------------------------------------------------------+
| innodb_version          | 5.6.34-79.1                                          |
| protocol_version        | 10                                                   |
| slave_type_conversions  |                                                      |
| tls_version             | TLSv1.1,TLSv1.2                                      |
| version                 | 5.6.34-79.1                                          |
| version_comment         | Percona Server (GPL), Release 79.1, Revision 1c589f9 |
| version_compile_machine | x86_64                                               |
| version_compile_os      | debian-linux-gnu                                     |
+-------------------------+------------------------------------------------------+

and here is my work machine (issue is also happening here):

| Variable_name           | Value                   |
+-------------------------+-------------------------+
| innodb_version          | 5.7.19                  |
| protocol_version        | 10                      |
| slave_type_conversions  |                         |
| tls_version             | TLSv1,TLSv1.1           |
| version                 | 5.7.19-0ubuntu0.16.04.1 |
| version_comment         | (Ubuntu)                |
| version_compile_machine | x86_64                  |
| version_compile_os      | Linux                   |
+-------------------------+-------------------------+

sunnysideup/ecommerce_merchants | allow many businesses to sell within one shop. | 0.00 per day
sunnysideup/ecommerce_modifier_example | provides an example of a modifier to help you create your ow... | 0.00 per day
sunnysideup/ecommerce_mybusinessworld | This module adds a basic connection between silverstripe e-c... | 0.00 per day
sunnysideup/ecommerce_nz_connectivity | New Zealand Post - Client for the rate finder web API for Si... | 0.00 per day
sunnysideup/ecommerce_product_variation_colours | add colours to your silverstripe e-commerce product variatio... | 0.00 per day
sunnysideup/ecommerce_repeatorders | silverstripe-ecommerce_repeatorders | 0.00 per day

Here is another test I ran:

Page:

    function requireDefaultRecords()
    {
        for($i = 1; $i < 400; $i++) {
            DB::alteration_message('Creating Pages: '.$i);
            $filter = ['Title' => 'Page No. '.$i];
            $page = Page::get()->filter($filter)->first();
            if(! $page) {
                $page = Page::create($filter);
            }
            if($i > 10) {
                $parentPage = Page::get()->filter(array('ID:LessThan' => 10))->sort('RAND()')->first();
                if($parentPage) {
                    $page->ParentID = $parentPage->ID;
                }
            } else {
                $page->ParentID = 0;
            }
            $page->writeToStage('Stage');
            $page->publish('Stage', 'Live');
        }
    }

Page_Controller:

    public function UnpaginatedPages()
    {
        return Page::get();
    }

    public function PaginatedPages()
    {
        $list = Page::get();

        $list = new PaginatedList($list, $this->getRequest());
        $list->setPageLength(395);
        return $list;
    }

Page.ss

    <% loop PaginatedPages %>
        <li>$Title</li>
    <% end_loop %>

    <% if $PaginatedPages.MoreThanOnePage %>
        <% loop $PaginatedPages.Pages %>
            <% if $CurrentBool %>
                $PageNum
            <% else %>
                <% if $Link %>
                    <a href="$Link">$PageNum</a>
                <% else %>
                    ...
                <% end_if %>
            <% end_if %>
            <% end_loop %>
    <% end_if %>

Pages show up in random order rather than the order they were added because the Sort value for all pages is 1:
image

Having all the Page.Sort values at 1 means that it is very likely this bug occurs.

(2) There is an easy fix - add an ORDER BY BaseTable.ID to the end of all SQL Sort statements

I really think this is dangerous scope creep for the ORM, and it sets a really bad precedent. The ORM is an abstraction layer between you and SQL. It's primary purpose is to make querying more succinct, more secure, and more declarative. It is not the ORM's job to fix your bugs. It should be very possible to write bad queries with the ORM, and sorting by something you know to have a singular value is one of those areas.

(a) you always need to add this no matter what

Personally, I've never had to do this. Keep in mind, you're testing a very extreme case, where all the values of a column are equal, and in any real-world scenario, it would be apparent to the developer that he was sorting on a column with uniform values. If, by contrast, I sort by Title ASC, and 2/100 records have the same title, I'm not sure I've ever noticed any consequences for the logic MySQL uses to resolve that conflict. If I did, I'd add a second column.

The best practice, in these scenarios, is to encourage developers to implement composite default_sort values for their models, if they are aware that their sort column is not unique. I concur with @unclecheese on this.

E.g.

private static $default_sort = '"BaseTable"."Sort" ASC, "BaseTable"."ID" ASC';

These can also be specified via array syntax.

private static $default_sort = [
    'Sort' => 'ASC',
    'ID' => 'ASC'
];

I believe the array syntax supports ORM to SQL column mapping under the hood, so that might be preferable to raw SQL fragments. :)

Shall we make the outcome of this problem a documentation update?

I agree, documentation to encourage this would be the best outcome for this...

I don't believe enforcing an ID sort at the end of all queries which uses a sort is a good idea.

The SiteTree.default_sort also not having ID doesn't mean it's best practice :) in fact if anyone created a pull request to add ID in for SiteTree, I think it'd be a welcomed change.

The best practice, in these scenarios, is to encourage developers to implement composite default_sort values for their model

@tractorcow

What about SiteTree.Sort and Blog:

https://github.com/silverstripe/silverstripe-blog/blob/master/src/Model/BlogPost.php#L134-L139

Both models are bound to have non-unique sort values.

We could add ID to those models exclusively, but that's still preferable to baking it into datalist for all classes globally.

example 1:

On an e-commerce website we offer discount vouchers. We had set the default_sort to EndDate DESC, StartDate DESC as that made sense at the time. In the modelAdmin that worked great when we tested it.

Then the client adds a whole bunch of discount vouchers with the same start and end dates.

When he is trying to delete them, there are around 8 / 46 that are not showing up in the ModelAdmin (and another 8 that show up twice) ...

example 2:

Member List for Club, Ms Smith can not find herself in list, calls admin, admin finds her on page three of list, sends her a screenshot and a link to page three. Ms Smith can not find herself on that page (there are a few Smith people, list is sorted by last name).

What can we do?

I am not married to a solution, but it seems to me that this is definitely something we should improve for our users because they have much to loose and nothing to gain from Mysql seemingly random sorts.

I can see the following options:

  1. add ID sort to all ORM get calls that do not use a non-unique Sort Field.

  2. add ORDER BY ID ASC at the end of any ORDER BY statements to all Mysql select statements. If we could work out when / what versions are affected then we could limit it to those versions?

  3. provide a yml option for (2)

  4. improve documentation about this

  5. add ID sort to PaginatedList and GridField pagination

SiteTree.Sort isn’t supposed to be unique, is it? The sort is relative to the parent page

@kinglozzer no, SiteTree.Sort is not supposed to be unique, but what I found is that when you sort by a non-unique field (e.g. SiteTree / BlogPost - and 100s of other DataObjects) then pagination (including GridField pagination) can do pretty crazy things as Mysql is rather unpredictable in its sortorder when there are non-unique values (i.e. it sorts correctly, but rows with the same values show up in a random order). This means that some items show up twice on separate pages and other items never show up at all.

Here is my code as a repository:

https://github.com/sunnysideup/silverstripe-sort-test

I did some further testing. It does not happen with pages, it only happens with DataObjects.

I think this is really getting into a philosophical discussion on what is the role of framework/ORM. That seems to be where we're divided. My thoughts:

If you assign a singular sort clause to a DataObject for a column that is not entirely unique, and it produces unexpected results, that's on you as the developer who wrote that code. To expect the framework to jump in and save you is undue, for the same reason the template parser doesn't close your </p> tags.

If we were to implicitly put ID ASC in the sort of every DataList, we would be punishing every developer who went about it the right way, and used unique columns, or composite clauses. There's a performance impact for having multiple sort columns, and it's nothing I would want to impose on every query without a full opt-in from the user. If I'm sorting on something almost guaranteed to be unique, like Created, I don't need or want ID in the mix as well, and if I have to write special code to tell the framework not to do that, that creates a really suboptimal developer experience.

Here are some alternative ideas:

  • add sort by ID to GridField paginated lists?
  • add sort by ID to Paginated lists (optional)?

Also - Created is often not unique (e.g. import of data - import through third-party API - high volume sales).

I think firstly that the bug needs more research (exactly when does it happen and how / why).

Both GridField and PaginatedList are ORM agnostic. They're designed to work only with instances of SS_List, so injecting concerns over ID into either of them is inappropriate.

The fact that many users would be punished by this safety feature is really a non-starter for me. I don't think it aligns well with the core values of the product.

I tried sorting by Created - same issue (most objects are created within one second).

That's not the point, though. In a real-world scenario where you were using data that was all generated programatically in a matter of milliseconds, you would intuit the Created column to be unreliable for sort, and use a different column, or a composite sort clause. What you can demonstrate in test code should not influence the decisions we make that affect the developer experience.

Developers are responsible for understanding the idiosyncrasies of their data, and to install guardrails in the framework to protect them from writing bad code sets a terribly dangerous precedent. If it is acceptable for the ORM to audit the efficacy of your sort clause (at the expense of performance), it logically follows that the framework is permitted to make all kinds of other overreach in the name of developer hand-holding, like closing <p> tags, preventing redirect loops, removing unnecessary queries, etc.

:-)

Thank you for all your attention to detail in this bug report.

These are my assumptions right now:

  • the problem exists
  • it may negatively affect our users in a significant way
  • most Silverstripe code out there does not follow best practice in relation to this issue (i.e. most code does not add a Sort By Unique Field to pagination / gridfields).

Reply from SS core team is: fix your own code, it is not good to punish best practice by adding a hack for sloppy code (e.g. always sorting by ID to help out lazy developers).

I think we need to test the assumptions a bit more first. I have set up a small repo other developers can see the bug in action: https://github.com/sunnysideup/silverstripe-sort-test/.

Thank you again for looking into all of this.

Have added an issue to the backlog. Closing as an actionable story has been created.

I just realised then when you sort by one the headers in a gridfield, you end up reproducing the same bug.

I am now sorting my dataobjects by ID
image

That works in the GridField:

image

But when I click on SameSame, you end up having some items that never show up and other items that show up twice, on two separate pages.

image

Page 2:

image

e.g. item 69

So back to the original user story... he / she opens the list of discount coupons, sorts it by start date (because that is an easy way for her / him to find it) and still ends up not finding the discount coupon, while others show up twice.

I get that this is an issue when you have duplicate field values. For me, the best practice when it was known that you could have duplicate values and it was important to have _consistent and predictable_ ordering (particularly in the case of pagination), I was always certain to include a secondary sort field to reduce ambiguity at the database level. That said, I just wanted to point out:

... but sort by ID ASC/DESC should always be added as the last sort criteria, no matter what.

I would have to disagree with this, since it I think it potentially adds an extra level of magic ✨ which is the last thing you want in an already complicated ORM. What happens when the ORM incorrectly parses/interprets the user's intent in their $default_sort property? How do we know they want older or newer records first (when there are dupes)? What happens if you need to use FIELD(...) or RAND(...) functions? I think it adds more issues than it solves, since:

  1. Adhere to best practice; when sorting order must be consistent, ensure you resolve ambiguity on your own.
  2. KISS principle: Let the database do the heavy lifting and reduce complexity as much as possible (increases transparency).

@patricknelson, I agree with you in theory and thank you for posting, but in practice this is not always so easy or practicable. My thinking right now is that we should add a Sort by ID ONLY when you limit a set, but I have not really found a solution yet.

Have you considered that, for example, by default the GridField allows the user to sort by all Summary Fields that are database fields? Thus, when you sort by a non-unique value in the GridField this bug will raise its ugly head, even though your default_sort may have included the ID field.

That's GridField, however, which allows you to select sort columns. I was replying to this (emphasis mine):

... but sort by ID ASC/DESC should always be added as the last sort criteria, no matter what.

... and then clarified that this shouldn't happen at the ORM (or framework) level when sorting in general. When you're in a GridField, I think it might be fine to enable that feature (possibly by default) and then maybe allow disabling of it via some extra setting on your GridField instance, if it gets in the way.

Hi Patrick,

Thank you for our reply. I think the first thing is really understand the issue, when does it happen, where does it happen, etc... ? I also jumped to the "solution" but I think the key is to understand the issue first. It does not seem to happen with all DataObjects, so I want to know, for example, what DataObjects are affected, when, etc... Its random presentation also makes it really easy to miss or not to worry about it where it may raise its ugly head when you least expect it.

Basically what I am saying: we are potentially presenting our users with bad data (some data multiple times, other data hidden) - this happens, for example, with GridField - out of the box. I think that is the crux. How we fix it is secondary (we may leave it to developers as suggested by core team). I want to fix it in all my sites, but I want to do it in a smart way.

It seems to me when you add DataObjects programatically (i.e. many at the same time) then it is more likely to happen. Also, I could not replicate it on Pages, why is that?

We've had a bit of a chat about this internally, and there's some consensus that a halfway measure to mitigate this is appropriate.

First, the condition of MySQL returning a non deterministic result set when ambiguous sort columns are used is absolutely a feature, not a bug. From the MySQL docs:

If multiple rows have identical values in the ORDER BY columns, the server is free to return those rows in any order, and may do so differently depending on the overall execution plan. In other words, the sort order of those rows is nondeterministic with respect to the nonordered columns. One factor that affects the execution plan is LIMIT, so an ORDER BY query with and without LIMIT may return rows in different orders.

While MySQL is providing the "expected result" (albeit non deterministic) in this case, it's fair to say DataList is not. As a implementor of the Sortable interface, DataList is expected to sort deterministically. Further, it is the job of the ORM to obscure the "how" from the user, and create the list in a consistent, uniform way, whether you're using MySQL, Postgres, etc, or just plain ArrayData.

All that said, to force ID into the sort is still too much magic, but ensuring default_sort is added seems like a reasonable halfway point. The problem is, with the direction being included as part of the default sort, it makes sort direction toggling (I.e. GridField column headers) unreliable. Therefore, a more sensible setting might be something more like default_sort_column, with null being allowed to disable forced deterministic sorts.

Thoughts?

Good thoughts ;-) I agree wit hall of that.

Here are a couple of thoughts I had:

Putting these two together ... whenever you use limit in an ORM call, you will have to sort by ID as the last sort fragment (e.g. turn ORDER BY "StartDate" DESC into ORDER BY "StartDate" DESC, "ID" ASC).

So, what I would recommend is: whenever we limit a recordset in an ORM call, we add sort by ID as the last sort phrase (with the ability to turn it off) by default. This is how I would do it:

When putting together the final SQL for a database query - IF:

  • limit is included AND
  • there is no sort by a database enforced unique field AND
  • if the database is MySQL (others???) AND
  • code does not specifically turns off this feature THEN
  • add sort by ID

Research questions:

  • why does this random sorting no occur on Pages - AFAIK?
  • does it also occur on other Databases (apart from MySQL)?
  • what is the time difference between SELECT * FROM Table AND SELECT * FROM Table ORDER BY ID - for a table with a few records and a table with a ton of records...? (I can test this)
  • is 4.0 also affected?
  • are there faster ways to ensure a reliable sort order other than adding ORDER BY ID
  • how we do we ensure that there is no ambiguous ID added (i.e. the baseclass / table is always added).

So the answer to most of your questions/concerns is that it's not a bug or oddity with MySQL, or Postgres, or pages, or DataObjects. This is something that's encoded in the SQL standard. The server is free to return the rows in any order it wants when identical values are encountered. That means, potentially, all database platforms are affected, and the way that this issue may manifest itself in various contexts, be it with the application of certain clauses, implementation of / version of storage engines, or just plain phases of the moon, is irrelevant. The point is that the SQL standard in this case tells us the result is nondeterministic, and we have an API that promises to be deterministic, so we need to reconcile that conflict somehow.

I think if we were to add a default_sort_column setting, it would have to be applied to all queries -- not just those that use LIMIT. If you had page of results that was all-inclusive, with ambiguous sort, that URL now becomes non-cacheable, as the results could in theory be returned differently on subsequent page loads. The goal should be to make DataList deterministic, not to negotiate the random behaviour of your database platform.

are we able to infer default_sort_column from getting the PrimaryKey type fields in DataObjectSchema::databaseFields()?

Probably a bit too much magic for my taste, but it's an idea to bounce around.

Since I think this is just in reference to DataObjects (where the default_sort_column setting will exist), the ID column should also always exist. Is it possible to change change a DataObject's primary key? I don't think so, but if it is, then maybe that'd be a concern.

How does it work in PHPMyAdmin?

I think we need to do more research. I found that only some tables return with duplicates between limited segments of the table, while other tables do not (e.g. SiteTree).

A challenge I've considered with a default_sort_column is that there needs to be a way to tell a datalist to reverse it in the case where a non-default sort column (E.g. Title) is reversed, otherwise groups of rows with the duplicate Title column would not reverse. That makes the solution not able to be applied so transparently.

IMPORTANT

A:

SELECT * FROM `SiteTree` LIMIT 20, 10 

B:

SELECT * FROM `SiteTree` LIMIT 20, 10 ORDER BY `ID`

C:

SELECT * FROM `SiteTree` LIMIT 20, 10 ORDER BY `sort`

NOT sorting and sorting by ID are equally fast, but sorting by Sort is significantly slower ... see: https://github.com/silverstripe/silverstripe-framework/issues/7272

This evidence shows that sort by ID vs no sort key has no performance impact and sorting by ID is faster, most of the time, than sorting by the default_sort.

Yeah it's not just in sorting in ASC/DESC but also if/when you're using functions to sort (e.g. FIELD(...) or IF(...) and maybe some others, mentioned his above). I don't know what the solution to this would be since this is known behavior, generally speaking. You could always suggest people include a secondary sort column in the meantime (separate issue, update doc's). Depending on how users apply sort, it may be possible to infer the direction if you apply ID as the second column. Often time it's just a second parameter passed to ->sort('Sort', 'DESC'). It's when you get into custom sort queries that his coildnt be parsed reliably (you could try) but I think at that level the dev should have more control with less magic.

Ask: Yeah @sunnysideup, typically sorting by indexed fields is a bit faster than unindexed.

OK, there's about twenty screenpages of discussion here ;) If you care strongly about resolution, can somebody please summarise the state of the discussion with options going forward?

Summary:

  • When an ambiguous sort column is used, the database is free to return the rows in any order it wants, making the operation non-deterministic.
  • This makes pagination unpredictable (e.g. repeated entries across pages)
  • As an implementor of Sortable, DataList _should_ sort deterministically.
  • Therefore, we should patch it
  • Forcing ID ASC into the query is too much magic. Something like default_sort_column would be more appropriate
  • The bug has existed at least since SS2, and no one raised it until now, so impact/low is fitting.

Forcing ID ASC into the query is too much magic. Something like default_sort_column would be more appropriate

I think we need to disambiguate the use (or the nomenclature) of default_sort_column since we already have default_sort which can define both one or more columns and their corresponding directions. As such, since we're addressing a specific problem (deterministic sorting) I think we should take a specific approach that makes it a bit more transparent, both in nomenclature and in implementation.

So, I'd suggest we do the following:

  • Utilize ID as our secondary sorting column failover. These are all DataObjects and, as such, all will contain this column.
  • Still add a new config option, but instead of default_sort_column (which is somewhat ambiguous in meaning), we use the name such as: sort_deterministic, sort_secondary_id, sort_failover or similar.

    • Note: I'd stay away from the default_sort_ prefix since we're not defaulting anything per se (as you'll see below). Instead, we're enabling/disabling a deterministic sorting algorithm at the database level, which (yes) does mean we have to inject extra fluff into the ORDER BY clause and (yes) this setting is enabled by default. However, this is a boolean (noted below).

  • Per above (and discussion): This should be an on/off setting and this should default to true.
  • Direction: We can automatically adjust the direction of secondary ID sort failover by defaulting to ASC in our algorithm and then parsing the existing order setting in our query prior to final execution for a direction and, if not blank/ASC, we can then set it to DESC.
  • Disabling/overriding: This can be done one of either one or two ways (uncertain on second):

    • Override the sort_deterministic setting and set it to false.

    • Parse final order clause for any secondary sort fields and, if they exist, remove.

    • NOTE 1: I'm unsure if this is necessary, as you can have as many additional sort columns as you like (as far as I'm aware) and a developer with the confidence of adding a secondary sort column may already have a very unique sort, so adding ID may be redundant for several reasons but not necessarily negative or have an impact on performance.

    • NOTE 2: Then again, if we are parsing the first sort column for direction, this is an argument for disabling once we see two columns, since we can't reliably determine what direction to sort in if we have multiple columns already provided, can we?

    • At minimum, if we retain ID even if a secondary column is manually provided, we may still want to do a quick de-dupe check to ensure this ID column isn't already specified (to prevent accidentally overriding a contrary direction, e.g. DESC instead of ASC).

Beyond the uncertainty of those two notes above on automatically disabling, I think we'd be set. Thoughts?

I agree with @patricknelson that having another config setting called default_sort_something is ambiguous and conflicts with the existing default_sort, but the idea is the same - to have a failover column to enable deterministic sorting.
Having ID as a default for whatever this setting is called is a good idea, as all DataObjects have this (maybe rare exceptions).

If the idea can be kept simple such as a single config private static $determined_sort = ['ID' => 'ASC']; and that can be extended/disabled as desired by developers through php or yml that would be good - yes you can set a config to false or null in yml in ss4.

I think some research may be useful:

  • what are the consequences of adding a sort by ID ASC to all sql statements (after any other sort statements - if any), so that you always have a predictable outcome? This would be my solution and apart from performance reductions, I can't see any other reason why this is not a good solution. [We may consider only doing this lists that are "limited").

  • I noticed that dataobjects do sort randomly but pages do not and I wonder why that is. It would be worthwhile to work out why (maybe to do with joins?)

Also - while no one has noticed this issues, this is not a benign one. At least two of my clients noticed it (and were unable to edit data because of it) in gridfields (where I had set the sort order to a field that was something like a date field).

Adding an index to SiteTree.Sort should be a no-brainer AFAIK because it will make all retrievals of Pages faster (whenever there is no custom sort) - see: https://dba.stackexchange.com/questions/11031/order-by-column-should-have-index-or-not.

Actually, what if we just had a default_sort_secondary field? This could not only be consistent with the existing default_sort field, but can be automatically defaulted to ID ASC (or be empty to retain existing functionality). Let me explain some scenarios and why I think this might work.

Functionality:

  • Default value for default_sort_secondary is ID ASC.

    • EDIT: Alternatively, this could be left empty and when default_sort_secondary is not defined (or no second column is defined in default_sort) we retain existing functionality.

  • Behind the scenes, the secondary column and direction are parsed separately before being handed to the database. This way, we can handle scenarios when the user begins to customize sorting. The sort direction is just an initial value that we start with and then _negate_ when the user negates the initial direction. That way if the user ever selects another primary sort column or ever chooses to negate the direction, we just negate our initial direction. No magic/mystery to this, just keep it logical. So, we're just shifting perspective. Logically, if you reverse the current sort, the secondary sort must also be reversed as well for consistency.
  • If default_sort is defined with a secondary sort value, that secondary sort will be used. e.g. Date ASC, Title ASC will imply a default_sort_secondary value of Title ASC.
  • Conflict resolution: If current sort column (e.g. Title) is equal to the above resolved secondary sort, secondary sort fails over to the ID column (since we know ID will always be unique).

This satisfies:

  • Nomenclature consistency (default_sort_secondary as a natural companion to default_sort). Useful for understanding internally how sorting is being handled and etc.
  • Sane initial defaults which are less magical (e.g. if the user sets up a secondary sort column in default_sort, this is parsed/retained in default_sort_secondary and functionality is extended).
  • Logical, consistent and deterministic results for pagination.

What do you guys think about that? Am I missing any scenarios or edge cases, possibly?

EDIT: Also, I propose that if we do this at all, at minimum it should be performed for GridField's because this is where it keeps arising for me (particularly due to the automation and framework level handling of user controlled sorting, making it difficult to modify/access without changing core code). That's only if we're not comfortable adding this to _every_ SELECT ... SQL query.

Any movement ?

Just skim read through this delightful thread. While the suggested feature would provide some benefit for some very specific use cases, it sounds like it's somewhat low value for most common scenarios. Those people who would benefit from this have a relatively easy alternative (just manually include a ID sort in their queries).

My instinct would be to close this issue has a "won't fix" since there's no clear agreed upon action to take.

Reread this (skimmed). That would be fine with me as it's still possible (as the developer) to manually define the secondary sort column via ::$default_sort. However, how do you (as SilverStripe) conclude this and resolve @unclecheese's point that DataList is intended to be deterministic, per https://github.com/silverstripe/silverstripe-framework/issues/7249#issuecomment-321949846?

That is ultimately what I think this issue morphed into, which makes it more broad and less about a specific use case like you'd find just in GridField pagination (or whatever). It's probably still low priority since while it may occur frequently, it's easy to fix (i.e. via best practices @tractorcow mentioned), but would still represent an important low level change, considering the breadth of it.

So to summarize: Is DataList deterministic, despite the underlying database layer or not?

Was this page helpful?
0 / 5 - 0 ratings

Related issues

kinglozzer picture kinglozzer  Â·  4Comments

maxime-rainville picture maxime-rainville  Â·  3Comments

chillu picture chillu  Â·  5Comments

ntd picture ntd  Â·  4Comments

dhensby picture dhensby  Â·  6Comments