Cphalcon: [NFR][DISCUSSION] Eager loading

Created on 4 Sep 2018 · 9Comments · Source: phalcon/cphalcon

First of all, thanks to all contributors for their time and effort they invested in making Phalcon a great framework and improving it!

To the topic, eager loading was discussed in #2268, #2647 and mostly in #1117. Looking at those, one will see how people desperately want this feature. In #1117 it was rejected, @andresgutierrez was against it because of performance doubts and duplication of unneeded data and @sergeyklay closed it saying:

Eager Loading is a really bad idea popularized by frameworks that do not care about performance.

If you guys rejected that NFR saying that you don't have time or any other resource to introduce that, I'd be okay with it as I'm aware that this is open source project and as such, nobody is obligated to grant anyone's wishes.

But I disagree with the reasons you named for rejecting it. Or, maybe I'm not getting it. Anyhow, let me try to explain using a simple example: Imagine a common master-detail relationship scenario, like invoices for an example. Obviously, invoices consist of invoice_items which I want to present together alongside of invoice itself.

Invoice 1
- invoice_item
- invoice_item

Invoice 2
- invoice_item
- invoice_item
- invoice_item

Now imagine there are 50 or 100 of invoices. I know it doesn't make sense to do this for invoices, but perhaps it does for some other scenario. That's beyond this discussion. We obviously don't want to do following because of n+1 number of requests.

$invoices = Invoices::find();

foreach ($invoices as $invoice) {
    echo $invoice;

    foreach ($invoice->invoice_items as $invoice_item) {
        echo $invoice_item
    }
}

So people are asking for something like

$invoices = Invoices::findWith([], 'invoice_items');
// or
$invoices = Invoices::query()->with('invoice_items');

something which is present in other frameworks (which, admittedly, are not necessarily always good ones to lookup to)
```c#
// EntityFramework
var invoices = context.Invoices.Include(i => i.InvoiceItems).ToList();

```php
// yii2
$invoices = Invoices::find()->with('invoiceItems');

// laravel
$invoices = App\Invoices::with('invoiceItems')->get();

So, @andresgutierrez said that implementing this would harm performance because it would fetch duplicate fields and unnecessary data and that we should use PHQL instead.

So I have two questions:
1) How would you achieve that with PHQL, because it is not the same. In PHQL you would do something like:

$builder = ....
$results = $builder->columns('Invoice.*, InvoiceItem.*')
                   ->from('Invoice')
                   ->join('InvoiceItem')
                   ->getQuery()
                   ->execute();

foreach ($results as $result) {
     echo $result->invoice;
     echo $result->invoiceItem;
}

Output:

Invoice 1
- invoice item

Invoice 1
- invoice item

Invoice 2
- invoice item

Invoice 2
- invoice item

Invoice 2
- invoice item

So above doesn't work very well in this case and would require extra work on every fetch to achieve desired output and is not as straightforward as

$invoices = Invoice::query()->with('invoiceItem');

foreach ($invoices as $invoice) {
    echo $invoice;

    foreach ($invoice->invoice_items as $invoice_item) {
        echo $invoice_item
    }
}

Which leads me to
2) what @andresgutierrez said #1117 how eager loading would lead to results like this:

| _r_id | _r_name    | _r_type    | _r_year | _r_datetime         | _r_text | _AA0_id | _AA0_robots_id | _AA0_parts_id |
+-------+------------+------------+---------+---------------------+---------+---------+----------------+---------------+
|     1 | Robotina   | mechanical |    1972 | 1972-01-01 00:00:00 | text    |       1 |              1 |             1 | 
|     1 | Robotina   | mechanical |    1972 | 1972-01-01 00:00:00 | text    |       2 |              1 |             2 | 
|     1 | Robotina   | mechanical |    1972 | 1972-01-01 00:00:00 | text    |       3 |              1 |             3 | 
|     2 | Astro Boy  | mechanical |    1952 | 1952-01-01 00:00:00 | text    |    NULL |           NULL |          NULL | 
|     3 | Terminator | cyborg     |    2029 | 2029-01-01 00:00:00 | text    |    NULL |           NULL |          NULL | 
+-------+------------+------------+---------+---------------------+---------+---------+----------------+---------------+

Maybe I'm missing something, but isn't that the way SQL is supposed to work given it's relational DBMS? But for the sake of the argument, if output like that is problematic, wouldn't that make the solution @andresgutierrez proposed of using PHQL equally problematic? Because with PHQL you would get:

| invoice   | invoice_date | invoice_item |
+-----------+--------------+--------------+
| invoice 1 |  2018-01-01  |  product 10  |
| invoice 1 |  2018-01-01  |  product 11  |
| invoice 2 |  2018-11-12  |  product 20  |
| invoice 2 |  2018-11-12  |  product 21  |
| invoice 2 |  2018-11-12  |  product 22  |
+-----------+--------------+--------------+

I wouldn't call this a problem because it is expected output from a join.

Eager loading may have drawbacks, but so do many other things which found their way into the framework. It's a tradeoff, like always. One shoe does not fit everyone. It may be problematic in huge datasets, but it may be useful and completely valid in other cases. That should be left to developers to decide whether that type of loading makes sense for their use cases. Having that said, I strongly, strongly disagree with calling eager loading a bad idea.

I believe community would benefit from having this feature.

Thanks,
Stefan

stale

Source

crnjakovic

👍7 ❤1

Most helpful comment

I hope that I understood all comments correctly. I'm working with a lot of data and native SQL for a long time now. In most cases I realized, that JOINing data in native SQL will not only lead to duplicate data that has to be de-duplicated on client side (i.e. PHP), but also to performance issues.

In Phalcon I want to work with models not with raw data, so I accept the argument that data duplication caused by JOIN clause will be a problem. But, taking the invoices example there is a drawback. When iterating over invoices then every access on invoice_items will run a separate SELECT statement for the invoice_items that are related to the current invoice. That's horrible. Imagine I have 50 invoices and want to get its items, then there are 51 SELECTs executed.

What I have done in past, is collecting all invoice IDs and execute a single SELECT against database. Then I have two resultsets, one for invoices and one for invoice_items. As (in my case) all models have unique keys, it's easy to join this data inside PHP and not inside MySQL. That works clean and for me, it's faster than doing JOIN in SQL. I would love to see a solution where Phalcon does this natively, which means: query data for invoices, query all related invoice_items at once based on defined relation in model and join this data automatically.

mohahn on 12 Oct 2018

👍3

All 9 comments

I understand that you are trying to explain. I know that such feature can return nested outputs, that is especially good for API responses, like in Laravel with with(). But, there are no damn hidden magic at background, it's a simple data looping and adding nesting. It has some adequate wrapper, which is useful for beginners or those who does not care much about performance, because it adds extra layers of logic. Current wrapper can be extend or made by yourself without having it in core, like Laravel. And it will be well boxed for your project, because you wrote it for your needs...

Also there are no problem to make similar behaviour with models relation, in case you want nested arrays/objects output.

Add relation one to many, loop your invoices and add invoice items as you wish... Or add some generic logic for DRY.

Jeckerson on 4 Sep 2018

Eager loading may have drawbacks, but so do many other things which found their way into the framework. It's a tradeoff, like always. One shoe does not fit everyone. It may be problematic in huge datasets, but it may be useful and completely valid in other cases. That should be left to developers to decide whether that type of loading makes sense for their use cases. Having that said, I strongly, strongly disagree with calling eager loading a bad idea.

That's why it's not in core but in incubator.

Also returning full objects, like doing hydration only for dispaling, returning it as api response is bad idea overall.

Jurigag on 4 Sep 2018

@Jeckerson

But, there are no damn hidden magic at background, it's a simple data looping and adding nesting.

ORM aint magic either, it's wrapper around fetching, inserting, updating, deleting, relations and other kind of logic, isn't it? You can apply "it's not magic" for anything, really.

Also there are no problem to make similar behaviour with models relation, in case you want nested arrays/objects output.

I know we can, I even demonstrated it, but at what cost? N+1 queries?

How is having a method which will perform join query on related methods and return nested objects any different than doing a PHQL/raw SQL query like @andresgutierrez suggested and looping through that same result set and populating objects except it's offered out of the box in standardized manner like many desire? OR let me rephrase, how is latter any more performance friendlier? Especially given that PHQL will create whole objects just to NULL their properties? But that's a separate issue entirely.

That's why it's not in core but in incubator.

Also returning full objects, like doing hydration only for dispaling, returning it as api response is bad idea overall.

Why don't we then dump a whole bunch of features currently in core into the incubator too, starting with metadata memory adapter? Performance wise, it's really bad idea.

Why are we making phalcon so opinionated? Rather than dismissing a feature because in certain scenarios it may not be the best thing to do performance wise, why don't we add that tool to the toolbox and see if there is any way we can make it better and faster, so it becomes valid tool for more scenarios?

To me, the argument @Jurigag is advocating is equivalent to: "I don't think we should have any hammers around here, because striking an object with another one is generally a bad idea. But if you want to nail something, you can take that metal pipe and just whack it."

Look, I'm not a spoiled kid demanding features to be rolled out in an open source project because I need it, I'm just really convinced we are making a mistake dismissing this so easily with current arguments, and think this requires more inputs, discussion and revision.

crnjakovic on 4 Sep 2018

@scrnjakovic I got your point about out of box, but IMO, Phalcon isn't swiss knife which have all out-of-box...

Also,

So in some future, you would have possibility to write your own module with Eager loading feature...

Jeckerson on 4 Sep 2018

mohahn on 12 Oct 2018

👍3

@mohahn more or less.

You should keep in mind that IN () can also be problematic. You should also keep in mind that any kind of ORM will not completely erase the need to use what you call "native" SQL.

The issue I'm trying to point out here is that proposed "solutions" in other issues about the same feature contradict the arguments for rejecting eager loading. And that it's actually a valid tool.

It's not about replacing one thing with another, it's about having the right tools and using them appropriately.

crnjakovic on 12 Oct 2018

Thank you for contributing to this issue. As it has been 90 days since the last activity, we are automatically closing the issue. This is often because the request was already solved in some way and it just wasn't updated or it's no longer applicable. If that's not the case, please feel free to either reopen this issue or open a new one. We will be more than happy to look at it again! You can read more here: https://blog.phalconphp.com/post/github-closing-old-issues

stale[bot] on 10 Jan 2019

This issue no longer discussed and will never be realized?
100500 requests when reading an object with its additional data is very suboptimal.
And the use of pure SQL at each step negates the advantages of the models.