Pkp-lib: Support iteration for DAO results

Created on 2 Oct 2019  路  20Comments  路  Source: pkp/pkp-lib

Currently the service classes often return arrays from getMany-type calls, which is convenient but requires all results to be loaded in-memory at once. This can be too resource-intensive, e.g. for long lists of submissions.

Instead, introduce a pattern for using an Iterator implementation. This will allow for foreach-based traversal with one-at-a-time instantiation.

All 20 comments

PRs:

@NateWr, mind taking a look? Hopefully this substantiates what I was raving about in Slack.

Tweaked your suggestions -- just missing your blessing for the greater approach :)

Ok, I've slept on it and here are my thoughts.

  1. Yes, let's go with an Iterator. Or a Generator if you think it will be more clear about the rewind limitation. I would say lets stick with what seems to be the most conventional approach in modern PHP.
  2. Let's change the name in the EntityReadInterface from getMany to getIterator or getGenerator. I think that will make it clearer that we're not getting a collection of objects.
  3. If we use a generator instead of an iterator, can we still use iterator_to_array()? If not, let's write a method like Generator::toArray() which we can use when needed.
  4. Once this is done, let's check each Hook to see where this might impact what's being passed down, and decide on a case-by-case basis how we want that Hook to be consumed: either a) convert the result set to an array, b) remove the result set completely so that they have to re-fetch the data, or c) move the position of the hook into the loop so that instead of passing a collection of objects the hook is fired separately for each object.

Does that sound alright? Thanks for talking me through all of this.

Yes, let's go with an Iterator. Or a Generator if you think it will be more clear about the rewind limitation. I would say lets stick with what seems to be the most conventional approach in modern PHP.

I think I'll stick with Iterator. The Generator interface extends Iterator anyway, so clearly the PHP dev team -- for what it's worth -- thinks non-rewindable Iterators are OK.

Let's change the name in the EntityReadInterface from getMany to getIterator or getGenerator. I think that will make it clearer that we're not getting a collection of objects.

I'm not so sure about this one; if we standardize to always returning iterators, and since iterators work with e.g. the foreach primitive, then I'm not sure it's worth disruptively removing getMany in favour of a new method name. It's true that we'd be changing the return value to a more restrictive type, but e.g. attempting to re-iterate [so to speak] will throw an Exception, so I think it's a lesser evil.

If we use a generator instead of an iterator, can we still use iterator_to_array()? If not, let's write a method like Generator::toArray() which we can use when needed.

Moot, since I'm OK with using an Iterator. But yes, Generators should still support iterator_to_array.

Once this is done, let's check each Hook...

Yes, can do.

Ok, I give in. Let's stick with getMany(). :smile:

Thanks, @NateWr! All merged.

@ajnyga, this issue in a nutshell (for the sake of PPS): Going forward, the Service class getMany functions should return an Iterator (specifically a DAOResultIterator) instead of an Array. You can do this by returning DAOResultFactory::toIterator() from your getMany implementation (in place of DAOResultFactory::toArray()).

The benefit is lower memory usage, since toArray involved instantiating all objects at once in an array. This approach only instantiates one at a time.

The downside is that you can only foreach through each return value once. (This is almost never required in the code, though.) To work around it, you can use:

$resultsAsArray = iterator_to_array($resultsIterator);

Attempting to foreach through a DAOResultIterator a second time will result in an exception.

@asmecher Do we need an OMP PR as well?

Also, sorry for the late code follow-up. I notice that the return statements all say Iterator:

@return Iterator

But should this say DAOResultIterator to be more precise? (My code editor will do code hinting/completion based on the return object type from a method.)

Do we need an OMP PR as well?

No -- OMP is fine as is. (Tests pass and some initial testing goes OK, though I wouldn't be surprised if we turned up one or two regressions during testing.)

But should this say DAOResultIterator to be more precise?

I changed Iterator to \Iterator; not sure if that helps already? DAOResultIterator doesn't expose any functionality beyond what Iterator does, so I would expect them to look the same in the hinting/completion tool.

Ok, good to know. I though DAOResultIterator might help us document that it doesn't rewind like Iterators do, but code completion is the main thing.

I found a few spots where !empty($result) is called on the DAOResuIterator or array_map was used without converting it to an array. I also renamed the variables to $result rather than, eg, $submissions, to help clarify the usage.

@asmecher can you take a look and let me know if this is alright?

PRs:
https://github.com/pkp/pkp-lib/pull/5206
https://github.com/pkp/ojs/pull/2512
https://github.com/pkp/omp/pull/717

@NateWr, good catches, thanks. What do you think about using the Countable interface and count($xyz) != 0 to replace the broken empty checks?

That can work too, and probably reads clearer than $result->valid(). We should be able to do just if (count($xyz)) right?

Yes, that'll do it. In the Countable implementation, you can use $this->_resultFactory->getCount().

All merged -- thanks, @NateWr! Personally I prefer e.g. $submissions over $results (more descriptive) but no big deal either way.

Personally I prefer e.g. $submissions over $results (more descriptive) but no big deal either way.

Yes, I'm torn on this as it seems easier to read with $submissions. But I also ran into exactly that confusion during the pair programming session because I made the assumption that I could call empty($submissions).

I'd be happy to also call it $submissionsIterator or something too, but I think, for myself at least, calling it $submissions will be a recurrent source of bugs from lack of attention.

Agreed, with a slight preference for $submissionsIterator (though it's long). I fear this stems from one of those legitimate criticisms of PHP.

However, if we're running a complex SELECT in order to just COUNT the results (or check if there were any), that's a wasteful pattern -- maybe there's some underlying work there.

However, if we're running a complex SELECT in order to just COUNT the results (or check if there were any), that's a wasteful pattern -- maybe there's some underlying work there.

Usually we're not just counting the results, but checking if there were 0 results before doing something else. (For example, when getting the list of issues, we check if there were any future issues before adding the "---Future Issues----" option to the select field.)

Yup, thinking about this further and seeing the change rolled out through the applications, this was definitely a worthwhile clarification. Thanks, @NateWr.

Was this page helpful?
0 / 5 - 0 ratings