I tried running yii\db\Query::batch() on a big table and got a memory leak. This happens because \yii\db\Command::queryInternal() runs
$this->pdoStatement->execute();
with the query buffered by default. I changed the attribute PDO::MYSQL_ATTR_USE_BUFFERED_QUERY
to false
and that fixed it for me.
I think we should add a description of this behavior to the documentation.
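For reference, one way to set the attribute is in the db component configuration; a sketch with placeholder DSN and credentials (note that this disables buffering for every query on that connection, not just batch()):
'components' => [
    'db' => [
        'class' => \yii\db\Connection::class,
        'dsn' => 'mysql:host=localhost;dbname=app',
        'username' => 'app',
        'password' => 'secret',
        'attributes' => [
            // applied by yii\db\Connection when the connection is opened
            \PDO::MYSQL_ATTR_USE_BUFFERED_QUERY => false,
        ],
    ],
],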
Could you specify where you set the attribute?
I tried it both in the config and at runtime (Yii::$app->db->pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false)) and got error 2014 ("Cannot execute queries while other unbuffered queries are active").
As I understand the problem, you did not call fetch or execute after creating the request.
You could also try the following before I check my own code this evening:
$command->reset()
In any case, you can attach your code and the error stack trace so we can better understand your problem.
Alternatively, you can call setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false) just for the batch() call and switch the attribute back to true afterwards for other use cases.
Unfortunately, I cannot find my code that used this attribute; maybe it was in a previous project. But you can use the advice above, or show your code and we will try to help.
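A minimal sketch of that toggle-around-batch() idea (the table name is a placeholder; it assumes the default application db connection is MySQL and is opened first):
// Switch to unbuffered mode just for the batched query, then restore buffered mode.
\Yii::$app->db->open();
$pdo = \Yii::$app->db->pdo;
$pdo->setAttribute(\PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);
try {
    $query = (new \yii\db\Query())->from('big_table');
    foreach ($query->batch(500) as $rows) {
        // process $rows ...
    }
} finally {
    // restore the default so other queries behave as before
    $pdo->setAttribute(\PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, true);
}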
Unable to reproduce the issue, see #11152.
Any update on this issue?
I can reproduce the issue, working on a fix.
What is the opinion of the Yii team concerning configurability?
One could argue that it never makes sense to use batch in combination with buffered mode and that the default mode should be unbuffered.
Alternatively, we could keep the default unchanged and just add a parameter that allows explicitly disabling buffered mode (meaning it won't break BC).
Currently testing with unbuffered mode, there are some issues (for example, the schema query SHOW FULL COLUMNS FROM xx cannot run while the unbuffered result set is still open). I've just made a proof of concept that circumvents this issue; I'd love some feedback on it:
When running an unbuffered query, clone the database connection and use the cloned connection for the unbuffered query.
The advantage of this approach is that we can guarantee there are no issues with other queries (since the cloned connection is not accessible outside the scope of the batch query object(s)).
The downside is that it costs an extra connection (meaning connection set-up delay). It could also give strange behavior when combined with transactions.
Also, unbuffered mode will not work with relations that are not loaded through explicit joins, since these require separate queries.
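A minimal sketch of that cloned-connection idea (not the actual proof of concept; the table name is a placeholder, and it relies on yii\db\Connection resetting its PDO handle when cloned):
// Clone the main connection and open it in unbuffered mode; the clone starts out closed,
// so opening it creates a second, independent PDO connection.
$unbuffered = clone \Yii::$app->db;
$unbuffered->attributes = [\PDO::MYSQL_ATTR_USE_BUFFERED_QUERY => false]
    + (array) $unbuffered->attributes; // merge with any attributes from the config
$unbuffered->open();

$query = (new \yii\db\Query())->from('big_table');
foreach ($query->batch(1000, $unbuffered) as $rows) {
    // the main connection stays buffered and can run other queries meanwhile
}
$unbuffered->close();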
We can conclude several things:
It doesn't make sense to batch in buffered mode. That is for sure. Issues with unbuffered mode are significant though...
Does it even make sense to have batch() as a function? I always assumed it would actually get the results one batch at a time (reducing delays caused by roundtrips), but this does not seem to be the case. Using ->each() is just as efficient at the moment.
We could create a separate unbufferedBatch that comes with appropriate warnings (I am willing to implement it) and that just clones the connection and uses that. Not sure this needs to be in the core though; it could just as well be released as a separate yii2-big-data extension.
Probably not, considering the difficulties. I'd deprecate batch() altogether...
@klimov-paul, @cebe, @SilverFire, @arogachev opinions?
One thing to consider is batch() in https://github.com/yiisoft/yii2-elasticsearch. @beowulfenator would you please check whether batch() is redundant in the ES extension or not?
In the ES extension batch() works great, and it is in fact the only way to get large result sets because of the way ES results are fetched. (In the ES query API deep pagination is extremely costly, and batch() and each() use the scan/scroll API.)
As for MySQL, I routinely use each() and batch() to fetch huge amounts of data. My typical use case is as follows:
$unbufferedDb = new \yii\db\Connection([
    'dsn' => Yii::$app->db->dsn,
    'username' => Yii::$app->db->username,
    'password' => Yii::$app->db->password,
    'charset' => Yii::$app->db->charset,
]);
$unbufferedDb->open();
$unbufferedDb->pdo->setAttribute(\PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);

foreach ($query->batch(1000, $unbufferedDb) as $batch) {
    //...
}
To me batch() and each() are useful and interchangeable in the sense that sometimes I need to iterate over entries, and sometimes over batches of entries. When I iterate over batches, I tend to print a dot every time a batch is processed to give users a sense of progress.
I believe you should not deprecate these methods; just explain in the docs that they will fail if the result set is too large for MySQL's internal buffer.
@beowulfenator If you use unbuffered mode, do you manually load the database schema before querying the models? Otherwise you'll run into problems.
Also, the current batch without unbuffered mode basically just eats your server's memory while ignoring the PHP memory limit.
If we choose not to deprecate batch in the RDBMS connections, it could make sense to create a better default implementation. For example:
As said before, these could also be implemented in a plugin. In that case the functions in the default implementation could just be left unchanged (and in the future throw an exception pointing out that an extension is needed).
Note that technically the functionality works just fine as it is now, it just offers no real benefits over ->all().
@beowulfenator are there cases when you prefer batch() to each()?
We could support that via a different construct, something like an iterator buffer. Something like this: https://github.com/nikic/iter/blob/master/src/iter.php#L748
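For illustration, a generic chunking helper along those lines could live in user land; this is a hypothetical function, not part of Yii or nikic/iter, and it reuses the $query/$unbufferedDb variables from the earlier example:
/**
 * Hypothetical helper: regroup any iterable (e.g. the one returned by Query::each())
 * into arrays of at most $size items, similar to iter\chunk() from nikic/iter.
 */
function chunked(iterable $items, int $size): \Generator
{
    $buffer = [];
    foreach ($items as $item) {
        $buffer[] = $item;
        if (count($buffer) >= $size) {
            yield $buffer;
            $buffer = [];
        }
    }
    if ($buffer !== []) {
        yield $buffer; // last, possibly smaller, chunk
    }
}

// Batch-style iteration (with a progress dot) built on top of each():
foreach (chunked($query->each(1000, $unbufferedDb), 1000) as $batch) {
    // ... process $batch
    echo '.';
}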
@samdark I use batch to indicate progress:
foreach ($query->batch(1000, $unbufferedDb) as $batch) {
    //...
    foreach ($batch as $model) {
        //...
    }
    echo '.';
}
@SamMousa I've never had any problems with loading the schema because I'm using a separate unbuffered connection for the big query.
Perhaps I misunderstand the problem, but to me it seems that on the PHP side batch() and each() work exactly as intended. We execute a SQL statement, get an iterator that points to the result set, and proceed with reading the data. So if you specify a batch size of 1000 and PHP doesn't have enough RAM to load 1000 models, that's not really an implementation issue. This is distinctly different from all(), which will load all models into RAM, not just a specified chunk.
The problem with batch() in MySQL is on the server side, because MySQL tries to load the whole result set into its buffer. It may or may not be true for other DBMSes. As far as I know, there is no way to query a very large result set in MySQL with a buffered query. But this is a MySQL-specific caveat that should be described in the docs. There's no point deprecating batch() because of it.
@beowulfenator No, it's a client-side buffer. The MySQL driver will load all data locally regardless of whether you are batching (in buffered mode). The only difference is that the driver's memory use does not count towards the PHP memory_limit setting. This actually makes the problem bigger, since it allows PHP to use memory until the process gets killed by the OS. So even with a batch size of 1 your PHP process can still crash.
My point is: if your data set fits in memory there is no reason for "batching it", as opposed to copying it to PHP directly (using all()). If it doesn't fit in memory, then batching doesn't help.
Regarding deprecation, I agree that that is not the right solution since it is part of the contract and extensions implement it as well. There are alternatives though, such as not implementing it. As it stands now, it doesn't actually work well if you do proper memory usage measurements.
Regarding your specific use case, you could just as well do:
foreach ($q->all() as $i => $m) {
    // ...
    $i % 1000 || print '.';
}
Of course that's not really relevant; there are plenty of solutions to that problem.
In my opinion batch and each are for iterating over large result sets in an efficient manner. The default implementation for RDBMSes doesn't really do that.
What if we had a final scenario where it went like this (not sure about the route to get there): the default ->batch() stays as it is in the core, and real large-data support lives in a separate yii2-big-data extension.
@SamMousa OK, so then there is _also_ a client-side buffer. Because in my case, selecting several million entries from a 10 GB table with a buffered query, in hopes of iterating them sequentially, brings down the MySQL server first! (The server in question has 1 GB of RAM.)
Are you absolutely certain that the PDO driver really loads the whole dataset into RAM locally if the query is buffered?
Just checked, you're right: in buffered mode batching is useless.
I would argue against automatically disabling the transactions on the main connection and automatically creating unbuffered connections.
Transactions only create issues when they deal with the same table. This is not always the case. Automatically creating an unbuffered connection for every batch query is wasteful because then this connection cannot be reused.
I think the problem can be formalized as follows: batch() and each() make no sense on buffered connections. How about this approach: in the core framework, in the BatchQueryResult class, we check whether its db property is an unbuffered connection, and if it's not, we throw a warning.
That way the user can either manually provide an unbuffered connection or use an extension.
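A rough sketch of what such a check could look like (not actual framework code; it assumes the connection is already open and that pdo_mysql exposes the attribute via getAttribute()):
// Hypothetical guard that BatchQueryResult could run before fetching.
function assertUnbuffered(\yii\db\Connection $db): void
{
    if ($db->getDriverName() === 'mysql'
        && $db->pdo->getAttribute(\PDO::MYSQL_ATTR_USE_BUFFERED_QUERY)
    ) {
        throw new \yii\base\InvalidConfigException(
            'batch()/each() on MySQL require PDO::MYSQL_ATTR_USE_BUFFERED_QUERY = false.'
        );
    }
}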
Transactions only create issues when they deal with the same table. This is not always the case.
It might not always be the case, but it can lead to hard-to-debug problems; think of things like foreign key constraints. We could make it optional though (as an argument to batch()).
Automatically creating an unbuffered connection for every batch query is wasteful because then this connection can not be reused.
When doing large queries where this kind of batching is required, connection setup is minimal compared to the total time spent; if you iterate over 1 million records, the setup time of 1 vs. 2 connections is irrelevant.
I think the problem can be formalized as follows: batch() and each() make no sense in buffered connections. How about this approach: in the core framework in the BatchQueryResult class we check whether its db property is an unbuffered connection, and if it's not, we throw a warning.
That could work; throwing a warning I assume means an exception?
We need to guarantee the database schema is known (otherwise the AR layer will try to do a SHOW FULL COLUMNS FROM xx).
@samdark I think that because of this schema query this is actually a bug, not just a documentation issue. I've just tried to do a batch query with ActiveRecord from a console command where I've set MYSQL_ATTR_USE_BUFFERED_QUERY to false,
and it breaks with a "Cannot execute queries while other unbuffered queries are active." error. So basically you cannot use large batch queries with AR (unless the schema happens to be cached, which we cannot rely on).
UPDATE: My workaround right now is to create a dummy instance of the AR model. This will trigger a schema query, so that the main query will not be interrupted anymore:
@mikehaertl Just make two connections, and use the one that's unbuffered for batch.
@beowulfenator Thanks, but I think creating an empty AR instance is easier:
$dummy = new SomeModel(['id' => 0]); // triggers the schema query while the connection is still buffered
SomeModel::getDb()->pdo->setAttribute(\PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);
foreach (SomeModel::find()->each() as $model) {
    // ...
}
Anyway, both are just workarounds. I think a proper fix should check in batch() and each() whether it's an active query, and trigger a schema query first.
Adding a second DB connection is something that the framework should not decide on its own. There may be side effects in certain scenarios. As a developer I want to know how many DB connections my app opens.
@mikehaertl The problem is that there is no reliable way to determine the required schemas. (There is an upper bound defined by all relations that are selected for eager loading.)
This will get messy very quickly as soon as relations are involved since Yii by default uses a separate query for loading related records.
There is no solution that handles ALL cases adequately.
@SamMousa Hmm, I see. As there's no silver bullet we should at least make this very clear in the documentation. To sum up what's missing there right now:
- To use batch() or each() on MySQL queries, you have to set PDO::MYSQL_ATTR_USE_BUFFERED_QUERY to false.
- For ActiveQuery, setting the above is not enough. You also have to let Yii load the schemas first, e.g. by calling MyModel::getTableSchema() on all AR classes involved in the query.

That only applies if you have a dataset that does not fit in memory. (I personally do all iteration with batch and each, whether the dataset is large or small.)
It should be possible for Yii to load all schemas before executing the active query; I would need to take a closer look at this. If we refactor that, things would improve:
That would resolve the "simple" cases where you just enable unbuffered queries.
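Until then, the manual version of that schema preloading looks roughly like this (Product is a placeholder ActiveRecord class):
// Preload the schema while the connection can still serve a buffered query.
Product::getTableSchema(); // runs SHOW FULL COLUMNS up front

\Yii::$app->db->pdo->setAttribute(\PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);

foreach (Product::find()->each(100) as $product) {
    // no schema query is needed during iteration, so the unbuffered result set is not
    // interrupted (relations loaded via separate queries would still break, as noted earlier)
}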
That only applies if you have a dataset that does not fit in memory.
I'd argue that this is the whole point of doing a batch() query in the first place. You say you always use it, but for no reason. Or what is the benefit for you?
It expresses more explicitly what I need to do. I need to iterate over results, not generate an array and iterate over the array.
Also there is the case where PHP has a memory limit, say 128M, but the dataset is larger, say 150M. Since 150M will very likely fit in server memory but not within the PHP memory limit, each will work here.
The driver's memory does not count towards PHP's memory limit.
Edit: of course that is also a good way to eat all your server's memory and get your PHP-FPM process killed :)
@mikehaertl Buffered queries also require a substantial amount of RAM on the server. The reason I started using unbuffered queries is the need to iterate over a whole 10 GB table on a server with 1 GB of RAM. SELECT * FROM my_table
would simply bring the server down in buffered mode.
I think that the _simple_ (and ugly) way to solve this is to add a pair of auto-generated offset/limit parameters to the query; it decreases the size of each result set, which avoids the memory issue.
Every time batch() is called, the offset/limit parameters should be adjusted by the $size of the batch.
The problem is that an ORDER BY on the primary key (or something else) may be needed to keep the order of the dataset stable.
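For concreteness, the proposal amounts to something like the following sketch (placeholder table and column names; the objections below still apply):
$size = 1000;
$offset = 0;
do {
    $rows = (new \yii\db\Query())
        ->from('big_table')
        ->orderBy(['id' => SORT_ASC]) // a stable order is required for paging
        ->offset($offset)
        ->limit($size)
        ->all();
    foreach ($rows as $row) {
        // process $row ...
    }
    $offset += $size;
} while (count($rows) === $size);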
@ihipop That's not a bad idea, actually. Not really ugly, and it is essentially what we do in DataProvider with pagination. @samdark, what do you think about using offset/limit to implement batch() and each()?
It is a really bad idea. Test it, the performance will be terrible. It will also not provide a consistent view of the data unless implemented with severe locking, effectively blocking all writes to the table(s) in question.
OK.
@SamMousa The consistent view of the data is what I meant when I mentioned that an ORDER BY on the primary key or something else should be done to keep the order of the dataset.
@ihipop It is not.
Let me explain all my arguments against your method. Say the result set currently contains ids 1 and 15: what happens if I insert 2 during your iteration? Most likely it'll be skipped or cause some other entry to be returned twice.
An alternative, if you really want one, is this; it is still far from optimal and only works for integer keys. We use this method in production for slow-running jobs, and it only provides an upper bound for the number of results: select ranges of the primary key, e.g. WHERE id > 10 AND id <= 20. This approach is a lot faster but can still miss records inserted during processing. It will never return records twice.
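A rough sketch of that id-range approach (placeholder table name, assuming an integer primary key id):
$size = 1000;
$maxId = (int) (new \yii\db\Query())->from('big_table')->max('id');
for ($from = 0; $from < $maxId; $from += $size) {
    // each range yields at most $size rows; gaps in the id sequence yield fewer
    $rows = (new \yii\db\Query())
        ->from('big_table')
        ->where(['and', ['>', 'id', $from], ['<=', 'id', $from + $size]])
        ->all();
    foreach ($rows as $row) {
        // process $row ...
    }
}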
As I've said before, no solution is valid for all cases and therefore we should not just implement something. Instead implement it yourself as a custom function or behavior in your query class.
@beowulfenator you mention buffered queries require RAM on the server; this is not clear to me. Do you mean the PHP or DB server?
@SamMousa I mean the DB server. Executing a buffered MySQL query with the result set that is too large to fit into MySQL server's RAM brings down the MySQL server. At least that's how it is for me.
Hmm, that seems strange to me, since the buffering is happening in the PDO driver as far as I know.
Finally, I wrote a new static method that returns an unbuffered $db connection instance, which solves my problem at an acceptable cost:
public static function getUnbufferedMysqlDb($db, $db_identifier = null)
{
    $db_string = '';
    if (is_string($db)) { // support passing a component definition ID as a string, e.g. $db = 'db'
        $db_string = $db;
        if (empty($db_identifier)) {
            $db_identifier = $db;
        }
        $db = Yii::$app->get($db); // resolve the component ID to the actual component
    }
    if (!($db instanceof \yii\db\Connection) || !strstr($db->getDriverName(), 'mysql')) { // safety check
        throw new InvalidParamException('Not a MySQL component');
    }
    if (empty($db_identifier)) { // generate a component ID if none was provided
        $db_identifier = md5(sprintf("%s%s%s%s", $db->dsn, $db->username, $db->password,
            var_export($db->attributes, true)));
    }
    $db_identifier = 'unbuffered_' . $db_identifier;
    if (!Yii::$app->has($db_identifier)) {
        if ($db_string) {
            $_unbuffered_db = Yii::$app->getComponents()[$db_string]; // copy the configuration array
            $_unbuffered_db['attributes'][\PDO::MYSQL_ATTR_USE_BUFFERED_QUERY] = false;
        } else {
            $_ = clone $db;
            $_->close(); // make sure the connection is inactive, because PDO cannot be serialized
            /** @var $_unbuffered_db \yii\db\Connection */
            $_unbuffered_db = unserialize(serialize($_)); // deep copy of the connection object, to be safe
            $_unbuffered_db->attributes[\PDO::MYSQL_ATTR_USE_BUFFERED_QUERY] = false;
        }
        Yii::$app->setComponents([$db_identifier => $_unbuffered_db]);
    }
    return Yii::$app->get($db_identifier);
}
When using it, pass it as the $db parameter of batch()/each(). It will establish a new connection to MySQL with PDO::MYSQL_ATTR_USE_BUFFERED_QUERY set to false that is only used for batch()/each(); other queries can be run as normal while batch()/each() is iterating.
$unbuffered_db = SomeClass::getUnbufferedMysqlDb($Model::getDb());
foreach ($activeQuery->batch(100, $unbuffered_db) as $foobars) {
    // process $foobars
}
This is also acceptable:
$unbuffered_db = SomeClass::getUnbufferedMysqlDb('some-db');
@mikehaertl
It may be worth keeping this feature, but the guide is misleading:
When working with large amounts of data, methods such as yii\db\Query::all() are not suitable because they require loading all data into the memory. To keep the memory requirement low, Yii provides the so-called batch query support. A batch query makes use of the data cursor and fetches data in batches.
And then proceeds to demonstrate the methods that fail to do this.
There needs to be a warning about memory exhaustion in PDO's buffered mode and advice how to avoid it.
Some elaboration on "makes use of the data cursor" would also be welcome.
@tom-- You're welcome to submit your PR for the documentation.
I believe this is the first time I have seen "PRs welcome" deployed in the Yii project.
I did not submit my PR because I do not have one. I was unable to author one because I don't understand the mechanisms Yii and its use of PDO well enough to explain the problem and advise on workarounds. I am still struggling to debug my own problem with it and the difficulty comes mostly from this lack of knowledge.
More expert contributors than myself have expressed differing opinion on what to do about this bug including scrap it, fix it, and keep it. My comment is simply that unless we scrap it pretty soon, the guide needs a fix. I also wanted to revive the discussion. I regret that now.
@tom-- Just do as in https://github.com/yiisoft/yii2/issues/8420#issuecomment-301423833.
It will solve your problem; maybe it is not the best solution, but it works.
@ihipop I tried your workaround now, as my solution (i.e. "hack") did not work with with() queries. Unfortunately I found that your solution suffers from the same problem. Say you have "product has many variations", and your batch query is for products using ->with('variations'). Yii then performs a second query for the related Variation records, using the main buffered db connection and a product_id IN ( ... ) condition with all the product ids.
Unfortunately it seems like the list of these product ids is not reset for each batch. It gets bigger and bigger, so that the same related records are queried again and again, plus the new ids on top.
So I still end up with an out of memory error.
UPDATE: Forget what I said, it was rather an issue with too many related records. I've decreased the batch size and it works now.
Having developed with Yii2 for 6 years, I was surprised to find out only today, when querying a table with a MEDIUMTEXT column, that the each() method can exhaust memory.
To sum up, there are basically three ways to handle the problem of iterating over each record in a result set (as discussed above).
One solution is to add more parameters to batch() and each(): the first is a mode, indicating which implementation to use; the other is implementation-specific parameters, such as the index column.
But these heterogeneous parameters would pollute the batch() and each() signatures.
Another solution is to add a separate service for each implementation. The usage would look like:
new RangeQueryIterator(User::find()->where(...), $batchSize)
It decouples the iterator from ActiveQuery, which is preferred.
Thanks for the ideas. @terabytesoftw this should be interesting for Yii 3.
An example of RangeQueryIterator:
<?php
namespace ZhiShiQ\Yii\Db;
use yii\db\ActiveQuery;
use yii\db\BatchQueryResult;
class RangeQueryIterator implements \Iterator
{
private $query;
private $pk;
/**
* @var BatchQueryResult
*/
private BatchQueryResult $batchIterator;
private $_batch;
private $_key;
private $_value;
public function __construct(ActiveQuery $query, $batchSize = 100)
{
$this->query = $query;
$this->pk = ($query->modelClass)::primaryKey();
$this->batchIterator = (clone $query)->select($this->pk)->batch($batchSize);
}
/**
* Destructor.
*/
public function __destruct()
{
$this->reset();
}
public function reset()
{
$this->_batch = null;
$this->_value = null;
$this->_key = null;
$this->batchIterator->reset();
}
public function rewind()
{
$this->reset();
$this->next();
}
public function next()
{
if ($this->_batch === null || next($this->_batch) === false) {
$this->batchIterator->next();
$keyCols = array_map(function ($val) {
$arr = [];
foreach ($this->pk as $key) {
$arr[$key] = $val[$key];
}
return $arr;
}, $this->batchIterator->current());
$this->_batch = (clone $this->query)->andWhere(['IN', $this->pk, $keyCols])->all();
//TODO to ensure the order is the same as $keyCols
reset($this->_batch);
}
$this->_value = current($this->_batch);
if ($this->query->indexBy !== null) {
$this->_key = key($this->_batch);
} elseif (key($this->_batch) !== null) {
$this->_key = $this->_key === null ? 0 : $this->_key + 1;
} else {
$this->_key = null;
}
}
public function key()
{
return $this->_key;
}
public function current()
{
return $this->_value;
}
public function valid()
{
return !empty($this->_batch);
}
}
foreach (new RangeQueryIterator(User::find()->where(...), $batchSize) as $user) {
    //xxx
}