I believe I have found a bug in line #132 of UpdateDispatcherJob.php (see https://github.com/SemanticMediaWiki/SemanticMediaWiki/blob/2.5.x/src/MediaWiki/Jobs/UpdateDispatcherJob.php )
The bug is in the code that selects pages that need be updated because of a change in a semantically related page (i.e. they have a property pointing to the changed or deleted page). It results in (also) pages being selected that are not related to the changed or deleted page.
I believe the bug is caused by using the function 'getAllPropertySubjects( $property )' in line #132 of the code. This function is used to return all subjects with a specific value (a reference to the changed or deleted page) for the given property, but instead it returns all subjects with any value for that property (which is according to its specification; the function itself is correct, but it should not be used here). In wikis where page-type properties are used many times, this may lead to excessive updating of pages that are not impacted by a change.
Output from job execution for the example given below:
2018-08-15 12:29:11 SMW\UpdateDispatcherJob A semanticData={"subject":"A#0#","data":[{"property":"Has_pointer_to","dataitem":[{"type":9,"item":"A#0#"}]}],"serializer":"SMW\\Serializers\\SemanticDataSerializer","version":0.1} restricted.disp.pool=1 requestId=d5d127e27c90c194b1bb3db3 (id=29515,timestamp=20180815122905) STARTING
2018-08-15 12:29:11 SMW\UpdateDispatcherJob A semanticData={"subject":"A#0#","data":[{"property":"Has_pointer_to","dataitem":[{"type":9,"item":"A#0#"}]}],"serializer":"SMW\\Serializers\\SemanticDataSerializer","version":0.1} restricted.disp.pool=1 requestId=d5d127e27c90c194b1bb3db3 (id=29515,timestamp=20180815122905) t=27 good
2018-08-15 12:29:11 htmlCacheUpdate A table=pagelinks recursive=1 rootJobIsSelf=1 rootJobSignature=1720786a8ca246eb2bc0281026f90a2bcea197ea rootJobTimestamp=20180815122905 requestId=d5d127e27c90c194b1bb3db3 (id=29517,timestamp=20180815122905) STARTING
2018-08-15 12:29:11 htmlCacheUpdate A table=pagelinks recursive=1 rootJobIsSelf=1 rootJobSignature=1720786a8ca246eb2bc0281026f90a2bcea197ea rootJobTimestamp=20180815122905 requestId=d5d127e27c90c194b1bb3db3 (id=29517,timestamp=20180815122905) t=5 good
2018-08-15 12:29:11 htmlCacheUpdate A pages={"49061":[0,"B"]} rootJobSignature=1720786a8ca246eb2bc0281026f90a2bcea197ea rootJobTimestamp=20180815122905 requestId=d5d127e27c90c194b1bb3db3 (id=29519,timestamp=20180815122911) STARTING
2018-08-15 12:29:11 htmlCacheUpdate A pages={"49061":[0,"B"]} rootJobSignature=1720786a8ca246eb2bc0281026f90a2bcea197ea rootJobTimestamp=20180815122905 requestId=d5d127e27c90c194b1bb3db3 (id=29519,timestamp=20180815122911) t=5 good
2018-08-15 12:29:11 deleteLinks A pageId=49060 timestamp=20180815122905 requestId=d5d127e27c90c194b1bb3db3 (id=29514,timestamp=20180815122905) STARTING
2018-08-15 12:29:11 deleteLinks A pageId=49060 timestamp=20180815122905 requestId=d5d127e27c90c194b1bb3db3 (id=29514,timestamp=20180815122905) t=13 good
2018-08-15 12:29:11 recentChangesUpdate Speciaal:RecenteWijzigingen type=cacheUpdate requestId=d5d127e27c90c194b1bb3db3 (id=29516,timestamp=20180815122905) STARTING
2018-08-15 12:29:11 recentChangesUpdate Speciaal:RecenteWijzigingen type=cacheUpdate requestId=d5d127e27c90c194b1bb3db3 (id=29516,timestamp=20180815122905) t=4 good
2018-08-15 12:29:11 SMW\UpdateDispatcherJob UpdateDispatcherChunkedJobList::7848097235e0a464517e27093c1f2e5f job-list={"B#0##":true,"C#0##":true} requestId=d5d127e27c90c194b1bb3db3 (id=29518,timestamp=20180815122911) STARTING
2018-08-15 12:29:11 SMW\UpdateDispatcherJob UpdateDispatcherChunkedJobList::7848097235e0a464517e27093c1f2e5f job-list={"B#0##":true,"C#0##":true} requestId=d5d127e27c90c194b1bb3db3 (id=29518,timestamp=20180815122911) t=3 good
2018-08-15 12:29:11 SMW\UpdateJob B forcedUpdate=1 requestId=d5d127e27c90c194b1bb3db3 (id=29520,timestamp=20180815122911) STARTING
2018-08-15 12:29:11 SMW\UpdateJob B forcedUpdate=1 requestId=d5d127e27c90c194b1bb3db3 (id=29520,timestamp=20180815122911) t=146 good
2018-08-15 12:29:11 SMW\UpdateJob C forcedUpdate=1 requestId=d5d127e27c90c194b1bb3db3 (id=29521,timestamp=20180815122911) STARTING
2018-08-15 12:29:11 SMW\UpdateJob C forcedUpdate=1 requestId=d5d127e27c90c194b1bb3db3 (id=29521,timestamp=20180815122911) t=17 good
Consider a semantic wiki with these pages:
'Has pointer to' is a property of type Page.
When deleting page A, the SMWUpdateDispatcherJob will select subjects (pages) that refer to A by means of a property, and insert an SMWUpdateJob for each subject found -- in this example: page B -- for a 'forcedUpdate'.
However, the code incorrectly selects both pages B and C. C has no relation to A and should not be updated.
Thanks a lot for reporting! Pinging @mwjames for commenting.
Some quick notes:
Thanks a lot for reporting! Pinging @mwjames for commenting.
When deleting page A, the SMWUpdateDispatcherJob will select subjects (pages) that refer to A by means of a property, and insert an SMWUpdateJob for each subject found -- in this example: page B -- for a 'forcedUpdate'.
However, the code incorrectly selects both pages B and C. C has no relation to A and should not be updated.
There are some notes on a ticked that was opened by Wikia in regards
to UpdateDispatcherJob and the switch to the restricted.disp.pool.
In short calling, getAllPropertySubjects is the best effort we make
to fetch all potential connected entities that may have been assigned
to the now deleted entity (connected by properties we identified
before the actual delete). Since the work is done in background (and
we want the work to be done in background anything else is not
scalable, see #948) by the time the job is executed no active reference exists
as ID for the deleted entity (as in your example A) so calling
getPropertySubjects wouldn't allow us to fetch a list of entities
here because we no longer are able to find those pairs of matching IDs
since the entity A is already removed (or better is marked outdated)
to avoid any active reuse as well to make sure it reflects the actual
state of the ArticleDelete action for A.
If you wanted to filter entities that are not connected to A you
would need to do some extra work by making a literal comparison
instead of an ID related (as in case of getPropertySubjects) on
those values for all potential involved subjects with something like
this:
@@ -4,10 +4,11 @@ namespace SMW\MediaWiki\Jobs;
use Hooks;
use SMW\ApplicationFactory;
use SMW\DIProperty;
use SMW\DIWikiPage;
+use SMW\DataTypeRegistry;
use Title;
/**
* Dispatcher to find and create individual UpdateJob instances for a specific
* subject and its linked entities.
@@ -131,12 +132,47 @@ class UpdateDispatcherJob extends JobBase {
if ( !$property->isUserDefined() ) {
continue;
}
+ // Before doing some work, make sure to only use page type properties
+ // as a means to generate a resource (job) action
+ $type = DataTypeRegistry::getInstance()->getDataItemByType(
+ $property->findPropertyTypeId()
+ );
+
+ if ( $type !== \SMWDataItem::TYPE_WIKIPAGE ) {
+ continue;
+ }
+
+ $subject_list = [];
+
+ // Best effort to find all entities to a property that was an in-coming
+ // property for the deleted subject
+ $subjects = $this->store->getAllPropertySubjects( $property );
+
+ // Identify the source as bases for comparison
+ $source = DIWikiPage::newFromTitle( $this->getTitle() );
+
+ foreach ( $subjects as $subject ) {
+
+ // Investigate which subjects have an actual connection to the
+ // now deleted subject
+ $dataItems = $this->store->getPropertyValues( $subject, $property );
+
+ foreach ( $dataItems as $dataItem ) {
+
+ // Make a judgment based on a literal comparison for the
+ // values assigned and the now deleted entity
+ if ( $dataItem instanceof DIWikiPage && $dataItem->equals( $source ) ) {
+ $subject_list[] = $subject;
+ }
+ }
+ }
+
$this->addUniqueSubjectsToUpdateJobList(
- $this->store->getAllPropertySubjects( $property )
+ $subject_list
);
}
}
PS: I haven't tested this (means I might have missed something or
additional optimization is possible) and I cannot elaborate on any
performance impact for the additional selects especially on large
lists but the execution is done in background and in a separate
transaction so that it should be workable.
On 8/15/18, Karsten Hoffmeyer notifications@github.com wrote:
Thanks a lot for reporting! Pinging @mwjames for commenting.
--
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
https://github.com/SemanticMediaWiki/SemanticMediaWiki/issues/3322#issuecomment-413191110
Hi,
Thanks for the quick response! I did some testing and the fix works as expected.
In my original usecase, I have properties that are used on >1000 pages, but rarely more than five of those point to the deleted page. So the dispatcher job may take longer to run but that's a small price for preventing more than a thousand unnecessary update jobs to be executed. We'll do some more testing on this.
We'll do some more testing on this.
That would be appreciated and if possible I'd like to close this issue with the next weekend sprint given a favourable feedback on the suggested changes.
We have had the fix running on a test wiki this week. It works and we haven't seen any issues so it's a thumbs up from my side.
We have had the fix running on a test wiki this week. It works and we haven't seen any issues so it's a thumbs up from my side.
Very well then, I'll merge #3340 into master and you can make a PR for the 2.5 branch based on that if you want to see this to appear with 2.5.8.
Very well then, I'll merge #3340 into master and you can make a PR for the 2.5 branch based on that if you want to see this to appear with 2.5.8.
If it is just a matter of cherry-picking I could do it. If it needs more work ...
If it is just a matter of cherry-picking I could do it.
Simply cherry-picking should work.
Simply cherry-picking should work.
Thanks for the info. Back-ported to 2.5.x with d9b3bcf8fcc707de40a5a4ee0f8283b02d195aa7
Most helpful comment
Some quick notes:
There are some notes on a ticked that was opened by Wikia in regards
to
UpdateDispatcherJoband the switch to therestricted.disp.pool.In short calling,
getAllPropertySubjectsis the best effort we maketo fetch all potential connected entities that may have been assigned
to the now deleted entity (connected by properties we identified
before the actual delete). Since the work is done in background (and
we want the work to be done in background anything else is not
scalable, see #948) by the time the job is executed no active reference exists
as ID for the deleted entity (as in your example
A) so callinggetPropertySubjectswouldn't allow us to fetch a list of entitieshere because we no longer are able to find those pairs of matching IDs
since the entity
Ais already removed (or better is marked outdated)to avoid any active reuse as well to make sure it reflects the actual
state of the
ArticleDeleteaction forA.If you wanted to filter entities that are not connected to
Ayouwould need to do some extra work by making a literal comparison
instead of an ID related (as in case of
getPropertySubjects) onthose values for all potential involved subjects with something like
this:
PS: I haven't tested this (means I might have missed something or
additional optimization is possible) and I cannot elaborate on any
performance impact for the additional selects especially on large
lists but the execution is done in background and in a separate
transaction so that it should be workable.
On 8/15/18, Karsten Hoffmeyer notifications@github.com wrote: