Semanticmediawiki: UpdateDispatcherJob selecting too many subjects for forcedUpdate due to bug in line #132

Created on 15 Aug 2018  路  9Comments  路  Source: SemanticMediaWiki/SemanticMediaWiki

Setup and configuration

  • SMW version: 2.5.6
  • MW version: 1.27.4
  • PHP version: 5.5
  • DB system and version: MySQL

Issue

I believe I have found a bug in line #132 of UpdateDispatcherJob.php (see https://github.com/SemanticMediaWiki/SemanticMediaWiki/blob/2.5.x/src/MediaWiki/Jobs/UpdateDispatcherJob.php )

The bug is in the code that selects pages that need be updated because of a change in a semantically related page (i.e. they have a property pointing to the changed or deleted page). It results in (also) pages being selected that are not related to the changed or deleted page.

I believe the bug is caused by using the function 'getAllPropertySubjects( $property )' in line #132 of the code. This function is used to return all subjects with a specific value (a reference to the changed or deleted page) for the given property, but instead it returns all subjects with any value for that property (which is according to its specification; the function itself is correct, but it should not be used here). In wikis where page-type properties are used many times, this may lead to excessive updating of pages that are not impacted by a change.

Stack trace

Output from job execution for the example given below:

2018-08-15 12:29:11 SMW\UpdateDispatcherJob A semanticData={"subject":"A#0#","data":[{"property":"Has_pointer_to","dataitem":[{"type":9,"item":"A#0#"}]}],"serializer":"SMW\\Serializers\\SemanticDataSerializer","version":0.1} restricted.disp.pool=1 requestId=d5d127e27c90c194b1bb3db3 (id=29515,timestamp=20180815122905) STARTING
2018-08-15 12:29:11 SMW\UpdateDispatcherJob A semanticData={"subject":"A#0#","data":[{"property":"Has_pointer_to","dataitem":[{"type":9,"item":"A#0#"}]}],"serializer":"SMW\\Serializers\\SemanticDataSerializer","version":0.1} restricted.disp.pool=1 requestId=d5d127e27c90c194b1bb3db3 (id=29515,timestamp=20180815122905) t=27 good
2018-08-15 12:29:11 htmlCacheUpdate A table=pagelinks recursive=1 rootJobIsSelf=1 rootJobSignature=1720786a8ca246eb2bc0281026f90a2bcea197ea rootJobTimestamp=20180815122905 requestId=d5d127e27c90c194b1bb3db3 (id=29517,timestamp=20180815122905) STARTING
2018-08-15 12:29:11 htmlCacheUpdate A table=pagelinks recursive=1 rootJobIsSelf=1 rootJobSignature=1720786a8ca246eb2bc0281026f90a2bcea197ea rootJobTimestamp=20180815122905 requestId=d5d127e27c90c194b1bb3db3 (id=29517,timestamp=20180815122905) t=5 good
2018-08-15 12:29:11 htmlCacheUpdate A pages={"49061":[0,"B"]} rootJobSignature=1720786a8ca246eb2bc0281026f90a2bcea197ea rootJobTimestamp=20180815122905 requestId=d5d127e27c90c194b1bb3db3 (id=29519,timestamp=20180815122911) STARTING
2018-08-15 12:29:11 htmlCacheUpdate A pages={"49061":[0,"B"]} rootJobSignature=1720786a8ca246eb2bc0281026f90a2bcea197ea rootJobTimestamp=20180815122905 requestId=d5d127e27c90c194b1bb3db3 (id=29519,timestamp=20180815122911) t=5 good
2018-08-15 12:29:11 deleteLinks A pageId=49060 timestamp=20180815122905 requestId=d5d127e27c90c194b1bb3db3 (id=29514,timestamp=20180815122905) STARTING
2018-08-15 12:29:11 deleteLinks A pageId=49060 timestamp=20180815122905 requestId=d5d127e27c90c194b1bb3db3 (id=29514,timestamp=20180815122905) t=13 good
2018-08-15 12:29:11 recentChangesUpdate Speciaal:RecenteWijzigingen type=cacheUpdate requestId=d5d127e27c90c194b1bb3db3 (id=29516,timestamp=20180815122905) STARTING
2018-08-15 12:29:11 recentChangesUpdate Speciaal:RecenteWijzigingen type=cacheUpdate requestId=d5d127e27c90c194b1bb3db3 (id=29516,timestamp=20180815122905) t=4 good
2018-08-15 12:29:11 SMW\UpdateDispatcherJob UpdateDispatcherChunkedJobList::7848097235e0a464517e27093c1f2e5f job-list={"B#0##":true,"C#0##":true} requestId=d5d127e27c90c194b1bb3db3 (id=29518,timestamp=20180815122911) STARTING
2018-08-15 12:29:11 SMW\UpdateDispatcherJob UpdateDispatcherChunkedJobList::7848097235e0a464517e27093c1f2e5f job-list={"B#0##":true,"C#0##":true} requestId=d5d127e27c90c194b1bb3db3 (id=29518,timestamp=20180815122911) t=3 good
2018-08-15 12:29:11 SMW\UpdateJob B forcedUpdate=1 requestId=d5d127e27c90c194b1bb3db3 (id=29520,timestamp=20180815122911) STARTING
2018-08-15 12:29:11 SMW\UpdateJob B forcedUpdate=1 requestId=d5d127e27c90c194b1bb3db3 (id=29520,timestamp=20180815122911) t=146 good
2018-08-15 12:29:11 SMW\UpdateJob C forcedUpdate=1 requestId=d5d127e27c90c194b1bb3db3 (id=29521,timestamp=20180815122911) STARTING
2018-08-15 12:29:11 SMW\UpdateJob C forcedUpdate=1 requestId=d5d127e27c90c194b1bb3db3 (id=29521,timestamp=20180815122911) t=17 good

Steps to reproduce

Consider a semantic wiki with these pages:

  • Page A, containing text "Hello, world!"
  • Page B, containing text "[[Has pointer to::A]]"
  • Page C, containing text "[[Has pointer to::D]]"

'Has pointer to' is a property of type Page.

When deleting page A, the SMWUpdateDispatcherJob will select subjects (pages) that refer to A by means of a property, and insert an SMWUpdateJob for each subject found -- in this example: page B -- for a 'forcedUpdate'.

However, the code incorrectly selects both pages B and C. C has no relation to A and should not be updated.

bug

Most helpful comment

Some quick notes:

Thanks a lot for reporting! Pinging @mwjames for commenting.

When deleting page A, the SMWUpdateDispatcherJob will select subjects (pages) that refer to A by means of a property, and insert an SMWUpdateJob for each subject found -- in this example: page B -- for a 'forcedUpdate'.
However, the code incorrectly selects both pages B and C. C has no relation to A and should not be updated.

There are some notes on a ticked that was opened by Wikia in regards
to UpdateDispatcherJob and the switch to the restricted.disp.pool.

In short calling, getAllPropertySubjects is the best effort we make
to fetch all potential connected entities that may have been assigned
to the now deleted entity (connected by properties we identified
before the actual delete). Since the work is done in background (and
we want the work to be done in background anything else is not
scalable, see #948) by the time the job is executed no active reference exists
as ID for the deleted entity (as in your example A) so calling
getPropertySubjects wouldn't allow us to fetch a list of entities
here because we no longer are able to find those pairs of matching IDs
since the entity A is already removed (or better is marked outdated)
to avoid any active reuse as well to make sure it reflects the actual
state of the ArticleDelete action for A.

If you wanted to filter entities that are not connected to A you
would need to do some extra work by making a literal comparison
instead of an ID related (as in case of getPropertySubjects) on
those values for all potential involved subjects with something like
this:

@@ -4,10 +4,11 @@ namespace SMW\MediaWiki\Jobs;

 use Hooks;
 use SMW\ApplicationFactory;
 use SMW\DIProperty;
 use SMW\DIWikiPage;
+use SMW\DataTypeRegistry;
 use Title;

 /**
  * Dispatcher to find and create individual UpdateJob instances for a specific
  * subject and its linked entities.
@@ -131,12 +132,47 @@ class UpdateDispatcherJob extends JobBase {

            if ( !$property->isUserDefined() ) {
                continue;
            }

+           // Before doing some work, make sure to only use page type properties
+           // as a means to generate a resource (job) action
+           $type = DataTypeRegistry::getInstance()->getDataItemByType(
+               $property->findPropertyTypeId()
+           );
+
+           if ( $type !== \SMWDataItem::TYPE_WIKIPAGE ) {
+               continue;
+           }
+
+           $subject_list = [];
+
+           // Best effort to find all entities to a property that was an in-coming
+           // property for the deleted subject
+           $subjects = $this->store->getAllPropertySubjects( $property );
+
+           // Identify the source as bases for comparison
+           $source = DIWikiPage::newFromTitle( $this->getTitle() );
+
+           foreach ( $subjects as $subject ) {
+
+               // Investigate which subjects have an actual connection to the
+               // now deleted subject
+               $dataItems = $this->store->getPropertyValues( $subject, $property );
+
+               foreach ( $dataItems as $dataItem ) {
+
+                   // Make a judgment based on a literal comparison for the
+                   // values assigned and the now deleted entity
+                   if ( $dataItem instanceof DIWikiPage && $dataItem->equals( $source ) ) {
+                       $subject_list[] = $subject;
+                   }
+               }
+           }
+
            $this->addUniqueSubjectsToUpdateJobList(
-               $this->store->getAllPropertySubjects( $property )
+               $subject_list
            );
        }
    }

PS: I haven't tested this (means I might have missed something or
additional optimization is possible) and I cannot elaborate on any
performance impact for the additional selects especially on large
lists but the execution is done in background and in a separate
transaction so that it should be workable.

On 8/15/18, Karsten Hoffmeyer notifications@github.com wrote:

Thanks a lot for reporting! Pinging @mwjames for commenting.

--
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
https://github.com/SemanticMediaWiki/SemanticMediaWiki/issues/3322#issuecomment-413191110

All 9 comments

Thanks a lot for reporting! Pinging @mwjames for commenting.

Some quick notes:

Thanks a lot for reporting! Pinging @mwjames for commenting.

When deleting page A, the SMWUpdateDispatcherJob will select subjects (pages) that refer to A by means of a property, and insert an SMWUpdateJob for each subject found -- in this example: page B -- for a 'forcedUpdate'.
However, the code incorrectly selects both pages B and C. C has no relation to A and should not be updated.

There are some notes on a ticked that was opened by Wikia in regards
to UpdateDispatcherJob and the switch to the restricted.disp.pool.

In short calling, getAllPropertySubjects is the best effort we make
to fetch all potential connected entities that may have been assigned
to the now deleted entity (connected by properties we identified
before the actual delete). Since the work is done in background (and
we want the work to be done in background anything else is not
scalable, see #948) by the time the job is executed no active reference exists
as ID for the deleted entity (as in your example A) so calling
getPropertySubjects wouldn't allow us to fetch a list of entities
here because we no longer are able to find those pairs of matching IDs
since the entity A is already removed (or better is marked outdated)
to avoid any active reuse as well to make sure it reflects the actual
state of the ArticleDelete action for A.

If you wanted to filter entities that are not connected to A you
would need to do some extra work by making a literal comparison
instead of an ID related (as in case of getPropertySubjects) on
those values for all potential involved subjects with something like
this:

@@ -4,10 +4,11 @@ namespace SMW\MediaWiki\Jobs;

 use Hooks;
 use SMW\ApplicationFactory;
 use SMW\DIProperty;
 use SMW\DIWikiPage;
+use SMW\DataTypeRegistry;
 use Title;

 /**
  * Dispatcher to find and create individual UpdateJob instances for a specific
  * subject and its linked entities.
@@ -131,12 +132,47 @@ class UpdateDispatcherJob extends JobBase {

            if ( !$property->isUserDefined() ) {
                continue;
            }

+           // Before doing some work, make sure to only use page type properties
+           // as a means to generate a resource (job) action
+           $type = DataTypeRegistry::getInstance()->getDataItemByType(
+               $property->findPropertyTypeId()
+           );
+
+           if ( $type !== \SMWDataItem::TYPE_WIKIPAGE ) {
+               continue;
+           }
+
+           $subject_list = [];
+
+           // Best effort to find all entities to a property that was an in-coming
+           // property for the deleted subject
+           $subjects = $this->store->getAllPropertySubjects( $property );
+
+           // Identify the source as bases for comparison
+           $source = DIWikiPage::newFromTitle( $this->getTitle() );
+
+           foreach ( $subjects as $subject ) {
+
+               // Investigate which subjects have an actual connection to the
+               // now deleted subject
+               $dataItems = $this->store->getPropertyValues( $subject, $property );
+
+               foreach ( $dataItems as $dataItem ) {
+
+                   // Make a judgment based on a literal comparison for the
+                   // values assigned and the now deleted entity
+                   if ( $dataItem instanceof DIWikiPage && $dataItem->equals( $source ) ) {
+                       $subject_list[] = $subject;
+                   }
+               }
+           }
+
            $this->addUniqueSubjectsToUpdateJobList(
-               $this->store->getAllPropertySubjects( $property )
+               $subject_list
            );
        }
    }

PS: I haven't tested this (means I might have missed something or
additional optimization is possible) and I cannot elaborate on any
performance impact for the additional selects especially on large
lists but the execution is done in background and in a separate
transaction so that it should be workable.

On 8/15/18, Karsten Hoffmeyer notifications@github.com wrote:

Thanks a lot for reporting! Pinging @mwjames for commenting.

--
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
https://github.com/SemanticMediaWiki/SemanticMediaWiki/issues/3322#issuecomment-413191110

Hi,

Thanks for the quick response! I did some testing and the fix works as expected.

In my original usecase, I have properties that are used on >1000 pages, but rarely more than five of those point to the deleted page. So the dispatcher job may take longer to run but that's a small price for preventing more than a thousand unnecessary update jobs to be executed. We'll do some more testing on this.

We'll do some more testing on this.

That would be appreciated and if possible I'd like to close this issue with the next weekend sprint given a favourable feedback on the suggested changes.

We have had the fix running on a test wiki this week. It works and we haven't seen any issues so it's a thumbs up from my side.

We have had the fix running on a test wiki this week. It works and we haven't seen any issues so it's a thumbs up from my side.

Very well then, I'll merge #3340 into master and you can make a PR for the 2.5 branch based on that if you want to see this to appear with 2.5.8.

Very well then, I'll merge #3340 into master and you can make a PR for the 2.5 branch based on that if you want to see this to appear with 2.5.8.

If it is just a matter of cherry-picking I could do it. If it needs more work ...

If it is just a matter of cherry-picking I could do it.

Simply cherry-picking should work.

Simply cherry-picking should work.

Thanks for the info. Back-ported to 2.5.x with d9b3bcf8fcc707de40a5a4ee0f8283b02d195aa7

Was this page helpful?
0 / 5 - 0 ratings

Related issues

seth2740 picture seth2740  路  3Comments

alex-mashin picture alex-mashin  路  4Comments

mwjames picture mwjames  路  3Comments

mwjames picture mwjames  路  3Comments

jaideraf picture jaideraf  路  3Comments