Synapse: Cross-signing signatures not being always federated correctly

Created on 5 May 2020  路  14Comments  路  Source: matrix-org/synapse

Over the past couple of days I've seen a few occurrences of people on other servers verifying a new device, but my server not receiving the resulting signature, thus leaving them with a red shield from my pov. I've also seen people saying they were seeing others with a red shield whereas I would see them with a green one.

@bwindels had a look at my Riot logs upon failing to see one of @anoadragon453's devices as verified, and could see that it was indeed missing a signature, with the signatures on Andrew's device being:

"signatures": {
  "@andrewm:amorgan.xyz": {
    "ed25519:ZYMXYYQZTP": "aBAnNnO3Qzz8xMLXtSjnwDummc4F4vewygUBvdRKJWd2/326Se4knYkJkp/TKa0sXjeGKIUko8JXNorN2m9TDw",
    "ed25519:QjSD8srN17RiDzIBgzVbncj+NMdvDHRY4N2b8w+oq9Y": "FV5bCKEdp1q9SFvy7MzUgTj3SzAeHwuK4vvtgnIp/WFxdnYQ4XL2C64dW28oJsK4yJ95NkuFcL9uCTPaDKPoAw"
  }
}

and the ones my Riot would see for that same device being:

"signatures": {
  "@andrewm:amorgan.xyz": {
    "ed25519:ZYMXYYQZTP": "aBAnNnO3Qzz8xMLXtSjnwDummc4F4vewygUBvdRKJWd2/326Se4knYkJkp/TKa0sXjeGKIUko8JXNorN2m9TDw"
  }
}

Looking at the device_lists_remote_cache on my homeserver's database, I can see that it's indeed missing the signature from Andrew's self-signing key (ed25519:QjSD8srN17RiDzIBgzVbncj+NMdvDHRY4N2b8w+oq9Y).

https://github.com/matrix-org/riot-web-rageshakes/issues/2740#issuecomment-623992322 provides more info about that specific occurrence.

bug cross-signing p2 1

All 14 comments

Since this issue was opened, Andrew connected a new device to his account and verified it from Riot Web. I can now see ZYMXYYQZTP as verified, but not the new device.

afau two things could be happening here:

  • either we're using the wrong stream ID to send device updates and we're one ID short of the entry containing the new signature
  • or we don't correctly update the devices cache upon receiving an update

Right, I _think_ I've figured out what's happening here.

I've noticed that Andrew's device_lists_outbound_last_success table had a higher stream ID for my homeserver than the one mine had in its device_lists_remote_extremeties table. This likely means that the latest update was sent to my homeserver, that something happened while my homeserver was processing the update, causing it to abort it, but it still returned 200 to the Andrew's HS.

So I think the explanation is the following: when server A sends out a device list update to server B, server B will look at the updates' stream ids to see if it has missed any, if it did it will try to get the missing keys via /_matrix/federation/v1/user/devices/....

If that request fails, server B will:

  1. Not retry it (as pointed out by this comment)
  2. Abort the update (because otherwise Synapse forgets that it's out of sync and will never care about updates it might have missed)
  3. Return a 200 to server A (because it's returning here rather than raising an exception), causing server A to record that update as successful and not resend it ever

In that specific issue it looks like abolivier.bzh's /devices/... requests to amorgan.xyz timed out, which produced this failure mode:

2020-05-04 18:15:50,575 - synapse.federation.transport.server - 406 - INFO - PUT-1487720- Received txn 1588613504503 from amorgan.xyz. (PDUs: 0, EDUs: 1)
2020-05-04 18:15:50,581 - synapse.handlers.device - 625 - INFO - PUT-1487720- Received device list update for @andrewm:amorgan.xyz, requiring resync: True. Devices: ZYMXYYQZTP
2020-05-04 18:15:50,582 - synapse.http.matrixfederationclient - 408 - INFO - PUT-1487720- {GET-O-118918} [amorgan.xyz] Sending request: GET matrix://amorgan.xyz/_matrix/federation/v1/user/devices/%40andrewm%3Aamorgan.xyz; timeout 30.000000s
2020-05-04 18:15:50,584 - synapse.http.federation.matrix_federation_agent - 242 - INFO - PUT-1487720- Connecting to matrix.amorgan.xyz:8448
2020-05-04 18:16:20,588 - synapse.http.matrixfederationclient - 434 - INFO - PUT-1487720- Failed to send request: [<twisted.python.failure.Failure twisted.internet.defer.CancelledError: >]
2020-05-04 18:16:20,589 - synapse.http.matrixfederationclient - 491 - WARNING - PUT-1487720- {GET-O-118918} [amorgan.xyz] Request failed: GET matrix://amorgan.xyz/_matrix/federation/v1/user/devices/%40andrewm%3Aamorgan.xyz: ResponseNeverReceived:[CancelledError()]
2020-05-04 18:16:20,589 - synapse.util.retryutils - 218 - INFO - PUT-1487720- Connection to amorgan.xyz was unsuccessful (<class 'synapse.api.errors.RequestSendFailed'>(Failed to send request: ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.defer.CancelledError: >])); backoff now 600000
2020-05-04 18:16:20,590 - synapse.handlers.device - 696 - WARNING - PUT-1487720- Failed to handle device list update for @andrewm:amorgan.xyz
2020-05-04 18:16:20,601 - synapse.access.http.8008 - 302 - INFO - PUT-1487720- 127.0.0.1 - 8008 - {amorgan.xyz} Processed request: 30.028sec/0.001sec (0.011sec, 0.001sec) (0.004sec/0.008sec/2) 12B 200 "PUT /_matrix/federation/v1/send/1588613504503 HTTP/1.1" "Synapse/1.12.4" [0 dbevts]

Thus server B is stuck with an outdated devices list that it'll never try to refresh, except if server A sends another update and isn't under too much load that it wouldn't time out on a /devices/... request at that specific time.

The correct fix here would either be to fail the whole transaction (which would also fail the PDUs so might not be the correct one) or have server B still send a 200 back and retry /devices/... periodically (which would give server A a wrong idea of the updates server B has successfully processed).

I believe the latter solution might be the most preferable one. I'm unfamiliar with the existing backoff/retry mechanism, so I might be forcing through an open door here, but we should absolutely persist that retry schedule in the database otherwise just restarting Synapse could lead to device lists staying out of sync.

woo, thanks for hunting this down. device list sync is the most wobbly bit of synapse, imo :(

device list sync is the most wobbly bit of synapse, imo :(

Agreed. Hopefully https://github.com/matrix-org/synapse/pull/7453 will make it way less wobbly.

Not sure if this is related, but I recently set up a new homeserver B and am missing everyone's master_keys and self_signing_keys except for people who have since set up their E2E cross signing keys.

When I compared a sample user in device_lists_remote_extremeties I found that their stream_id was much larger (59005256) than on my long-running homeserver A (26308567). Unfortunately I can't check the other end (as it's matrix.org).

I tested checking out #7453 and inserted their user_id into device_lists_remote_resync but after it was removed from the table I still didn't have master_keys or self_signing_keys for the user however I did notice a timeout entry for the user_id in the log:

2020-05-12 22:04:53,641 - synapse.http.matrixfederationclient - 408 - INFO -  - {GET-O-3986} [matrix.org] Sending request: GET matrix://matrix.org/_matrix/federation/v1/user/devices/<user_id>; timeout 30.000000s
2020-05-12 22:04:53,651 - synapse.handlers.presence - 343 - INFO - persist_presence_changes-1 - Persisting 3 unpersisted presence updates
2020-05-12 22:04:53,809 - synapse.http.matrixfederationclient - 164 - INFO -  - {GET-O-3986} [matrix.org] Completed: 200 OK
2020-05-12 22:04:53,810 - synapse.storage.database - 527 - WARNING -  - Starting db txn 'update_remote_device_list_cache' from sentinel context
2020-05-12 22:04:53,810 - synapse.storage.database - 566 - WARNING -  - Starting db connection from sentinel context: metrics will be lost

Not sure if your problem is the same issue - however thanks to your logs it's obvious that I didn't do the right thing wrt logging contexts in #7453, I'll try to fix that soon.

Glad I could help. I am currently stuck in this situation of missing many people's signing keys, let me know if there's other things I can check.

If you determine it's not the same we should reopen #7350.

I checked with one of my friend's running a homeserver. My server does not exist in their device_lists_outbound_last_success table, but their server's users do exist in my device_lists_remote_extremeties, though their keys aren't on my homeserver.

Is there any way to safely insert the keys into the database or force a refresh?

Hey, so after thinking about it, I don't think your issue share a common cause with this one (nor with #7350). I've opened https://github.com/matrix-org/synapse/issues/7504 to track its progress.

Is there any way to safely insert the keys into the database or force a refresh?

The only way I see of forcing a refresh would be to run https://github.com/matrix-org/synapse/pull/7453 and do the insertion mentioned at https://github.com/matrix-org/synapse/pull/7453#issuecomment-625824904, but if that doesn't solve it I can't think of anything else without more investigation.

Is there any way to safely insert the keys into the database or force a refresh?

The only way I see of forcing a refresh would be to run #7453 and do the insertion mentioned at #7453 (comment), but if that doesn't solve it I can't think of anything else without more investigation.

Thanks, sadly it didn't seem to solve the issue though I'll try again when this lands. I'm a little hesitant to run development versions of synapse against my regular HS since it seems to perform a non-reversible upgrade to the database (which I'm guessing could also get federation out of sync when I revert the db) and I'd like to be able to run the stable version ongoing, however if there are other things you'd like me to try I could set up a development HS. Feel free to reach out to me, @flackr:serializer.ca for higher bandwidth discussion.

Will do when I'll get to investigating it :+1:

For people getting bitten by this issue after 1.14.0 is out, or people seeing users with stale device lists from before 1.14.0, just run the following SQL in Synapse's database:

INSERT INTO device_lists_remote_resync
VALUES ('USER_ID', (EXTRACT(epoch FROM NOW()) * 1000)::BIGINT);

where USER_ID is the Matrix ID of the user whose device list has gone stale. I've tried this successfully on PostgreSQL, not yet on SQLite.

Was this page helpful?
0 / 5 - 0 ratings