The global backoff code is in `send_federation_transaction`, so we had
to switch to using this function instead of
`server_server::send_request` directly. This has the side effect of
introducing a timeout, which we previously didn't have for transactions.
This is handled by the server_backoff service now.
The previous implementation of backoff for remote device key queries
that we are removing had a bug where the failure counter was never reset
after a success. This caused grapevine to accumulate a larger error rate
for remote device key queries until it is restarted. This bug is not
present in the new global backoff implementation.
Only marking M_UNKNOWN errors as a hard failure if they are in the
standard error format is conservative, and might cause us to miss some
offline servers. For example, a server might configure a load balancer
to send a standard-looking { errcode: ..., ... } response when the
backend is down, with a custom errcode. We wouldn't catch this.
TODO: evaluate whether this comes up in practice by running the changes
on computer.surgery
Currently we have some exponential backoff logic scattered in different
locations, with multiple distinct bad implementations. We should
centralize backoff logic in one place and actually do it correctly.
This backoff logic is similar to synapse's implementation[1], with a
couple fixes:
- we wait until we observe 5 consecutive failures before we start
delaying requests, to avoid being sensitive to a small fraction of
failed requests on an otherwise healthy server.
- synapse's implementation is kinda similar to our "only increment the
failure count once per batch of concurrent requests" behavoir, where
they base the retry state written to the store on the state observed
at the beginning of the request, rather on the state observed at the
end of the request. Their implementation has a bug, where a success
will be ignored if a failure occurs in the same batch. We do not
replicate this bug.
Our parameter choices are significantly less aggressive than synapse[2], which
starts at 10m delay, has a multiplier of 2, and saturates at 4d delay.
[1]: 70b0e38603/synapse/util/retryutils.py
[2]: 70b0e38603/synapse/config/federation.py (L83)
This became a problem because #foundation-office:matrix.org has a
malformed create event with its `predecessor` set to a string instead of
a map.
The solution to this is, unfortunately, to do more shotgun parsing to
extract only the desired fields rather than trying to parse the entire
content every time. To prevent this kind of problem from happening
again, `RoomCreateEventContent` must only be used for creating new PDUs,
existing PDUs must be shotgun-parsed.
This gets rid of 3 instances of re-parsing the room version.
There's one place where we need the event ID of the room create event to
verify federation responses, so now we just look up the event ID at that
point instead.
And some supporting changes:
* crane: It removed its dependency on nixpkgs and made overrideToolchain
take a function for splicing reasons, but we're doing splicing
ourselves so we can just ignore the function argument. These changes
are in `flake.nix`.
* [NixOS/nixpkgs#347228][0]: linkerFor* were removed because the linker
no longer needs to be different in some edge cases. Based on the
diff of the PR that introduced this change, ccFor* are the proper
replacements. These changes are in `cross-compilation-env.nix` in the
compiler-and-linker-choosing section.
* [NixOS/nixpkgs#350299][1]: buildPlatform isn't at the top level
anymore, we have to go through stdenv now. These changes are in
`nix/shell.nix`.
* rocksdb: nixpkgs has 9.6.1 now so we need to upgrade our rust
library to use the matching version. These changes are in
`Cargo.toml`, `Cargo.lock`, `nix/pkgs/default/default.nix`, and
`cross-compilation-env.nix` in the linker flags section.
[0]: https://github.com/NixOS/nixpkgs/pull/347228
[1]: https://github.com/NixOS/nixpkgs/pull/350299
Flake lock file updates:
• Updated input 'attic':
'github:zhaofengli/attic/4dbdbee45728d8ce5788db6461aaaa89d98081f0' (2024-03-29)
→ 'github:zhaofengli/attic/48c8b395bfbc6b76c7eae74df6c74351255a095c' (2024-10-30)
• Updated input 'attic/crane':
'github:ipetkov/crane/7195c00c272fdd92fc74e7d5a0a2844b9fadb2fb' (2023-12-18)
→ 'github:ipetkov/crane/4c6c77920b8d44cd6660c1621dea6b3fc4b4c4f4' (2024-08-06)
• Updated input 'attic/flake-compat':
'github:edolstra/flake-compat/35bb57c0c8d8b62bbfd284272c928ceb64ddbde9' (2023-01-17)
→ 'github:edolstra/flake-compat/0f9255e01c2351cc7d116c072cb317785dd33b33' (2023-10-04)
• Added input 'attic/flake-parts':
'github:hercules-ci/flake-parts/8471fe90ad337a8074e957b69ca4d0089218391d' (2024-08-01)
• Added input 'attic/flake-parts/nixpkgs-lib':
follows 'attic/nixpkgs'
• Removed input 'attic/flake-utils'
• Updated input 'attic/nixpkgs':
'github:NixOS/nixpkgs/07262b18b97000d16a4bdb003418bd2fb067a932' (2024-03-25)
→ 'github:NixOS/nixpkgs/159be5db480d1df880a0135ca0bfed84c2f88353' (2024-09-11)
• Updated input 'attic/nixpkgs-stable':
'github:NixOS/nixpkgs/44733514b72e732bd49f5511bd0203dea9b9a434' (2024-03-26)
→ 'github:NixOS/nixpkgs/797f7dc49e0bc7fab4b57c021cdf68f595e47841' (2024-08-22)
• Added input 'attic/nix-github-actions':
'github:nix-community/nix-github-actions/e04df33f62cdcf93d73e9a04142464753a16db67' (2024-10-24)
• Added input 'attic/nix-github-actions/nixpkgs':
follows 'attic/nixpkgs'
• Updated input 'crane':
'github:ipetkov/crane/109987da061a1bf452f435f1653c47511587d919' (2024-05-24)
→ 'github:ipetkov/crane/498d9f122c413ee1154e8131ace5a35a80d8fa76' (2024-10-27)
• Removed input 'crane/nixpkgs'
• Updated input 'fenix':
'github:nix-community/fenix/b6fc5035b28e36a98370d0eac44f4ef3fd323df6' (2024-05-22)
→ 'github:nix-community/fenix/87b4d20f896c99018dde4702a9c6157b516f2a76' (2024-11-01)
• Updated input 'fenix/rust-analyzer-src':
'github:rust-lang/rust-analyzer/21ec8f523812b88418b2bfc64240c62b3dd967bd' (2024-05-19)
→ 'github:rust-lang/rust-analyzer/0ba893e1a00d92557ac91efb771d72eee36ca687' (2024-10-31)
• Updated input 'flake-utils':
'github:numtide/flake-utils/b1d9ab70662946ef0850d488da1c9019f3a9752a' (2024-03-11)
→ 'github:numtide/flake-utils/c1dfcf08411b08f6b8615f7d8971a2bfa81d5e8a' (2024-09-17)
• Updated input 'nix-filter':
'github:numtide/nix-filter/3342559a24e85fc164b295c3444e8a139924675b' (2024-03-11)
→ 'github:numtide/nix-filter/776e68c1d014c3adde193a18db9d738458cd2ba4' (2024-10-29)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/5710852ba686cc1fd0d3b8e22b3117d43ba374c2' (2024-05-21)
→ 'github:NixOS/nixpkgs/807e9154dcb16384b1b765ebe9cd2bba2ac287fd' (2024-10-29)
This will break backwards compatibility of configurations, but
ensures that a previously-configured setting won't get dropped
arbitrarily. Pretty much worth it, I think.