2726 commits

Author SHA1 Message Date
Olivia Lee
6ec2f57a26
add changelog entry for global offline server backoff 2024-11-16 20:52:43 -08:00
Olivia Lee
b9118b1361
replace outgoing transaction backoff with global offline server backoff
The global backoff code is in `send_federation_transaction`, so we had
to switch to using this function instead of
`server_server::send_request` directly. This has the side effect of
introducing a timeout, which we previously didn't have for transactions.
2024-11-16 20:52:43 -08:00
Olivia Lee
b876dca45c
add option for which errors will cause a backoff for outgoing requests 2024-11-16 20:52:43 -08:00
Olivia Lee
3f82676a81
add log_errors parameter to send_federation_request builder
This allows us to use send_federation_request for federation
transactions.
2024-11-16 20:52:43 -08:00
Olivia Lee
10d361e347
add builder struct for sending::send_federation_request
This will allow us to add optional parameters for things like error
logging and timeouts without cluttering call sites that use the
default.
2024-11-16 20:52:43 -08:00
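
The builder approach described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the actual grapevine code: the type name `SendRequestBuilder`, its fields, and the default of logging errors are all assumptions made for the example, and the real builder would end in an async send call rather than exposing its fields.

```rust
use std::time::Duration;

/// Hypothetical builder for an outgoing federation request.
/// All names and defaults here are illustrative assumptions.
#[derive(Debug, Clone, PartialEq)]
struct SendRequestBuilder {
    destination: String,
    /// Assumed default: errors are logged unless the caller opts out.
    log_errors: bool,
    /// Assumed default: no explicit timeout override.
    timeout: Option<Duration>,
}

impl SendRequestBuilder {
    /// Call sites that want the defaults only need this constructor.
    fn new(destination: &str) -> Self {
        Self {
            destination: destination.to_owned(),
            log_errors: true,
            timeout: None,
        }
    }

    /// Optional parameter: whether failures should be logged.
    fn log_errors(mut self, log_errors: bool) -> Self {
        self.log_errors = log_errors;
        self
    }

    /// Optional parameter: per-request timeout.
    fn timeout(mut self, timeout: Duration) -> Self {
        self.timeout = Some(timeout);
        self
    }
}
```

A call site that needs the defaults stays as short as `SendRequestBuilder::new("example.org")`, while a caller such as the transaction sender could chain `.log_errors(false)` or `.timeout(...)` without changing anyone else's code.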
Olivia Lee
5bc3fce257
log message when BackoffGuard is dropped without recording result 2024-11-16 20:52:43 -08:00
Olivia Lee
56f025cb47
metrics for online and offline remote server count 2024-11-16 20:52:43 -08:00
Olivia Lee
5b6aaa19b9
log when servers switch between online and offline 2024-11-16 20:52:43 -08:00
avdb13
080fe5af42
feat: configurable federation backoff 2024-11-16 20:52:13 -08:00
Olivia Lee
e1637d5f9a
remove remote device key query backoff
This is handled by the server_backoff service now.

The previous implementation of backoff for remote device key queries
that we are removing had a bug where the failure counter was never reset
after a success. This caused grapevine to accumulate a larger error rate
for remote device key queries until it is restarted. This bug is not
present in the new global backoff implementation.
2024-11-16 20:13:11 -08:00
Olivia Lee
8001dcf2eb
backoff from offline servers in all outgoing federation requests
Only marking M_UNKNOWN errors as a hard failure if they are in the
standard error format is conservative, and might cause us to miss some
offline servers. For example, a server might configure a load balancer
to send a standard-looking { errcode: ..., ... } response when the
backend is down, with a custom errcode. We wouldn't catch this.

TODO: evaluate whether this comes up in practice by running the changes
on computer.surgery
2024-11-16 20:13:11 -08:00
Olivia Lee
9b22c9b40b
add service for tracking backoffs to offline servers
Currently we have some exponential backoff logic scattered in different
locations, with multiple distinct bad implementations. We should
centralize backoff logic in one place and actually do it correctly.

This backoff logic is similar to synapse's implementation[1], with a
couple fixes:

 - we wait until we observe 5 consecutive failures before we start
   delaying requests, to avoid being sensitive to a small fraction of
   failed requests on an otherwise healthy server.
 - synapse's implementation is kinda similar to our "only increment the
   failure count once per batch of concurrent requests" behavior, where
   they base the retry state written to the store on the state observed
   at the beginning of the request, rather than on the state observed at
   the end of the request. Their implementation has a bug, where a
   success will be ignored if a failure occurs in the same batch. We do
   not replicate this bug.

Our parameter choices are significantly less aggressive than synapse[2], which
starts at 10m delay, has a multiplier of 2, and saturates at 4d delay.

[1]: 70b0e38603/synapse/util/retryutils.py
[2]: 70b0e38603/synapse/config/federation.py (L83)
2024-11-16 20:13:09 -08:00
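
The threshold-then-exponential behavior described in the commit above can be sketched as follows. This is only an illustration under stated assumptions: the threshold of 5 consecutive failures comes from the commit message, but the `BackoffState` type, the base delay, the multiplier, and the maximum delay are placeholders invented for the example, and the per-batch failure counting is not modeled.

```rust
use std::time::Duration;

/// Hypothetical per-server backoff state; not grapevine's actual type.
#[derive(Debug, Default)]
struct BackoffState {
    /// Consecutive failures observed since the last success.
    failure_count: u32,
}

/// Taken from the commit message: no delay until 5 consecutive failures.
const FAILURE_THRESHOLD: u32 = 5;
/// The three values below are made-up placeholders for illustration.
const BASE_DELAY: Duration = Duration::from_secs(5);
const MULTIPLIER: u32 = 2;
const MAX_DELAY: Duration = Duration::from_secs(60 * 60);

impl BackoffState {
    /// A success resets the counter, so earlier failures are forgotten.
    fn record_success(&mut self) {
        self.failure_count = 0;
    }

    /// A failure extends the current run of consecutive failures.
    fn record_failure(&mut self) {
        self.failure_count = self.failure_count.saturating_add(1);
    }

    /// Delay to apply before the next request, or `None` while the
    /// server is still considered healthy.
    fn delay(&self) -> Option<Duration> {
        if self.failure_count < FAILURE_THRESHOLD {
            return None;
        }
        // Grow the delay exponentially past the threshold, saturating
        // instead of overflowing, then clamp to the maximum.
        let exponent = self.failure_count - FAILURE_THRESHOLD;
        let factor = MULTIPLIER.checked_pow(exponent).unwrap_or(u32::MAX);
        Some(BASE_DELAY.saturating_mul(factor).min(MAX_DELAY))
    }
}
```

Resetting the counter in `record_success` is the step the removed device key query backoff was missing, according to the later commit e1637d5f9a.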
Charles Hall
93c714a199
update changelog 2024-11-12 00:36:36 -08:00
Charles Hall
51b30d9ba3
largely stop using RoomCreateEventContent
This became a problem because #foundation-office:matrix.org has a
malformed create event with its `predecessor` set to a string instead of
a map.

The solution to this is, unfortunately, to do more shotgun parsing to
extract only the desired fields rather than trying to parse the entire
content every time. To prevent this kind of problem from happening
again, `RoomCreateEventContent` must only be used for creating new PDUs;
existing PDUs must be shotgun-parsed.
2024-11-08 20:36:46 -08:00
Charles Hall
9d0cf428a5
stop passing the entire create event around
This gets rid of 3 instances of re-parsing the room version.

There's one place where we need the event ID of the room create event to
verify federation responses, so now we just look up the event ID at that
point instead.
2024-11-08 20:36:07 -08:00
Charles Hall
a4e1522875
generalize get_room_version
There are other fields of `m.room.create` events that are useful to
individually extract without caring about the values of other fields.
2024-11-08 18:38:16 -08:00
Charles Hall
c9c30fba30
upgrade to latest ruma 2024-11-08 17:26:44 -08:00
Charles Hall
2dbb101140
upgrade to latest rust-rocksdb
We gotta overrideAttrs to set the src and version to a newer version
than nixpkgs has now.
2024-11-08 17:26:38 -08:00
Charles Hall
d565b22da9
run cargo upgrade && cargo update
There are a few available incompatible updates not covered here:

* RocksDB
* Ruma
* OTel stuff
2024-11-07 11:36:13 -08:00
Charles Hall
e74c8687f5
drop dependency on once-cell 2024-11-07 11:36:13 -08:00
Charles Hall
a550d8db1f
assert that rocksdb versions match during eval 2024-11-07 11:36:13 -08:00
Charles Hall
9f4a1578aa
move rocksdb to its own file 2024-11-07 11:36:13 -08:00
Charles Hall
2e6bf86a42
move rust-jemalloc-sys to its own file 2024-11-07 11:36:13 -08:00
Charles Hall
f07c8c2b6f
update flake.lock
And some supporting changes:

* crane: It removed its dependency on nixpkgs and made overrideToolchain
  take a function for splicing reasons, but we're doing splicing
  ourselves so we can just ignore the function argument. These changes
  are in `flake.nix`.
* [NixOS/nixpkgs#347228][0]: linkerFor* were removed because the linker
  no longer needs to be different in some edge cases. Based on the
  diff of the PR that introduced this change, ccFor* are the proper
  replacements. These changes are in `cross-compilation-env.nix` in the
  compiler-and-linker-choosing section.
* [NixOS/nixpkgs#350299][1]: buildPlatform isn't at the top level
  anymore, we have to go through stdenv now. These changes are in
  `nix/shell.nix`.
* rocksdb: nixpkgs has 9.6.1 now so we need to upgrade our rust
  library to use the matching version. These changes are in
  `Cargo.toml`, `Cargo.lock`, `nix/pkgs/default/default.nix`, and
  `cross-compilation-env.nix` in the linker flags section.

[0]: https://github.com/NixOS/nixpkgs/pull/347228
[1]: https://github.com/NixOS/nixpkgs/pull/350299

Flake lock file updates:

• Updated input 'attic':
    'github:zhaofengli/attic/4dbdbee45728d8ce5788db6461aaaa89d98081f0' (2024-03-29)
  → 'github:zhaofengli/attic/48c8b395bfbc6b76c7eae74df6c74351255a095c' (2024-10-30)
• Updated input 'attic/crane':
    'github:ipetkov/crane/7195c00c272fdd92fc74e7d5a0a2844b9fadb2fb' (2023-12-18)
  → 'github:ipetkov/crane/4c6c77920b8d44cd6660c1621dea6b3fc4b4c4f4' (2024-08-06)
• Updated input 'attic/flake-compat':
    'github:edolstra/flake-compat/35bb57c0c8d8b62bbfd284272c928ceb64ddbde9' (2023-01-17)
  → 'github:edolstra/flake-compat/0f9255e01c2351cc7d116c072cb317785dd33b33' (2023-10-04)
• Added input 'attic/flake-parts':
    'github:hercules-ci/flake-parts/8471fe90ad337a8074e957b69ca4d0089218391d' (2024-08-01)
• Added input 'attic/flake-parts/nixpkgs-lib':
    follows 'attic/nixpkgs'
• Removed input 'attic/flake-utils'
• Updated input 'attic/nixpkgs':
    'github:NixOS/nixpkgs/07262b18b97000d16a4bdb003418bd2fb067a932' (2024-03-25)
  → 'github:NixOS/nixpkgs/159be5db480d1df880a0135ca0bfed84c2f88353' (2024-09-11)
• Updated input 'attic/nixpkgs-stable':
    'github:NixOS/nixpkgs/44733514b72e732bd49f5511bd0203dea9b9a434' (2024-03-26)
  → 'github:NixOS/nixpkgs/797f7dc49e0bc7fab4b57c021cdf68f595e47841' (2024-08-22)
• Added input 'attic/nix-github-actions':
    'github:nix-community/nix-github-actions/e04df33f62cdcf93d73e9a04142464753a16db67' (2024-10-24)
• Added input 'attic/nix-github-actions/nixpkgs':
    follows 'attic/nixpkgs'
• Updated input 'crane':
    'github:ipetkov/crane/109987da061a1bf452f435f1653c47511587d919' (2024-05-24)
  → 'github:ipetkov/crane/498d9f122c413ee1154e8131ace5a35a80d8fa76' (2024-10-27)
• Removed input 'crane/nixpkgs'
• Updated input 'fenix':
    'github:nix-community/fenix/b6fc5035b28e36a98370d0eac44f4ef3fd323df6' (2024-05-22)
  → 'github:nix-community/fenix/87b4d20f896c99018dde4702a9c6157b516f2a76' (2024-11-01)
• Updated input 'fenix/rust-analyzer-src':
    'github:rust-lang/rust-analyzer/21ec8f523812b88418b2bfc64240c62b3dd967bd' (2024-05-19)
  → 'github:rust-lang/rust-analyzer/0ba893e1a00d92557ac91efb771d72eee36ca687' (2024-10-31)
• Updated input 'flake-utils':
    'github:numtide/flake-utils/b1d9ab70662946ef0850d488da1c9019f3a9752a' (2024-03-11)
  → 'github:numtide/flake-utils/c1dfcf08411b08f6b8615f7d8971a2bfa81d5e8a' (2024-09-17)
• Updated input 'nix-filter':
    'github:numtide/nix-filter/3342559a24e85fc164b295c3444e8a139924675b' (2024-03-11)
  → 'github:numtide/nix-filter/776e68c1d014c3adde193a18db9d738458cd2ba4' (2024-10-29)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/5710852ba686cc1fd0d3b8e22b3117d43ba374c2' (2024-05-21)
  → 'github:NixOS/nixpkgs/807e9154dcb16384b1b765ebe9cd2bba2ac287fd' (2024-10-29)
2024-11-07 11:35:48 -08:00
Charles Hall
d42a5ec1f0
avoid overhead when cache sizes are zero
Don't even try taking locks, inserting or removing anything, etc.
2024-11-01 14:25:52 -07:00
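
One way to get the effect described in the commit above is to not construct the lock at all when the configured size is zero. The sketch below assumes that approach; `OptionalCache` is a hypothetical name, and the arbitrary-entry eviction is a stand-in for whatever policy the real caches use.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::Mutex;

/// Hypothetical cache wrapper, for illustration only.
struct OptionalCache<K, V> {
    /// `None` when the capacity is zero, so there is no lock to take.
    inner: Option<Mutex<HashMap<K, V>>>,
    capacity: usize,
}

impl<K: Eq + Hash + Clone, V: Clone> OptionalCache<K, V> {
    fn new(capacity: usize) -> Self {
        let inner = (capacity > 0).then(|| Mutex::new(HashMap::new()));
        Self { inner, capacity }
    }

    /// Returns early without locking when the cache is disabled.
    fn get(&self, key: &K) -> Option<V> {
        let map = self.inner.as_ref()?.lock().unwrap();
        map.get(key).cloned()
    }

    /// Does nothing at all when the cache is disabled.
    fn insert(&self, key: K, value: V) {
        let Some(inner) = &self.inner else { return };
        let mut map = inner.lock().unwrap();
        if map.len() >= self.capacity && !map.contains_key(&key) {
            // Placeholder eviction: drop an arbitrary existing entry.
            let victim = map.keys().next().cloned();
            if let Some(victim) = victim {
                map.remove(&victim);
            }
        }
        map.insert(key, value);
    }
}
```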
Charles Hall
1148c6004f
make all caches individually configurable
Also:

* Removes the `cache_capacity_modifier` option
* Renames the `pdu_cache_capacity` option to `cache.pdu`
2024-11-01 14:25:52 -07:00
Charles Hall
9fab7538a0
scale roomid_spacechunk_cache by modifier
Not scaling this was probably unintentional.
2024-11-01 14:25:52 -07:00
Charles Hall
4083451a10
rename Services constructor 2024-11-01 14:25:52 -07:00
Charles Hall
fefc84e8c7
rename constructor for globals service 2024-11-01 14:25:52 -07:00
Charles Hall
55b605f046
rename constructor for sending service 2024-11-01 14:25:52 -07:00
Charles Hall
4faa8ea37c
rename constructor for admin service 2024-11-01 14:25:52 -07:00
Charles Hall
e06d126d4e
add constructor for users service 2024-11-01 14:25:52 -07:00
Charles Hall
f771d319b2
add constructor for spaces service
Also adds a public function to invalidate the cache rather than exposing
the entire cache publicly.
2024-11-01 14:25:52 -07:00
Charles Hall
f702b6cccd
add constructor for state compressor service 2024-11-01 14:25:52 -07:00
Charles Hall
c6e2f8372c
add constructor for state accessor service 2024-11-01 14:25:52 -07:00
Charles Hall
a083ff9200
add constructor for lazy-loading service 2024-11-01 14:25:52 -07:00
Charles Hall
3b28d0cfda
add constructor for typing service 2024-11-01 14:25:52 -07:00
Charles Hall
b18df8de70
rename appservice service constructor 2024-11-01 14:25:52 -07:00
Charles Hall
5be1e20eb4
call maximize_fd_limit at top of main
This way we don't shoot ourselves in the foot by forgetting to do it for
other subcommands (e.g. that manipulate the database) in the future.
2024-11-01 13:16:36 -07:00
Andreas Fuchs
9529d43a21 ChangeLog entry for check-config subcmd 2024-11-01 12:16:49 -04:00
Andreas Fuchs
dcf64f03fb Validate generated config file in the nixos module
This uses the usual pkgs.runCommand pattern to ensure that no
non-parseable config files can make it to the command line.
2024-11-01 12:10:01 -04:00
Andreas Fuchs
a02c551a5e Disallow any unknown fields in configuration files
This will break backwards compatibility of configurations, but
ensures that a previously-configured setting won't get dropped
arbitrarily. Pretty much worth it, I think.
2024-11-01 12:09:58 -04:00
Andreas Fuchs
26ba489aa3 Add a "check-config" command to validate config files & tests for it 2024-11-01 12:08:17 -04:00
Lambda
70ee206031 Extract source address for requests 2024-10-25 20:48:38 +00:00
Lambda
3247c64cd8 Add support for HAProxy proxy protocol for listeners 2024-10-25 20:48:38 +00:00
Lambda
99f3e2aecd Refactor server listener spawning 2024-10-25 20:47:04 +00:00
Charles Hall
86481fd651
make reload_handles optional for creating Services
This will be useful for instantiating services in CLI subcommands, which
have different requirements around observability.
2024-10-25 11:27:11 -07:00
Charles Hall
b03c2a15b3
add observability infrastructure for cli subcmds 2024-10-25 11:27:11 -07:00
K900
b93c39ee93 fix: use non-alias name in nixos module 2024-10-24 14:46:44 +00:00
Charles Hall
ce7efc1eff
move lasttimelinecount_cache to service 2024-10-20 13:29:33 -07:00