add service for tracking backoffs to offline servers

Currently we have some exponential backoff logic scattered in different
locations, with multiple distinct bad implementations. We should
centralize backoff logic in one place and actually do it correctly.

This backoff logic is similar to synapse's implementation[1], with a
couple fixes:

 - we wait until we observe 5 consecutive failures before we start
   delaying requests, to avoid being sensitive to a small fraction of
   failed requests on an otherwise healthy server.
 - synapse's implementation is similar to our "only increment the
   failure count once per batch of concurrent requests" behavior, in
   that they base the retry state written to the store on the state
   observed at the beginning of the request, rather than on the state
   observed at the end of the request. Their implementation has a bug,
   where a success will be ignored if a failure occurs in the same
   batch. We do not replicate this bug.
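
The two points above can be sketched roughly as follows. This is an
illustration only, not the code added by this commit: the type and method
names, the generation counter used to identify a batch, and the
BASE_DELAY, MULTIPLIER and MAX_DELAY values are all assumptions. Only the
threshold of 5 consecutive failures comes from the description above.

```rust
use std::time::Duration;

// Only the threshold of 5 is taken from the commit message.
const FAILURE_THRESHOLD: u32 = 5;
// Hypothetical values, chosen purely for illustration.
const BASE_DELAY: Duration = Duration::from_secs(5);
const MULTIPLIER: u32 = 2;
const MAX_DELAY: Duration = Duration::from_secs(60 * 60);

/// Hypothetical per-server backoff state.
#[derive(Debug, Default)]
struct BackoffState {
    /// Number of consecutive failures counted so far.
    failure_count: u32,
    /// Bumped whenever the state changes. Requests that started while the
    /// same generation was current belong to the same batch.
    generation: u64,
}

impl BackoffState {
    /// Snapshot taken when a request starts.
    fn begin(&self) -> u64 {
        self.generation
    }

    /// A success always resets the state, even if another request in the
    /// same batch has already recorded a failure.
    fn record_success(&mut self) {
        self.failure_count = 0;
        self.generation += 1;
    }

    /// A failure is counted only if nothing else has updated the state
    /// since this request started, so a batch of concurrent failures
    /// increments the count once.
    fn record_failure(&mut self, started_at: u64) {
        if started_at == self.generation {
            self.failure_count += 1;
            self.generation += 1;
        }
    }

    /// Returns `None` below the failure threshold, otherwise an
    /// exponentially growing delay capped at `MAX_DELAY`.
    fn delay(&self) -> Option<Duration> {
        let excess = self.failure_count.checked_sub(FAILURE_THRESHOLD)?;
        let factor = MULTIPLIER.checked_pow(excess).unwrap_or(u32::MAX);
        Some(BASE_DELAY.saturating_mul(factor).min(MAX_DELAY))
    }
}
```

In this sketch a failure reported by a request that started before an
intervening success is dropped as stale. That is one possible reading of
the batch rule, not necessarily the behavior implemented here.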

Our parameter choices are significantly less aggressive than synapse's[2],
which start at a 10m delay, use a multiplier of 2, and saturate at a 4d delay.
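
For a sense of scale, the synapse figures quoted above can be turned into
a schedule. The function name and the step indexing (step 0 being the
first delay) are assumptions made for this sketch. Only the 10m, x2 and 4d
values come from the text.

```rust
use std::time::Duration;

// Values quoted above for synapse's configuration.
const SYNAPSE_MIN: Duration = Duration::from_secs(10 * 60);
const SYNAPSE_MAX: Duration = Duration::from_secs(4 * 24 * 60 * 60);
const SYNAPSE_MULTIPLIER: u32 = 2;

/// Delay after the initial delay has been multiplied `step` times,
/// capped at the maximum. Step 0 is the initial 10m delay.
fn synapse_delay(step: u32) -> Duration {
    let factor = SYNAPSE_MULTIPLIER.checked_pow(step).unwrap_or(u32::MAX);
    SYNAPSE_MIN.saturating_mul(factor).min(SYNAPSE_MAX)
}
```

Under these assumptions the delay first reaches the 4d cap at step 10,
since 10m * 2^9 is still below 4d while 10m * 2^10 exceeds it.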

[1]: 70b0e38603/synapse/util/retryutils.py
[2]: 70b0e38603/synapse/config/federation.py (L83)
Olivia Lee 2024-08-10 23:32:01 -07:00
parent 93c714a199
commit 9b22c9b40b
3 changed files with 254 additions and 1 deletions

@@ -12,6 +12,7 @@ pub(crate) mod pdu;
pub(crate) mod pusher;
pub(crate) mod rooms;
pub(crate) mod sending;
pub(crate) mod server_backoff;
pub(crate) mod transaction_ids;
pub(crate) mod uiaa;
pub(crate) mod users;
@@ -35,6 +36,7 @@ pub(crate) struct Services {
pub(crate) globals: globals::Service,
pub(crate) key_backups: key_backups::Service,
pub(crate) media: media::Service,
pub(crate) server_backoff: Arc<server_backoff::Service>,
pub(crate) sending: Arc<sending::Service>,
}
@@ -120,6 +122,7 @@ impl Services {
media: media::Service {
db,
},
server_backoff: server_backoff::Service::build(),
sending: sending::Service::new(db, &config),
globals: globals::Service::new(db, config, reload_handles)?,