5 commits

Author SHA1 Message Date
Olivia Lee
5bc3fce257
log message when BackoffGuard is dropped without recording result 2024-11-16 20:52:43 -08:00
Olivia Lee
56f025cb47
metrics for online and offline remote server count 2024-11-16 20:52:43 -08:00
Olivia Lee
5b6aaa19b9
log when servers switch between online and offline 2024-11-16 20:52:43 -08:00
avdb13
080fe5af42
feat: configurable federation backoff 2024-11-16 20:52:13 -08:00
Olivia Lee
9b22c9b40b
add service for tracking backoffs to offline servers
Currently we have some exponential backoff logic scattered in different
locations, with multiple distinct bad implementations. We should
centralize backoff logic in one place and actually do it correctly.

This backoff logic is similar to synapse's implementation[1], with a
couple of fixes:

 - we wait until we observe 5 consecutive failures before we start
   delaying requests, to avoid being sensitive to a small fraction of
   failed requests on an otherwise healthy server.
 - synapse's implementation is kinda similar to our "only increment the
   failure count once per batch of concurrent requests" behavior, where
   they base the retry state written to the store on the state observed
   at the beginning of the request, rather than on the state observed at the
   end of the request. Their implementation has a bug, where a success
   will be ignored if a failure occurs in the same batch. We do not
   replicate this bug.
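
A minimal sketch of the first point, for illustration only. The threshold of
5 consecutive failures comes from the message above; the type name
`BackoffState`, its methods, and the base delay, multiplier, and cap values
are hypothetical and are not taken from this commit. The "once per batch"
accounting is not modelled here.

```rust
use std::time::Duration;

/// Hypothetical per-server backoff state (name and fields are assumptions).
#[derive(Debug, Default)]
struct BackoffState {
    /// Consecutive failures observed since the last success.
    failure_count: u32,
}

/// Number of consecutive failures tolerated before delaying (from the commit message).
const FAILURE_THRESHOLD: u32 = 5;
/// Assumed values, chosen only to make the sketch concrete.
const BASE_DELAY: Duration = Duration::from_secs(30);
const MULTIPLIER: u32 = 2;
const MAX_DELAY: Duration = Duration::from_secs(60 * 60);

impl BackoffState {
    /// A success resets the consecutive-failure count.
    fn record_success(&mut self) {
        self.failure_count = 0;
    }

    /// A failure increments the consecutive-failure count.
    fn record_failure(&mut self) {
        self.failure_count = self.failure_count.saturating_add(1);
    }

    /// Returns `None` while the server is still considered healthy,
    /// otherwise an exponentially growing delay capped at `MAX_DELAY`.
    fn delay(&self) -> Option<Duration> {
        if self.failure_count < FAILURE_THRESHOLD {
            return None;
        }
        let excess = self.failure_count - FAILURE_THRESHOLD;
        // Saturate instead of overflowing for very large failure counts.
        let factor = MULTIPLIER.checked_pow(excess).unwrap_or(u32::MAX);
        Some(BASE_DELAY.saturating_mul(factor).min(MAX_DELAY))
    }
}
```

With these assumed values, the first four failures produce no delay, the
fifth starts the backoff at the base delay, and a single success clears it.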

Our parameter choices are significantly less aggressive than synapse[2], which
starts at 10m delay, has a multiplier of 2, and saturates at 4d delay.
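
To make the synapse parameters quoted above concrete, here is a rough sketch
of the schedule they imply. It assumes the delay is simply multiplied at each
step and clamped at the cap; the function name `synapse_like_delay` is
hypothetical, and any jitter or other detail of the real implementation is
ignored.

```rust
use std::time::Duration;

/// Parameters as quoted in the commit message: 10m start, x2, 4d cap.
const START: Duration = Duration::from_secs(10 * 60);
const MULT: u32 = 2;
const CAP: Duration = Duration::from_secs(4 * 24 * 60 * 60);

/// Delay after `step` multiplications (step 0 is the starting delay).
/// Hypothetical helper, assuming plain multiply-and-clamp behavior.
fn synapse_like_delay(step: u32) -> Duration {
    // Saturate instead of overflowing for large step counts.
    let factor = MULT.checked_pow(step).unwrap_or(u32::MAX);
    START.saturating_mul(factor).min(CAP)
}
```

Under that assumption the delay reaches the 4d cap at the tenth doubling,
since 10m doubled nine times is still below 4d.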

[1]: 70b0e38603/synapse/util/retryutils.py
[2]: 70b0e38603/synapse/config/federation.py (L83)
2024-11-16 20:13:09 -08:00