grapevine

mirror of https://gitlab.computer.surgery/matrix/grapevine.git synced 2025-12-17 15:51:23 +01:00

Author	SHA1	Message	Date
Olivia Lee	5bc3fce257	log message when BackoffGuard is dropped without recording result	2024-11-16 20:52:43 -08:00
Olivia Lee	56f025cb47	metrics for online and offline remote server count	2024-11-16 20:52:43 -08:00
Olivia Lee	5b6aaa19b9	log when servers switch between online and offline	2024-11-16 20:52:43 -08:00
avdb13	080fe5af42	feat: configurable federation backoff	2024-11-16 20:52:13 -08:00
Olivia Lee	9b22c9b40b	add service for tracking backoffs to offline servers. Currently we have some exponential backoff logic scattered in different locations, with multiple distinct bad implementations. We should centralize backoff logic in one place and actually do it correctly. This backoff logic is similar to synapse's implementation[1], with a couple of fixes: - we wait until we observe 5 consecutive failures before we start delaying requests, to avoid being sensitive to a small fraction of failed requests on an otherwise healthy server. - synapse's implementation is kinda similar to our "only increment the failure count once per batch of concurrent requests" behavior, where they base the retry state written to the store on the state observed at the beginning of the request, rather than on the state observed at the end of the request. Their implementation has a bug, where a success will be ignored if a failure occurs in the same batch. We do not replicate this bug. Our parameter choices are significantly less aggressive than synapse's[2], which starts at a 10m delay, has a multiplier of 2, and saturates at a 4d delay. [1]: `70b0e38603/synapse/util/retryutils.py` [2]: `70b0e38603/synapse/config/federation.py (L83)`	2024-11-16 20:13:09 -08:00
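
The behavior described in commit 9b22c9b40b can be sketched roughly as follows. This is a minimal illustration, not grapevine's actual code: the type and method names (`BackoffConfig`, `BackoffState`, `record_failure`, and so on) are hypothetical, and only the threshold of 5 consecutive failures comes from the commit message. The sketch assumes the delay starts at `base_delay` once the threshold is reached and grows by `multiplier` for each further failure, and it models "once per batch" by having each request remember the failure count it saw when it started.

```rust
use std::time::Duration;

/// Hypothetical tuning knobs. Callers supply the values; none of these
/// are grapevine's real defaults (the commit only states the threshold of 5).
#[derive(Clone, Copy, Debug)]
struct BackoffConfig {
    /// Consecutive failures tolerated before any delay is applied.
    failure_threshold: u32,
    /// Delay applied once the threshold is first reached.
    base_delay: Duration,
    /// Growth factor applied for each failure past the threshold.
    multiplier: f64,
    /// Upper bound at which the delay saturates.
    max_delay: Duration,
}

/// Per-server backoff state (hypothetical shape).
#[derive(Debug, Default)]
struct BackoffState {
    failure_count: u32,
}

impl BackoffState {
    /// Delay to wait before the next request to this server.
    fn delay(&self, cfg: &BackoffConfig) -> Duration {
        if self.failure_count < cfg.failure_threshold {
            // Below the threshold we do not delay at all, so a few failed
            // requests on an otherwise healthy server have no effect.
            return Duration::ZERO;
        }
        let excess = self.failure_count - cfg.failure_threshold;
        let exponent = i32::try_from(excess).unwrap_or(i32::MAX);
        let secs = cfg.base_delay.as_secs_f64() * cfg.multiplier.powi(exponent);
        // Saturate at the configured maximum.
        Duration::from_secs_f64(secs.min(cfg.max_delay.as_secs_f64()))
    }

    /// Record a failed request. `observed` is the failure count the request
    /// saw when it started; if another request from the same batch already
    /// bumped the count, this call is a no-op.
    fn record_failure(&mut self, observed: u32) {
        if self.failure_count == observed {
            self.failure_count = self.failure_count.saturating_add(1);
        }
    }

    /// Record a successful request. A success always resets the count,
    /// even if other requests in the same batch failed.
    fn record_success(&mut self) {
        self.failure_count = 0;
    }
}
```

With `failure_threshold: 5`, the first four consecutive failures leave `delay` at zero, and several concurrent requests that all started at the same failure count raise it by one in total rather than once each.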

5 commits