Currently we have some exponential backoff logic scattered in different
locations, with multiple distinct bad implementations. We should
centralize backoff logic in one place and actually do it correctly.
This backoff logic is similar to synapse's implementation[1], with a
couple of fixes:
- we wait until we observe 5 consecutive failures before we start
delaying requests, to avoid being sensitive to a small fraction of
failed requests on an otherwise healthy server.
- synapse's implementation behaves similarly to our "only increment the
failure count once per batch of concurrent requests" behavior: they
base the retry state written to the store on the state observed at the
beginning of the request, rather than on the state observed at the end
of the request. Their implementation has a bug, where a success will
be ignored if a failure occurs in the same batch. We do not replicate
this bug.
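As a rough illustration of the two points above, here is a minimal
Python sketch of the intended behaviour. It is not the code in this
change: the class name, method names, and the delay parameters
(BASE_DELAY_S, MULTIPLIER, MAX_DELAY_S) are hypothetical, and only the
threshold of 5 comes from the description above.

```python
from dataclasses import dataclass

FAILURE_THRESHOLD = 5   # from the text: failures tolerated before delaying
BASE_DELAY_S = 1.0      # hypothetical value, for illustration only
MULTIPLIER = 2.0        # hypothetical value, for illustration only
MAX_DELAY_S = 3600.0    # hypothetical value, for illustration only


@dataclass
class BackoffState:
    """Per-server backoff state (hypothetical sketch)."""

    failures: int = 0  # consecutive failures recorded so far

    def delay(self) -> float:
        """Seconds to wait before the next request; 0 below the threshold."""
        if self.failures < FAILURE_THRESHOLD:
            return 0.0
        exponent = self.failures - FAILURE_THRESHOLD
        return min(BASE_DELAY_S * MULTIPLIER ** exponent, MAX_DELAY_S)

    def begin(self) -> int:
        """Snapshot the failure count when a request starts."""
        return self.failures

    def record_failure(self, seen_at_start: int) -> None:
        # Increment only if nobody has changed the count since this request
        # started, so a batch of concurrent failures counts once.
        if self.failures == seen_at_start:
            self.failures += 1

    def record_success(self) -> None:
        # A success always resets the count, regardless of what was observed
        # at the start of the request, so it cannot be lost to a failure in
        # the same batch.
        self.failures = 0
```

In this sketch each request calls begin() before sending and passes
the snapshot to record_failure() if it fails. A success resets the
count unconditionally, which is one way to avoid the "success ignored"
behaviour described above.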
Our parameter choices are significantly less aggressive than
synapse's[2], which start at a 10m delay, use a multiplier of 2, and
saturate at a 4d delay.
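For a sense of scale, the synapse figures quoted above can be turned
into a schedule. This Python sketch takes those three numbers at face
value and assumes a plain "multiply, then clamp" rule; synapse's real
code may differ in details such as jitter, and the function name is
made up for illustration.

```python
from datetime import timedelta

# Figures quoted in the text for synapse.
INITIAL = timedelta(minutes=10)
MULTIPLIER = 2
MAXIMUM = timedelta(days=4)


def synapse_style_delay(step: int) -> timedelta:
    """Delay after `step` doublings, clamped to MAXIMUM (assumed rule)."""
    delay = INITIAL
    for _ in range(step):
        if delay >= MAXIMUM:
            break  # already saturated; stop multiplying
        delay *= MULTIPLIER
    return min(delay, MAXIMUM)


if __name__ == "__main__":
    for step in range(12):
        print(step, synapse_style_delay(step))
```

Under these assumptions the delay goes 10m, 20m, 40m, and so on,
reaching 5120m (about 3.6d) after nine doublings and clamping to 4d on
the tenth.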
[1]: 70b0e38603/synapse/util/retryutils.py
[2]: 70b0e38603/synapse/config/federation.py (L83)