The previous logic would increment the backoff counter both when a
request actually fails and when we do not make a request because the
server was already in backoff. This led to a positive feedback loop
where every request made while a server is in backoff increases the
backoff delay, making it impossible to recover from backoff unless the
entire backoff delay elapses with zero requests.
Failing to reset the backoff state resulted in a monotonically
increasing backoff delay. If a remote server was temporarily
unavailable, we would have a persistently increased rate of key query
failures until the backoff state was reset by a server restart. If
enough key queries were attempted while the remote was unavailable, the
backoff state could accumulate an arbitrarily long delay and effectively
block all future key queries to that server.
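A sketch of the fixed shape of this logic (the `Backoff` type and its
method names are made up for illustration, not the actual
implementation): the counter only grows when a request is actually made
and fails, and a success resets the state entirely.

```rust
use std::time::{Duration, Instant};

/// Illustrative per-server backoff state; the type and method names
/// are hypothetical, not the actual implementation.
struct Backoff {
    failures: u32,
    last_failure: Option<Instant>,
}

impl Backoff {
    /// Delay grows exponentially with consecutive *actual* failures,
    /// capped at an hour.
    fn delay(&self) -> Duration {
        Duration::from_secs(2u64.saturating_pow(self.failures).min(3600))
    }

    /// Whether to skip the request. Crucially, this no longer bumps
    /// `failures`: merely being in backoff is not itself a failure, so
    /// skipped requests can't inflate the delay.
    fn should_skip(&self) -> bool {
        self.last_failure
            .is_some_and(|at| at.elapsed() < self.delay())
    }

    /// Called only when a request was actually made and failed.
    fn record_failure(&mut self) {
        self.failures += 1;
        self.last_failure = Some(Instant::now());
    }

    /// Called on success, so the delay can't grow monotonically.
    fn record_success(&mut self) {
        self.failures = 0;
        self.last_failure = None;
    }
}
```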
This is pure code-motion, with no behavior changes. The new structure
will make it easier to fix the backoff behavior, and makes the code
somewhat less of a nightmare to follow.
This is okay because we only allow fast-forward merges, changes always
go through MRs, and MRs are subject to, e.g., the code quality checks.
This change is desirable because it should save some time and energy.
GitLab doesn't seem to have built-in support for this because of course
it doesn't.
To do this, we move the job scripts to a different file to make it
possible to share code between job scripts.
This way we don't have to be so careful about where we call `direnv
exec .` and can safely assume, e.g. in scripts, that we're always in the
direnv environment.
* Capitalize Conduit and Grapevine
* Replace a "Not" with "Note"
* Change a "Conduit 0.7.0" to "Conduit <0.8.0"
* Expand "db" to "database" and "database schema" as appropriate
This was based on the similar section in the HedgeDoc. I dropped the bit
about Grapevine users getting banned from the conduwuit rooms because we
haven't heard about that happening for a while and the original case was
pretty unclear.
Previously, attempting to delete an MXC that is only associated with
dangling thumbnails would fail, because the deletion code assumes that
every thumbnail must have a corresponding original in the database, and
errors out if it can't find the original. This assumption is incorrect
because we create dangling thumbnails when requesting a remote thumbnail
over federation without having the original file.
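A minimal sketch of the corrected control flow, against a hypothetical
key layout and an in-memory stand-in for the media database (the real
schema and API differ):

```rust
use std::collections::BTreeMap;

/// In-memory stand-in for the media database; the real schema and API
/// differ, this only illustrates the control flow of the fix.
struct MediaDb {
    entries: BTreeMap<String, Vec<u8>>,
}

impl MediaDb {
    /// Delete every object for an MXC, assuming a hypothetical key
    /// layout of "<mxc>/original" and "<mxc>/<width>x<height>".
    fn delete_mxc(&mut self, mxc: &str) {
        let prefix = format!("{mxc}/");
        let thumbnails: Vec<String> = self
            .entries
            .keys()
            .filter(|k| k.starts_with(&prefix) && !k.ends_with("/original"))
            .cloned()
            .collect();
        for key in thumbnails {
            self.entries.remove(&key);
        }
        // The buggy version insisted the original exists, along the
        // lines of:
        //     self.entries
        //         .remove(&format!("{mxc}/original"))
        //         .expect("thumbnail without original");
        // A dangling thumbnail legitimately has no original, so its
        // absence is now tolerated:
        self.entries.remove(&format!("{mxc}/original"));
    }
}
```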
When requesting remote thumbnails over federation, we can end up with a
thumbnail in the media database without an associated original file.
Because
of this, skipping thumbnails is insufficient to get a list of all MXCs.
Previously we were only skipping thumbnails that had both dimensions
nonzero. I think the assumption was that only the dimensions allowed by
`services::media::thumbnail_properties` would be used. This is not the
case, because we have used arbitrary thumbnail dimensions when
requesting remote thumbnails.
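One way to enumerate MXCs correctly, sketched against an assumed key
layout where the MXC URL comes first in a 0xFF-delimited key (the real
schema differs in its details): derive the MXC from every key,
thumbnails included, and deduplicate.

```rust
use std::collections::BTreeSet;

/// Hypothetical key layout: the MXC URL first, then 0xFF-separated
/// metadata. The real schema differs in its details.
fn mxc_of_key(key: &[u8]) -> Option<&[u8]> {
    key.split(|&b| b == 0xFF).next()
}

/// Collect the set of distinct MXCs from *every* key, originals and
/// thumbnails alike, since a remotely fetched thumbnail may be the
/// only key an MXC has. Filtering on "both dimensions nonzero" would
/// silently drop such MXCs.
fn all_mxcs<'a>(keys: impl Iterator<Item = &'a [u8]>) -> BTreeSet<Vec<u8>> {
    keys.filter_map(mxc_of_key).map(|mxc| mxc.to_vec()).collect()
}
```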
Now that we are able to distinguish between corrupted media keys and
missing files, it makes more sense to propagate the corrupted keys up to
the caller.
This is useful to easily distinguish missing files from corrupted keys.
All existing usage sites have been modified so there is no behavior
change in this commit.
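A sketch of what that distinction can look like as an error type (the
type and variant names are illustrative, not the actual code):

```rust
use std::fmt;

/// Illustrative error type; the variant names are hypothetical.
#[derive(Debug)]
enum MediaError {
    /// The key parsed fine but no file exists for it.
    NotFound,
    /// The key itself could not be parsed.
    CorruptedKey(Vec<u8>),
}

impl fmt::Display for MediaError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            MediaError::NotFound => write!(f, "media file not found"),
            MediaError::CorruptedKey(key) => {
                write!(f, "corrupted media key: {key:?}")
            }
        }
    }
}

impl std::error::Error for MediaError {}
```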
Ruma dropped a couple of dependencies and included a state resolution
performance improvement. May as well pull in everything else (except
OTel) while we're at it.
We *should* ensure that media deletion is always successful, but when a
bug causes a single object to fail deletion, it's better to try to
delete
the remaining objects than to give up entirely.
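A sketch of that best-effort loop (all names illustrative): attempt
every deletion, remember the first error, and only return it after
trying everything.

```rust
/// Delete all objects, continuing past individual failures so one bad
/// object can't block deletion of the rest. Names are illustrative.
fn delete_all<K>(
    keys: impl IntoIterator<Item = K>,
    mut delete_one: impl FnMut(&K) -> Result<(), String>,
) -> Result<(), String> {
    let mut first_error = None;
    for key in keys {
        if let Err(e) = delete_one(&key) {
            // Remember the failure but keep going.
            first_error.get_or_insert(e);
        }
    }
    // Surface the first failure only after attempting every object.
    match first_error {
        Some(e) => Err(e),
        None => Ok(()),
    }
}
```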
Our current code should never write these, because we have an allowlist
of thumbnail sizes. None of the allowed sizes contain a 0xFF byte. We
have observed keys with a 0xFF in the thumbnail size a couple of times
on real servers, and believe an early version of Conduit wrote these
before
the allowlist was added.
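To illustrate why a 0xFF byte in the size is a problem, assume a
0xFF-delimited key with the dimensions stored as big-endian u32s (the
exact layout is an assumption): any dimension whose encoding contains
0xFF makes a naive split on the delimiter see spurious extra fields.

```rust
/// Hypothetical key: MXC URL, 0xFF, then width and height as
/// big-endian u32s. The real layout may differ; this only shows the
/// parsing hazard.
fn make_key(mxc: &[u8], width: u32, height: u32) -> Vec<u8> {
    let mut key = mxc.to_vec();
    key.push(0xFF);
    key.extend_from_slice(&width.to_be_bytes());
    key.extend_from_slice(&height.to_be_bytes());
    key
}

fn main() {
    // 96x96: no 0xFF byte in the size, so the key splits into the
    // expected 2 fields.
    let ok = make_key(b"mxc://example.org/abc", 96, 96);
    // 255x255: the low byte of each dimension is 0xFF, so a naive
    // split on the delimiter now sees 4 fields, two of them garbage.
    let bad = make_key(b"mxc://example.org/abc", 255, 255);
    for key in [ok, bad] {
        let fields = key.split(|&b| b == 0xFF).count();
        println!("{fields} fields");
    }
}
```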
These keys were originally handled correctly, and were broken by
e2cba15ed2. Before that commit, we were
parsing media keys backwards, and never tried to read the thumbnail size
or MXC URL.