Previously we were mashing everything together as RoomAccountDataEvent,
even the global events. This technically worked, because of the hidden
custom fields on the ruma event types, but it's confusing and easy to
mess up. Separate methods with appropriate types are preferable.
Previously we were only using trust-dns for resolving SRV records in
server discovery, and then for resolving the hostname from the SRV
record target if one exists. With the previous behavior, admins need to
ensure that both their system resolver and trust-dns are working
correctly in order for outgoing traffic to work reliably. This can be
confusing to debug, because it's not obvious to the admin if or when
each resolver are being used. Now, everything goes through trust-dns and
outgoing federation DNS should fail/succeed more predictably.
I also expect some performance improvement from having an in-process DNS
cache, but haven't taken measurements yet.
The primary motivation for this change is to support databases that
don't take a path, e.g. out of process databases.
This configuration structure leaves the door open for other media
storage mechanisms in the future, such as S3.
It's also structured to avoid `#[serde(flatten)]` so that we can use
`#[serde(deny_unknown_fields)]`.
This method did _a lot_ of things at the same time. In order to use
`KeyValueDatabase` for the migrate-db command, we need to be able to
open a db without attempting to apply all the migrations and without
spawning a bunch of unrelated background tasks.
The state after this refactor is still not great, but it's enough to do
a migration tool.
ReloadHandle is taken from conduwuit commit
8a5599adf9eafe9111f3d1597f8fb333b8b76849, authored by Benjamin.
Co-authored-by: Benjamin Lee <benjamin@computer.surgery>
This renames:
database_backend -> database.backend
database_path -> database.path
db_cache_capacity_mb -> database.cache_capacity_mb
rocksdb_max_open_files -> database.rocksdb_max_open_files
Charles updated the NixOS module.
Co-authored-by: Charles Hall <charles@computer.surgery>
Previously, we only fetched keys once, only requesting them again if we have any missing, allowing for ancient keys to be used to sign PDUs and transactions
Now we refresh keys that either have or are about to expire, preventing attacks that make use of leaked private keys of a homeserver
We also ensure that when validating PDUs or transactions, that they are valid at the origin_server_ts or time of us receiving the transaction respectfully
As to not break event authorization for old rooms, we need to keep old keys around
We move verify_keys which we no longer see in direct requests to the origin to old_verify_keys
We keep old_verify_keys indefinitely as mentioned above, as to not break event authorization (at least until a future MSC addresses this)
Original patch by Matthias. Benjamin just rebased it onto grapevine and
fixed clippy/rustc warnings.
Co-authored-by: Benjamin Lee <benjamin@computer.surgery>
Unfortunately we need to pull tracing-opentelemetry from git because
there hasn't been a release including the dependency bump on the other
opentelemetry crates.
This cache can serve invalid responses, and has an extremely low hit
rate.
It serves invalid responses because because it's only keyed off
the `since` parameter, but many of the other request parameters also
affect the response or it's side effects. This will become worse once we
implement filtering, because there will be a wider space of parameters
with different responses. This problem is fixable, but not worth it
because of the low hit rate.
The low hit rate is because normal clients will always issue the next
sync request with `since` set to the `prev_batch` value of the previous
response. The only time we expect to see multiple requests with the same
`since` is when the response is empty, but we don't cache empty
responses.
This was confirmed experimentally by logging cache hits and misses over
15 minutes with a wide variety of clients. This test was run on
matrix.computer.surgery, which has only a few active users, but a
large volume of sync traffic from many rooms. Over the test period, we
had 3 hits and 5309 misses. All hits occurred in the first minute, so I
suspect that they had something to do with client recovery from an
offline state. The clients that were connected during the test are:
- element web
- schildichat web
- iamb
- gomuks
- nheko
- fractal
- fluffychat web
- fluffychat android
- cinny web
- element android
- element X android
Fixes: #2
This change is fully automated, except the `rustfmt.toml` changes and
a few clippy directives to allow specific functions with too many lines
because they are longer now.