When data loss is detected, repositories are marked read only to prevent a split brain in case the data can be recovered (see #2630, closed). Read-only repositories are a form of outage, and this needs to be observable so that progress on resolving the outage can be measured.
Proposal
Add Prometheus metrics that report the number of repositories that are marked read only for a virtual storage.
@samihiltunen I'm not sure if there is already some form of observability, but we were having a hard time working out what was going on without a Grafana chart. I think it would be worth tackling this in 13.1 so we aren't flying blind on this feature.
@jramsay, if the feature is enabled in the configuration, a failover would turn on the read-only mode. Any attempted writes would return a gRPC FailedPrecondition status code which would be visible in the gRPC error metrics. Should we add a separate counter for read-only mode related RPC errors to separate them from other possible FailedPrecondition statuses?
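To make the question concrete, here is a minimal sketch of what a separate counter could look like. It is not Praefect's implementation: the `ErrReadOnly` sentinel, the `ReadOnlyErrorCounter` type, and the metric name `example_read_only_rejected_writes_total` are all hypothetical. It uses only the Go standard library and renders the Prometheus text exposition format by hand to stay self-contained; a real implementation would more likely use the Prometheus Go client.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"strings"
	"sync"
)

// ErrReadOnly is a hypothetical sentinel error for writes rejected because
// the virtual storage is in read-only mode.
var ErrReadOnly = errors.New("virtual storage is in read-only mode")

// ReadOnlyErrorCounter counts rejected writes per virtual storage.
type ReadOnlyErrorCounter struct {
	mu     sync.Mutex
	counts map[string]uint64
}

// NewReadOnlyErrorCounter returns an empty counter.
func NewReadOnlyErrorCounter() *ReadOnlyErrorCounter {
	return &ReadOnlyErrorCounter{counts: make(map[string]uint64)}
}

// Observe increments the counter only when err is (or wraps) ErrReadOnly,
// so other causes of FailedPrecondition are not mixed in.
func (c *ReadOnlyErrorCounter) Observe(virtualStorage string, err error) {
	if !errors.Is(err, ErrReadOnly) {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[virtualStorage]++
}

// Expose renders the counts in the Prometheus text exposition format,
// sorted by virtual storage name for stable output.
func (c *ReadOnlyErrorCounter) Expose() string {
	c.mu.Lock()
	defer c.mu.Unlock()

	names := make([]string, 0, len(c.counts))
	for name := range c.counts {
		names = append(names, name)
	}
	sort.Strings(names)

	var b strings.Builder
	b.WriteString("# TYPE example_read_only_rejected_writes_total counter\n")
	for _, name := range names {
		fmt.Fprintf(&b, "example_read_only_rejected_writes_total{virtual_storage=%q} %d\n",
			name, c.counts[name])
	}
	return b.String()
}
```

The idea would be to call `Observe` at the point where a write is rejected, before the error is translated into a gRPC status, so that the read-only case stays distinguishable from other FailedPrecondition errors in dashboards.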
> Add Prometheus metrics that report the number of repositories that are marked read only for a virtual storage.
The read-only mode is virtual-storage wide at the moment. I'll add a gauge that shows the virtual storage's read-only status directly; that should help a lot with visibility.
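As a rough illustration of the shape such a gauge could take (a sketch only, not the actual Praefect change): one series per virtual storage, with value 1 while it is read-only and 0 otherwise. The `ReadOnlyGauge` type and the metric name `example_virtual_storage_read_only` are hypothetical, and the text exposition format is written by hand with the standard library rather than through the Prometheus Go client.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// ReadOnlyGauge tracks the read-only status of each virtual storage.
type ReadOnlyGauge struct {
	mu    sync.Mutex
	state map[string]bool
}

// NewReadOnlyGauge registers every configured virtual storage as writable,
// so each series exists with value 0 before any failover happens.
func NewReadOnlyGauge(virtualStorages ...string) *ReadOnlyGauge {
	g := &ReadOnlyGauge{state: make(map[string]bool, len(virtualStorages))}
	for _, vs := range virtualStorages {
		g.state[vs] = false
	}
	return g
}

// Set records whether the given virtual storage is currently read-only.
func (g *ReadOnlyGauge) Set(virtualStorage string, readOnly bool) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.state[virtualStorage] = readOnly
}

// Expose renders the gauge in the Prometheus text exposition format:
// 1 means read-only, 0 means writable.
func (g *ReadOnlyGauge) Expose() string {
	g.mu.Lock()
	defer g.mu.Unlock()

	names := make([]string, 0, len(g.state))
	for name := range g.state {
		names = append(names, name)
	}
	sort.Strings(names)

	var b strings.Builder
	b.WriteString("# TYPE example_virtual_storage_read_only gauge\n")
	for _, name := range names {
		value := 0
		if g.state[name] {
			value = 1
		}
		fmt.Fprintf(&b, "example_virtual_storage_read_only{virtual_storage=%q} %d\n", name, value)
	}
	return b.String()
}
```

Registering every virtual storage at 0 up front means a Grafana panel can show the transition to 1 at failover, rather than a series that only appears once something has already gone wrong.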
> The read-only mode is virtual-storage wide at the moment.
@samihiltunen to confirm, if we suspect data loss on a single repository, the entire virtual storage is marked read only? We need to think about how we iterate on this to be more targeted...
@jramsay yep, that's right. Making it per repository definitely makes sense, and the changes required for #2683 (comment 347483537) and especially #2717 (closed) should take us in the right direction. Right now, the only way to repair a repository after a failover is to use reconcile, and that works at the virtual storage level rather than on individual repositories.
I'll keep this in mind when making the changes related to those two tickets.