Define GitLab SRE Guidance for Optimizing Single Instance RTO/RPO

"I am an SRE for a small GitLab installation servicing 200 users. While we don't anticipate large scaling needs, we are under regulatory or internal requirements to have a strictly managed RTO and RPO on mission critical DevOps tooling that services all our production apps running with five nines (99.999%) uptime SLAs. We need to keep management as simple as possible and costs reasonable while maintaining a single instance.

The existing 3K architecture is overscaled (costs are too high) for our needs, so we are interested in additional configuration guidance that can minimize RTO and RPO for the single instance setup."

Example customer requirements:

Scale: 0 to 2,999 users.
Simplest possible to manage for under 500 users.
Reasonable compute and storage costs.
Reasonable operations management costs - SRE FTEs and learning curve.
RTO and RPO measured in the 1-8 hour range.

Assumptions

The configurations assume that initial and regularly scheduled RTO / RPO recovery tests are used to decide which configuration to use, and to detect when the configuration needs to improve to hold the same RTO / RPO as the instance grows in storage and/or compute.
The configurations will cite provider-specific cloud-native capabilities and services where they significantly improve the RTO and RPO of a single instance configuration.

Background

Backup scalability is a first class concern in single instance configurations.
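
One way to make this concrete is to record restore throughput from each recovery drill and estimate how much data the instance can hold before a restore no longer fits inside the RTO. The following is a minimal sketch, assuming restore time grows roughly linearly with data size on top of a fixed overhead (provisioning, reconfigure, verification). The function name and every figure are hypothetical, not measured values:

```python
def max_restorable_gb(rto_hours: float, fixed_overhead_hours: float,
                      drill_data_gb: float, drill_transfer_hours: float) -> float:
    """Estimate the largest data size whose restore still fits inside the RTO.

    Assumed linear model: restore_time = fixed_overhead + size / throughput.
    """
    if drill_transfer_hours <= 0:
        raise ValueError("drill_transfer_hours must be positive")
    # Throughput observed in the most recent recovery drill.
    throughput_gb_per_hour = drill_data_gb / drill_transfer_hours
    # Time left for moving data once the fixed overhead is spent.
    usable_hours = rto_hours - fixed_overhead_hours
    return max(0.0, usable_hours * throughput_gb_per_hour)


# Hypothetical drill: 200 GB restored in 2 h of transfer, plus 1 h fixed overhead.
limit_gb = max_restorable_gb(rto_hours=4, fixed_overhead_hours=1,
                             drill_data_gb=200, drill_transfer_hours=2)
print(f"Restore fits a 4 h RTO up to roughly {limit_gb:.0f} GB")
```

If the instance's storage approaches this estimate, that is a signal to re-run the drill and consider a faster recovery mechanism rather than a prediction of exactly when the RTO will be missed.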

Considerations

Geo does not solve recovery of the original primary instance after a secondary has been promoted. Rebuilding that primary, and testing the procedure quarterly, would seem to be nearly as much effort as a simple recovery of a single primary.

Diagram

Google Drawing TBD

Strawman Example (AWS Oriented)

For small shops I have been advising this iteration cycle to find a balance between cost + complexity and RTO:

Consider iterating over simpler architectures to get to an acceptable RTO (simplest / cheapest first):

Level 1: Single instance with AWS Backup based backup (cross-region, multi-service). Test recovery RTO. Avoids complex IaC, complex ops, and complex root cause analysis.
Level 2: Optional optimization: S3 offloading for all non-Git storage (with cross-region replication for out-of-region recovery).
Level 3: Optional: RDS and ElastiCache (AWS Backup can cover RDS).
Level 4: Single, vertically scaled Gitaly instance (no Praefect, no Praefect RDS) on AWS io2 EBS volumes, with multiple GitLab instances in front of it.
Level 5: AWS FSx for Lustre.
Level 6: Cheapest hot cross-region option: Geo to a single instance in another region (keep this one a single instance). Downside: there are always multiple instances to upgrade.
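
The iteration cycle above can be sketched as a simple selection: walk the levels from simplest to most complex and stop at the first one whose tested recovery time meets the target. This is only an illustration. The level labels are shorthand for the list above, and the helper name and measured hours are hypothetical drill results, not benchmarks:

```python
from typing import Dict, Optional

# Levels ordered simplest / cheapest first, as shorthand for the list above.
LEVELS = [
    "Level 1: single instance + AWS Backup",
    "Level 2: + S3 offloading",
    "Level 3: + RDS and ElastiCache",
    "Level 4: vertically scaled Gitaly on io2",
    "Level 5: FSx for Lustre",
    "Level 6: Geo to a single instance in another region",
]


def simplest_level_meeting_rto(measured_rto_hours: Dict[str, float],
                               target_rto_hours: float) -> Optional[str]:
    """Return the first (simplest) tested level whose measured RTO meets the target."""
    for level in LEVELS:
        measured = measured_rto_hours.get(level)
        # Levels without a recovery drill result are skipped, not assumed to pass.
        if measured is not None and measured <= target_rto_hours:
            return level
    return None


# Hypothetical recovery drill results, in hours.
drills = {
    LEVELS[0]: 6.0,
    LEVELS[1]: 3.5,
    LEVELS[2]: 2.0,
}
print(simplest_level_meeting_rto(drills, target_rto_hours=4.0))
```

Skipping untested levels is deliberate: a level only counts once a real recovery test has produced a number for it.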

Use Cases

Channel Partners setting up GitLab
GitLab for long and short term learning labs at customer and partner sites

Documentation

https://docs.gitlab.com/ee/administration/reference_architectures/1k_users.html
Terms:
- Recovery Time Objective (RTO) is the maximum amount of downtime your business can tolerate without incurring a significant financial loss.
- Recovery Point Objective (RPO) is the maximum amount of data loss your business can tolerate, measured as the interval of time before an outage for which data may be unrecoverable.
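
As a worked illustration of the two terms, the achieved values for a single incident (or recovery drill) can be computed from three timestamps and compared against the objectives. The timestamps, targets, and function names below are hypothetical:

```python
from datetime import datetime, timedelta


def achieved_rpo(last_good_backup: datetime, failure: datetime) -> timedelta:
    """Data written between the last good backup and the failure is lost."""
    return failure - last_good_backup


def achieved_rto(failure: datetime, service_restored: datetime) -> timedelta:
    """Downtime runs from the failure until the service is usable again."""
    return service_restored - failure


# Hypothetical incident timeline.
last_backup = datetime(2021, 11, 16, 2, 0)
failure = datetime(2021, 11, 16, 5, 30)
restored = datetime(2021, 11, 16, 11, 0)

# Hypothetical objectives within the 1-8 hour range.
rpo_target = timedelta(hours=4)
rto_target = timedelta(hours=8)

rpo = achieved_rpo(last_backup, failure)
rto = achieved_rto(failure, restored)
print(f"RPO achieved {rpo}, target met: {rpo <= rpo_target}")
print(f"RTO achieved {rto}, target met: {rto <= rto_target}")
```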

Edited Nov 16, 2021 by DarwinJS