Define GitLab SRE Guidance for Optimizing Single Instance RTO/RPO
"I am an SRE for a small GitLab installation servicing 200 users. While we don't anticipate large scaling needs, we are under regulatory or internal requirements to have a strictly managed RTO and RPO on mission critical DevOps tooling that services all our production apps running with five nines (99.999%) uptime SLAs. We need to keep management as simple as possible and costs reasonable while maintaining a single instance.
The existing 3K reference architecture is overscaled (costs are too high) for our needs, so we are interested in additional configuration guidance that can minimize RTO and RPO for the single instance setup."
Example customer requirements:
- Scale: 0 to 2,999 users.
- Simplest possible to manage for under 500 users.
- Reasonable compute and storage costs.
- Reasonable operations management costs - SRE FTEs and learning curve.
- RTO and RPO measured in the 1-8 hour range.
Assumptions
- The configurations assume that initial and regularly scheduled production RTO / RPO recovery tests are used to decide which configuration to adopt, and to detect when that configuration must be improved to hold the same RTO / RPO as the instance grows in storage and/or compute.
- The configurations will cite cloud-native capabilities and services of specific providers where they significantly improve the RTO and RPO of a single instance configuration.
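The recovery-testing assumption can be made concrete by recording three timestamps per drill and comparing the result against the targets. The following is a minimal sketch under stated assumptions: the `RecoveryDrill` class, its field names, and the example timestamps are hypothetical and do not come from any GitLab or AWS tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryDrill:
    """Timestamps recorded during one recovery test (hypothetical fields)."""
    last_good_backup: datetime   # point in time the restored data reflects
    outage_start: datetime       # when the simulated failure began
    service_restored: datetime   # when GitLab was serving users again

    @property
    def measured_rto(self) -> timedelta:
        # Downtime actually experienced during the drill.
        return self.service_restored - self.outage_start

    @property
    def measured_rpo(self) -> timedelta:
        # Window of data lost: everything written after the restored backup.
        return self.outage_start - self.last_good_backup


def meets_targets(drill: RecoveryDrill,
                  rto_target: timedelta,
                  rpo_target: timedelta) -> bool:
    """True when both measured values are within their targets."""
    return (drill.measured_rto <= rto_target
            and drill.measured_rpo <= rpo_target)


# Illustrative drill; the timestamps are made up for the example.
drill = RecoveryDrill(
    last_good_backup=datetime(2024, 1, 10, 2, 0),
    outage_start=datetime(2024, 1, 10, 5, 30),
    service_restored=datetime(2024, 1, 10, 8, 15),
)

print("measured RTO:", drill.measured_rto)
print("measured RPO:", drill.measured_rpo)
print("within 4h/4h targets:",
      meets_targets(drill, timedelta(hours=4), timedelta(hours=4)))
```

Repeating the same check after each scheduled drill gives a simple signal for when growth in storage or compute has pushed the current configuration past its targets.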
Background
- Backup scalability is a first class concern in single instance configurations.
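One way to see why backup scalability matters: with periodic backups, the worst-case data loss is roughly the backup interval plus the time a backup takes to complete, so a backup that slows down as repositories grow erodes the RPO even if the schedule never changes. Below is a minimal sketch of that arithmetic. It assumes each backup captures the state at the moment it starts and is only usable once it finishes; the function name and all numbers are illustrative, not measurements.

```python
def worst_case_rpo_hours(backup_interval_hours: float,
                         backup_duration_hours: float) -> float:
    """Worst-case data-loss window for periodic backups.

    Assumption: a backup reflects the state at its start time and becomes
    usable only when it completes. A failure just before the next backup
    completes therefore loses one full interval plus one backup duration.
    """
    if backup_interval_hours <= 0 or backup_duration_hours < 0:
        raise ValueError("interval must be positive, duration non-negative")
    return backup_interval_hours + backup_duration_hours


RPO_TARGET_HOURS = 8.0  # upper end of the 1-8 hour range in the requirements

# Same 4-hour schedule, but the backup takes longer as data grows.
for duration in (0.5, 2.0, 5.0):
    rpo = worst_case_rpo_hours(4.0, duration)
    status = "ok" if rpo <= RPO_TARGET_HOURS else "exceeds target"
    print(f"duration {duration}h -> worst-case RPO {rpo}h ({status})")
```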
Considerations
- Geo does not solve recovery of the primary instance after a secondary has been promoted. That effort, and the quarterly testing of it, appears to be nearly as intense as a simple recovery of a single primary.
Diagram
- Google Drawing TBD
Strawman Example (AWS Oriented)
For small shops I have been advising this iteration cycle to find a balance between cost + complexity and RTO:
Consider iterating over simpler architectures to get to an acceptable RTO (simplest / cheapest first):
- Level 1: Single instance with AWS Backup-based backups (cross-region, multi-service). Test the recovery RTO. Avoids complex IaC, complex ops, and complex root cause analysis.
- Level 2: Optional optimization: S3 offloading for all non-Git storage (cross-region replication for out-of-region recovery).
- Level 3: Optional: RDS and ElastiCache (AWS Backup can cover RDS).
- Level 4: Single vertically scaled Gitaly instance (no Praefect, no Praefect RDS) using AWS io2 EBS volumes, with multiple GitLab instances in front of it.
- Level 5: AWS FSx for Lustre
- Level 6: Cheapest hot cross-region option: Geo to a single instance in another region (keep this one a single instance). This means there are always multiple instances to upgrade.
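The iteration cycle above amounts to walking the levels in order and stopping at the first one whose tested RTO meets the target. Here is a minimal sketch of that selection logic. The helper name and the RTO figures are hypothetical placeholders, not benchmarks; real values would come from recovery tests of each level.

```python
from typing import Dict, Optional

# Levels in the order listed above (simplest / cheapest first).
LEVELS = [
    (1, "Single instance with AWS Backup"),
    (2, "S3 offloading for non-Git storage"),
    (3, "RDS and ElastiCache"),
    (4, "Vertically scaled single Gitaly on io2 EBS"),
    (5, "AWS FSx for Lustre"),
    (6, "Geo to a single instance in another region"),
]


def first_level_meeting_rto(measured_rto_hours: Dict[int, float],
                            target_hours: float) -> Optional[int]:
    """Return the simplest tested level whose measured RTO meets the target.

    Levels with no recorded test result are skipped; None means no tested
    level is good enough yet, so the next level should be tried.
    """
    for level, _name in LEVELS:
        rto = measured_rto_hours.get(level)
        if rto is not None and rto <= target_hours:
            return level
    return None


# Placeholder drill results in hours (hypothetical, for illustration only).
results = {1: 9.0, 2: 6.5, 3: 5.0}

choice = first_level_meeting_rto(results, target_hours=8.0)
print("simplest level meeting an 8h RTO:", choice)
```

Stopping at the first acceptable level keeps the operational footprint as small as the RTO target allows, which matches the cost and complexity goals in the requirements.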
Use Cases
- Channel Partners setting up GitLab
- GitLab for long and short term learning labs at customer and partner sites
Documentation
- https://docs.gitlab.com/ee/administration/reference_architectures/1k_users.html
- Terms:
- Recovery Time Objective (RTO) is the maximum amount of downtime your business can tolerate without incurring a significant financial loss.
- Recovery Point Objective (RPO) is the maximum amount of data loss, measured as an interval of time before the outage, that your business can tolerate.