Production readiness review for Geo
EDIT:
Issue was closed in favor of https://docs.google.com/document/d/1ShWIgTtzcIcRfSQo8G8HQjyTdMzhqmjJwtGG5ovR1WM/edit
When this issue is complete, we should feel comfortable deploying Geo to GitLab.com.
Summary
- Provide a high-level summary of this new product feature. Explain how this change will benefit GitLab customers. Enumerate the customer use cases.
- What metrics, including business metrics, should be monitored to ensure that this feature launch is a success?
Architecture
- Add architecture diagrams to this issue of feature components and how they interact with existing GitLab components. Include internal dependencies, ports, security policies, etc.
- Describe each component of the new feature and enumerate what it does to support customer use cases.
- For each component and dependency, what is the blast radius of failures? Is there anything in the feature design that will reduce this risk?
- If applicable, explain how this new feature will scale and any potential single points of failure in the design.
Operational Risk Assessment
- What are the potential scalability or performance issues that may result with this change?
- List the external and internal dependencies of the application (e.g., Redis, PostgreSQL) for this feature, and describe how the feature will be impacted by a failure of each dependency.
- Were there any features cut or compromises made to make the feature launch?
- List the top three operational risks when this feature goes live.
- What are a few operational concerns that will not be present at launch, but may be a concern later?
- Can the new product feature be safely rolled back once it is live? Can it be disabled using a feature flag?
- Document every way the customer will interact with this new feature and how customers will be impacted by a failure of each interaction.
- As a thought experiment, think of worst-case failure scenarios for this product feature. How can the blast radius of each failure be isolated?
Database
- If we use a database, is the data structure verified and vetted by the database team?
- Do we have an approximate growth rate of the stored data (for capacity planning)?
- Can we age data and delete data of a certain age?
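For the growth-rate question, a rough projection is usually enough to start a capacity-planning conversation. The sketch below assumes compound monthly growth, which may not match how the feature's data actually grows. The function names and any figures passed to them are hypothetical, not measurements from GitLab.com.

```python
def projected_size_gb(current_gb: float, monthly_growth_rate: float, months: int) -> float:
    """Project storage size after `months`, assuming compound monthly growth."""
    if current_gb < 0 or months < 0:
        raise ValueError("current_gb and months must be non-negative")
    return current_gb * (1.0 + monthly_growth_rate) ** months


def months_until_full(current_gb: float, monthly_growth_rate: float, capacity_gb: float) -> int:
    """Return the number of whole months until the projection reaches capacity_gb."""
    if current_gb <= 0 or monthly_growth_rate <= 0:
        raise ValueError("current_gb and monthly_growth_rate must be positive")
    months = 0
    # Step forward one month at a time until the projected size hits capacity.
    while projected_size_gb(current_gb, monthly_growth_rate, months) < capacity_gb:
        months += 1
    return months
```

If data can be aged out, the projection should be applied to the retained window only, since deletions offset part of the growth.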
Security
- Were the GitLab security development guidelines followed for this feature?
- If this feature requires new infrastructure, will it be updated regularly with OS updates?
- Has effort been made to obscure or elide sensitive customer data in logging?
- Is any potentially sensitive user-provided data persisted? If so, is this data encrypted at rest?
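As an illustration of the logging question above, one common approach is to mask known-sensitive fields before an event is serialized. This is a minimal sketch, not how Geo or GitLab handles logging: the key list, the `[REDACTED]` marker, and the function names are all assumptions chosen for the example.

```python
import json
from typing import Any

# Hypothetical set of keys treated as sensitive; adjust to the feature's real payloads.
SENSITIVE_KEYS = {"password", "token", "secret", "authorization"}


def redact(value: Any) -> Any:
    """Return a copy of value with sensitive dictionary entries masked."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


def to_log_line(event: dict) -> str:
    """Serialize an event as a single JSON log line after redaction."""
    return json.dumps(redact(event), sort_keys=True)
```

A key-based filter like this only catches fields that are named predictably. Sensitive values embedded in free-text messages or URLs need separate handling.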
Performance
- Explain what validation was done following GitLab's performance guidelines. Please explain or link to the results below.
- Are there any potential performance impacts on the database when this feature is enabled at GitLab.com scale?
- Are there any throttling limits imposed by this feature? If so, how are they managed?
- If there are throttling limits, what is the customer experience of hitting a limit?
- For all dependencies external and internal to the application, are there retry and back-off strategies for them?
- Does the feature account for brief spikes in traffic, at least 2x above the expected TPS?
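To make the retry and back-off question concrete, the sketch below shows one widely used pattern: capped exponential back-off with full jitter. It is an illustration under assumptions, not the strategy any GitLab component is known to use. The function name, the default values, and the choice to retry on every exception are hypothetical.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying on any exception with capped exponential back-off and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error to the caller.
            # The delay ceiling doubles on each attempt and is capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: sleep a random time in [0, ceiling] to avoid synchronized retries.
            sleep(random.uniform(0.0, ceiling))
    raise AssertionError("unreachable")
```

The jitter matters during the traffic spikes mentioned above: without it, many clients that failed together retry together and can prolong the overload. In a real service, the retry should also be limited to errors that are safe to retry.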
Backup and Restore
- Outside of existing backups, is there any other customer data that needs to be backed up for this product feature?
- Are backups monitored?
- Was a restore from backup tested?
Monitoring and Alerts
- Is the service logging in JSON format, and are logs forwarded to Logstash?
- Is the service reporting metrics to Prometheus?
- How is the end-to-end customer experience measured?
- Do we have a target SLA in place for this service?
- Do we know which indicators (SLIs) map to the target SLA?
- Do we have alerts that are triggered when the SLIs (and thus the SLA) are not met?
- Do we have troubleshooting runbooks linked to these alerts?
- What are the thresholds for tweeting or issuing an official customer notification for an outage related to this feature?
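The relationship between an SLI, its target, and an alert can be sketched with a simple availability ratio. This example assumes a request-based availability SLI, which is only one possible indicator for this feature. The function names and the numbers in the usage note are hypothetical.

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Availability SLI: the fraction of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return successful_requests / total_requests


def error_budget_remaining(sli: float, target: float) -> float:
    """Fraction of the error budget left; negative once the target is breached."""
    allowed_failure = 1.0 - target
    if allowed_failure <= 0:
        raise ValueError("target must be below 1.0 to leave an error budget")
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure


def should_alert(sli: float, target: float) -> bool:
    """Alert when the measured SLI falls below the target."""
    return sli < target
```

For instance, with a hypothetical target of 0.9995, a window of 999 successes out of 1000 requests gives an SLI of 0.999, so `should_alert` returns `True`. In practice the alert would be defined in the monitoring system rather than in application code, and would typically evaluate the SLI over a time window.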
Responsibility
- Which individuals are the subject matter experts and know the most about this feature?
- Which team or set of individuals will take responsibility for the reliability of the feature once it is in production?
- Is someone from the team who built the feature on call for the launch? If not, why not?
Testing
- Describe the load test plan used for this feature. What breaking points were validated?
- For the component failures that were theorized for this feature, were they tested? If so, include the results of these failure tests.
- Give a brief overview of the tests that are run automatically in GitLab's CI/CD pipeline for this feature.