Cells Topology Service Production Readiness Review (GA)
Production Readiness
This issue serves as a tracking issue to guide you through the readiness review. It's not the production readiness document itself! The readiness documentation will be added to the project with a merge request, where stakeholders from different teams can collaborate.
Readiness MR
Related epic: gitlab-com/gl-infra/tenant-scale/cells-infrastructure&4
Reviewers
The reviewers will be filled in as one of the steps of the checklist below. If a reviewer in the "Mandatory" section is not allocated, please add the reason why next to the name.
Mandatory
- Runway Team: reviewer name
- InfraSec: reviewer name
Readiness Checklist
The following items should be completed by the person initiating the readiness review:
- Review the Production Readiness Review handbook page.
- Create this issue and assign it to yourself.
- Set a due date for when you believe the readiness will be completed (this can be updated later if necessary).
- Add an appropriate label to the issue from the list below. Review the labels periodically to ensure the appropriate label is assigned to keep the review progressing.
  - ~"workflow-infra::Triage": The author has an idea of the feature or change but is pending a decision to proceed with it.
  - ~"workflow-infra::Proposal": A decision to proceed with an idea or change has been made and the Readiness MR is being prepared.
  - ~"workflow-infra::Ready": The Readiness MR is ready and awaiting review. Issue is assigned to the DRI/author.
  - ~"workflow-infra::In Progress": Review discussions are ongoing between the DRI and SRE Reviewer. Issue is assigned to the DRI and SRE Reviewer.
  - ~"workflow-infra::Done": The Readiness review is complete, and the Readiness MR is accepted and merged.
  - ~"workflow-infra::Cancelled": Readiness review is no longer required due to other external reasons. After applying this label, the issue will be closed.
  - ~"workflow-infra::Stalled": Review is paused due to a change in priority.
  - ~"workflow-infra::Blocked": Review is blocked due to external dependencies or other factors. Where possible, a blocking issue should also be set.
- In the "Reviewers" section above, add the reviewer names. To get reviewers assigned, reach out to the engineering manager of the corresponding team by @mentioning the team members leading the following groups:
  - Runway: Reach out to Runway Team
  - Delivery: Reach out to Delivery management
  - InfraSec: Create an issue in this team's tracker. More information is available on the Infrastructure Security Team's handbook page. After the issue is created, put a link to the issue next to the InfraSec reviewer item in the "Reviewers" section above and add the reviewer name after one has been assigned.
- Create the first draft of the readiness review by copying the template below and submitting an MR. Do not remove any items or sections in the template. Only the items up to and including the target maturity level need to be filled in. For example, for ~"Readiness::Beta", all sections under Beta and Experiment will need to be completed.
- Assign the initial set of reviewers to the MR. Once the MR has been assigned, add the label ~"workflow-infra::In Progress" to this issue.
- Add a link to the MR in the "Readiness MR" section at the top of this issue.
- Once the MR has been sent out for review, add a ~"Readiness::*" scoped label for the corresponding target maturity level for the review.
- When the last review of the MR is complete and it is merged, do one of the following:
  - If the feature will remain at the current maturity level for an uncertain amount of time, close the issue and add a ~"workflow-infra::done" label to the issue.
  - If the feature will need to be reviewed for the next maturity level soon, add the corresponding ~"Readiness::*" scoped label and repeat the process using the same issue.
- (Optional) If it is later decided not to proceed with this proposal, add ~"workflow-infra::Cancelled" and close this issue.
Readiness MR Template
Expand the section below to view the readiness template; this will be the starting point for the readiness merge request.
Create <name>/index.md as a new merge request with the following content, where <name> is something short and descriptive for the change being proposed.
The Readiness Review document is designed to help you prepare your features and services for the GitLab Production Platforms. Please engage with the relevant teams as soon as possible to begin review even if there are incomplete items below. All sections should be completed up to the current maturity level. For example, if the target maturity is "Beta", then items under "Experiment" and "Beta" should be completed.
While you are encouraged to fill out as much of this document as possible, not all of the items below will be relevant. Leave all non-applicable items intact and add 'N/A' or the reason they do not apply in place of the response. This Guide is just that, a Guide. If something is not asked, but should be, it is strongly encouraged to add it as necessary.
Beta
Monitoring and Alerting
The items below will be reviewed by the Runway team.
- Link to examples of logs on https://logs.gitlab.net
- Link to the Grafana dashboard for this service.
- [Security Compliance] If applicable, does the new service have the Wiz runtime sensor installed? If unsure, consult the InfraSec team in the #security-infrasec Slack channel to determine applicability.
Backup, Restore, DR and Retention
The items below will be reviewed by the Runway team.
- [Security Compliance] Are there custom backup/restore requirements?
- [Security Compliance] Are backups monitored?
- [Security Compliance] Was a restore from backup tested?
- Link to information about growth rate of stored data.
- [Security Compliance] Will backups be configured to be compliant with GitLab.com backup policies (if applicable)?
Deployment
The items below will be reviewed by the Delivery team.
- [Security Compliance] Will a change management issue be used for rollout? If so, link to it here.
- [Security Compliance] Will subsequent changes to the service follow the GitLab Change Management Standard? Changes made via MR in projects with appropriately configured merge/branch settings are automatically compliant for SOD and required approvals.
- [Security Compliance] Will relevant GitLab projects used to manage the service have appropriate MR approval/protected branch settings per the GitLab Projects Baseline Requirements page?
- Does this feature have any version compatibility requirements with other components (e.g., Gitaly, Sidekiq, Rails) that will require a specific order of deployments?
- Is this feature validated by our QA blackbox tests?
- Will it be possible to roll back this feature? If so, explain how it will be possible.
Security
The items below will be reviewed by the InfraSec team.
- Put yourself in an attacker's shoes and list some examples of "What could possibly go wrong?". Are you OK going into Beta knowing that?
- Link to any outstanding security-related epics & issues for this feature. Are you OK going into Beta with those still on the TODO list?
General Availability
Monitoring and Alerting
The items below will be reviewed by the Runway team.
- Link to the troubleshooting runbooks.
- Link to an example of an alert and a corresponding runbook. Runway automatically generates alerts for errors, apdex, and traffic cessation. Runway also automatically generates alerts for resource saturation:
  - dashboards.gitlab.net/d/alerts-sat_runway_container_cpu/e0a90721-e5ee-5feb-aed8-11cbff05b7ee?var-PROMETHEUS_DS=mimir-runway&var-type=Runway Service ID
  - dashboards.gitlab.net/d/alerts-sat_runway_container_memory/f4132aba-3db6-5019-b238-e0b0d481596b?var-PROMETHEUS_DS=mimir-runway&var-type=Runway Service ID
- Confirm that on-call SREs have access to this service and will be on-call. If this is not the case, please add an explanation here. Runway uses GCP projects that on-call SREs have access to.
Operational Risk
The items below will be reviewed by the Runway team.
- Link to notes or testing results for assessing the outcome of failures of individual components.
- What are the potential scalability or performance issues that may result from this change?
- What are a few operational concerns that will not be present at launch, but may be a concern later?
- Are there any single points of failure in the design? If so, list them here.
- As a thought experiment, think of worst-case failure scenarios for this product feature. How can the blast radius of the failure be isolated?
Backup, Restore, DR and Retention
The items below will be reviewed by the Runway team.
- Are there any special requirements for Disaster Recovery for both Regional and Zone failures beyond our current Disaster Recovery processes that are in place?
- How does data age? Can data over a certain age be deleted?
Performance, Scalability and Capacity Planning
The items below will be reviewed by the Runway team.
- Link to any performance validation that was done according to performance guidelines.
- Link to any load testing plans and results.
- Are there any potential performance impacts on the Postgres database or Redis when this feature is enabled at GitLab.com scale?
- Explain how this feature uses our rate limiting features.
- Are there retry and back-off strategies for external dependencies?
- Does the feature account for brief spikes in traffic, at least 2x above the expected rate? Runway automatically scales instances based on traffic.
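When answering the retry and back-off item above, it can help to show the shape of the strategy in use. The following is only an illustrative sketch of exponential back-off with full jitter, not the Topology Service's actual client code. The names `TransientError`, `backoff_delays`, and `call_with_retries`, and the default values for `base` and `cap`, are assumptions made for this example.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 503s)."""


def backoff_delays(attempts, base=0.1, cap=5.0, rng=random.random):
    """Yield one 'full jitter' delay per retry.

    Retry n waits rng() * min(cap, base * 2**n) seconds, so the upper
    bound doubles each time until it reaches the cap.
    """
    for n in range(attempts):
        yield rng() * min(cap, base * (2 ** n))


def call_with_retries(fn, attempts=4, base=0.1, cap=5.0, sleep=time.sleep):
    """Call fn(); on TransientError, wait and retry up to `attempts` times."""
    for delay in backoff_delays(attempts, base, cap):
        try:
            return fn()
        except TransientError:
            sleep(delay)
    # Final try: let the last error propagate to the caller.
    return fn()
```

Passing `sleep` and `rng` as parameters keeps the sketch testable without real waiting. A readiness answer would normally also state which errors count as transient and the total time budget per request.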
Deployment
The items below will be reviewed by the Delivery team.
- Will a change management issue be used for rollout? If so, link to it here.
- Are there healthchecks or SLIs that can be relied on for deployment/rollbacks? Runway automatically rolls back deployments experiencing elevated error rates. Runway supports healthchecks for startup and liveness probes.
- Does building artifacts or deployment depend at all on gitlab.com? Runway depends on GitLab.com. Runway optionally supports deploying from ops for critical services.
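For the healthcheck item above, a minimal sketch of what startup and liveness endpoints might look like is shown below. It is an illustration under stated assumptions, not Runway's probe contract: the paths `/livez` and `/startupz`, the `STATE` flag, and the `serve` helper are all hypothetical, and the real probe paths come from the service's own configuration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical flag; a real service would set it once its dependencies are reachable.
STATE = {"started": False}


def probe_status(path, state):
    """Map a probe path to an HTTP status code."""
    if path == "/livez":
        return 200  # the process is up and able to answer
    if path == "/startupz":
        return 200 if state["started"] else 503
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Answer with a status code only; probes do not need a body.
        self.send_response(probe_status(self.path, STATE))
        self.end_headers()

    def log_message(self, *args):
        pass  # keep probe traffic out of stderr


def serve(port=8080):
    """Mark the service as started and answer probes until interrupted."""
    STATE["started"] = True
    HTTPServer(("127.0.0.1", port), ProbeHandler).serve_forever()
```

Keeping the status decision in `probe_status` separates the probe logic from the HTTP plumbing, which makes it easy to check the behaviour that a rollback decision would rely on.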