Commit 8c44c9c0 authored by John Jarvis

Updates the readiness template

parent 9934e8e9
1 merge request: !108 Updates the readiness template
# Production Readiness Guide
/label ~"workflow-infra::Proposal"
As per [the handbook description](https://about.gitlab.com/handbook/engineering/infrastructure/production/readiness/#starting-a-proposal)
this issue is a tool that will help prepare the readiness document. **It's not the production readiness
document itself!**
## Production Readiness
**The readiness documentation should be added to the [project](https://gitlab.com/gitlab-com/gl-infra/readiness/)
with a merge request, where different interested parties can collaborate. You can start with a simple
one-page proposal as in the [example merge request](https://gitlab.com/gitlab-com/gl-infra/readiness/-/merge_requests/1).**
For any new feature or service in production, or any change to an existing one, the questions in this guide will help make the change more robust when it is enabled on GitLab.com.
Before starting, please review the [Production Readiness Review](https://about.gitlab.com/handbook/engineering/infrastructure/production/readiness/) document in the handbook.
**This issue serves as a tracking issue to guide you through the readiness review. It's not the production readiness document itself!**
**The readiness documentation will be added to the [project](https://gitlab.com/gitlab-com/gl-infra/readiness/) with a merge request, where different interested parties can collaborate.**
---
For any new or existing large feature set, the questions in this guide help make it more robust and prevent it from impacting the availability of our platform.
## Readiness MR
{+ Add link to the readiness MR when it is created +}
## Reviewers
The reviewers should be filled in as one of the steps of the checklist below.
If a reviewer in the "Mandatory" section is not allocated, please add the reason why next to the name.
**The reviewer will check the box next to their name when the review is complete**
### Mandatory
- [ ] Reliability: {+ reviewer name +}
- [ ] Delivery: {+ reviewer name +}
- [ ] InfraSec: {+ reviewer name +}
Initially, this guide is likely to be used by Production Engineers who are embedded with other teams working on existing services and features. However, anyone working on a new feature set is encouraged to use this guide as well.
### Optional
The goal of this guide is to help others understand how the new feature set may impact the rest of the (production) system; what steps need to be taken (besides deploying this new feature) to ensure that it can be properly managed; and what it will take to manage the reliability of the new system / feature / service beyond its initial deployment.
_Delete these reviewers if they do not apply_
For readiness review of *infrastructure services* use this issue template instead:
[service_readiness.md](service_readiness.md)
- [ ] Development: {+ reviewer name +}
- [ ] Scalability: {+ reviewer name +}
- [ ] Database: {+ reviewer name +}
We encourage filling out all parts of this document. All
sections are mandatory and shall be considered a blocker prior to the review being
completed. If some questions do not have an answer, or are potentially not
relevant to the service in question, leave the question intact and state the
reasons for not having an answer or note why it isn't relevant.
## Readiness Checklist
This Guide is just that, a Guide. If something is not asked, but should be, it
is strongly encouraged to add it as necessary. If you feel like something is missing, please consider submitting an MR against the template.
The following items should be completed by the person initiating the readiness review:
- [ ] Create this issue and assign it to yourself. Set a due-date for when you believe the readiness will be completed (this can be updated later if necessary).
- [ ] Review the [Production Readiness Review](https://about.gitlab.com/handbook/engineering/infrastructure/production/readiness/) handbook page.
- [ ] In the "Reviewers" section above, add the reviewer names. Names will be assigned by reaching out to the engineering manager of the corresponding team.
- [ ] Create the first draft of the readiness review by copying the template below and submitting an MR. Add the label ~"workflow-infra::In Progress" to this issue.
- [ ] Add a link to the MR in the "Readiness MR" section at the top of this issue
- [ ] Assign the initial set of reviewers to the MR. There can be multiple iterations of the MR if needed; it is often helpful to have the first draft reviewed by members of the same team. **Approval of the MR does not mean the readiness document is approved; approvals will be done later on this issue.**
- [ ] When the last review of the MR is complete, ask the reviewers in the "Reviewers" section above to check the box next to their name if they are satisfied with the review and have no more questions or concerns.
- [ ] (Optional) If it is later decided to not proceed with this proposal, add ~"workflow-infra::Cancelled" and close this issue
When all boxes have been checked in the "Reviewers" section, add the ~"workflow-infra::Done" label and close the issue.
## Readiness MR Template
Expand the section below to view the readiness template. This will be the starting point for the readiness merge request.
**Create `<name>/index.md` as a new merge request with the following content, where `<name>` is something short and descriptive for the change being proposed**
<details>
_While it is encouraged for parts of this document to be filled out, not all of the items below will be relevant. Leave all non-applicable items intact and state why they do not apply in place of the response._
_This Guide is just that, a Guide. If something is not asked, but should be, it is strongly encouraged to add it as necessary._
## Summary
......@@ -106,3 +141,5 @@ is strongly encouraged to add it as necessary. If you feel like something is mi
- [ ] **Describe the load test plan used for this feature. What breaking points were validated?**
- [ ] **For the component failures that were theorized for this feature, were they tested? If so include the results of these failure tests.**
- [ ] **Give a brief overview of what tests are run automatically in GitLab's CI/CD pipeline for this feature?**
</details>
# Operational Readiness Guide for Infrastructure Services
For new services introduced in our GitLab SaaS infrastructure we want to review
them for operational readiness before going live. This checklist should be a
guide to help cover all important aspects of an infrastructure readiness
review. Not all points are mandatory; apply what makes sense. The goal of the
readiness review should be to identify gaps, create issues for them, link them
in the review and bring them to a solution.
The readiness review should mostly link to design docs or runbooks for referral.
It is recommended to write most of the information falling out of the readiness
review into the
[README.md](https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/README.md#readmemd)
in the service's runbook directory. Runbooks should be the main source of
truth. Design docs and readiness reviews tend to be point-in-time snapshots and
should not duplicate information in runbooks.
This guide can also be used to retroactively review existing services.
For readiness review of new product features use this issue template:
[production_readiness.md](production_readiness.md)
## Summary
* [ ] Short overview mentioning purpose of the service, dependencies and owners
* [ ] Explain the scope of this review and what is explicitly out of scope.
## Architecture
* [ ] Runbook README.md contains an architecture overview (provide link)
* [ ] Runbook README.md contains a logical architecture diagram
* [ ] Runbook README.md contains a physical architecture diagram (optional)
* [ ] Runbook README.md provides enough information for a reviewer to get an
understanding of the service and its components, dependencies and
interactions
* [ ] Service Configurations are audited and our chosen values
are explained
* This includes configurations that are dropped in files, environment
variables, as well as command line options to the service binary
* [ ] Differences for how the Service is configured between differing
environments are explained
* [ ] Readiness reviews are completed on any supporting service that is
introduced or heavily relied upon
## Documentation
* [ ] is there a blueprint/design doc? (provide link)
* [ ] do we have runbooks? (provide links)
* [ ] are runbooks up-to-date?
* [ ] where else is documentation for this service located?
* [ ] is there a service catalog entry? (provide link)
* [ ] does the service catalog list all dependencies?
* [ ] does the service catalog link to all existing documentation?
* [ ] does the service catalog link to the readiness review?
## Performance
* [ ] is there a runbook section with performance characteristics? (it should
cover the following considerations, provide link)
* [ ] current requests/s (min, max, average), latency characteristics,
saturation, ...
* [ ] throttling/limits
* [ ] bottlenecks (cpu-bound, memory-bound, ...)
* [ ] is there documentation on how/why we set certain config options that are
affecting performance?
## Scalability
* [ ] is there a runbook section with scalability information? (it should cover
the following considerations, provide link)
* [ ] expected load in the future
* [ ] how can we scale to the expected load?
* [ ] can it be scaled across availability zones or regions?
* [ ] are there scalability limitations?
* [ ] are we doing performance tests?
## Availability
* [ ] is there a runbook section covering availability considerations? (it
should cover the following topics, provide link)
* [ ] failure modes of this service, blast radius, how long does it take to
recover?
* [ ] what happens on outage of services we are depending on?
* [ ] Availability Zone (AZ) outage
* [ ] split brain between AZs
* [ ] region outage
* [ ] other external dependencies that could affect availability
* [ ] what other services are affected by an outage of this service?
* [ ] is there an existing Recovery Time Objective (RTO) documented? How do we
plan to achieve it?
* [ ] do we have an error budget?
* [ ] are we doing disaster recovery tests?
* [ ] is there a failover procedure? Do we have runbook instructions?
* [ ] do the on-call rotations responsible for this service have access to this service?
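
When answering the error budget question above, it can help to write down the arithmetic. The following is a minimal sketch, assuming a purely time-based availability SLO over a rolling window. The function names, the 30-day window and the 99.9% target are illustrative assumptions and are not taken from any GitLab tooling.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the allowed downtime in minutes for an availability SLO.

    slo_target is a fraction, e.g. 0.999 for "99.9% available" (hypothetical value).
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    if budget == 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    # A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # After 10 minutes of downtime, about 77% of the budget is left.
    print(round(budget_remaining(0.999, 10.0), 2))
```

A request-based SLO (good requests divided by total requests) would use the same structure with request counts in place of minutes.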
## Durability
* [ ] is there a runbook section covering durability considerations? (it should
cover the following topics, provide link)
* [ ] possible failure modes and how to recover from them
* [ ] deletion by accident
* [ ] disk failure
* [ ] data corruption
* [ ] GCP outage
* ...
* [ ] is there an existing Recovery Point Objective (RPO) documented? How do
we plan to achieve it?
* Backups
* [ ] are we testing backup replay?
* [ ] are we monitoring backups?
* [ ] what is the backup retention policy?
* [ ] are backups in a different logical and physical environment?
## Security/Compliance
* [ ] is there a runbook section covering security considerations? (it should
cover the following topics, provide link)
* [ ] list of access roles
* [ ] Who has which role?
* [ ] How do we protect access?
* [ ] Auditability of access
* [ ] Which entrypoints need protection?
* [ ] How are we applying security updates? (OS and service)
* [ ] Regulations/Policies applying? (PII, SOX, ...)
* [ ] how do we protect customer data?
* [ ] encryption at rest?
* [ ] could customer data leak in logs?
* [ ] how long do we keep logs?
* [ ] is someone from security included for the readiness review?
## Monitoring
* [ ] is there a runbook section covering monitoring? (it should
cover the following topics, provide link)
* [ ] list key SLIs. Are we monitoring them?
* [ ] list SLOs. Are we monitoring/alerting on them?
* [ ] list of relevant alerts
* [ ] are alerts actionable and linking to a runbook?
* [ ] do we have a metrics catalog entry for the service? (provide link)
* [ ] list of relevant dashboards
* [ ] list of relevant logs