# Readiness Review: Migrate Thanos to Helm
## Production Readiness
For any new feature or service in production, or changes to an existing one, the questions in this guide will help make those changes more robust when they are enabled on GitLab.com.
Before starting, please review the Production Readiness Review document in the handbook.
This issue serves as a tracking issue to guide you through the readiness review. It's not the production readiness document itself!
The readiness documentation will be added to the project with a merge request, where different interested parties can collaborate.
## Readiness MR
!151 (merged) - Draft: starting updated Thanos Readiness
## Reviewers
The reviewers should be filled in as one of the steps of the checklist below. If a reviewer in the "Mandatory" section is not allocated, please add the reason why next to the name.
If you are unsure who should be assigned as a reviewer, please reach out to any Infrastructure Engineering Manager for assistance.
To have a reviewer assigned from the InfraSec team, please create an issue in the issue tracker dedicated to Business as Usual (BAU). This will help the team triage the review and start working on it. More information is available on the team's handbook page. After the issue is created, put a link to it next to the InfraSec reviewer item below, and add the reviewer's name after one has been assigned.

The reviewer will check the box next to their name when the review is complete.
### Mandatory

### Optional

Delete these reviewers if they do not apply.

@dawsmith: TBD - we may want to have some of these team members review or just see a demo.

- [ ] Development: reviewer name
- [ ] Scalability: reviewer name
- [ ] Database: reviewer name
## Readiness Checklist
The following items should be completed by the person initiating the readiness review:

- [ ] Create this issue and assign it to yourself. Set a due date for when you believe the readiness review will be completed (this can be updated later if necessary).
- [ ] Review the Production Readiness Review handbook page.
- [ ] In the "Reviewers" section above, add the reviewer names. Names are assigned by reaching out to the engineering manager of the corresponding team.
- [ ] Create the first draft of the readiness review by copying the template below and submitting an MR, then add the ~"workflow-infra::In Progress" label to this issue.
- [ ] Add a link to the MR in the "Readiness MR" section at the top of this issue.
- [ ] Assign the initial set of reviewers to the MR. There can be multiple iterations of the MR if needed; it is often helpful to have the first draft reviewed by members of the same team. Approval of the MR does not mean the readiness document is approved; approvals are recorded later on this issue.
- [ ] When the last review of the MR is complete, ask the reviewers in the "Reviewers" section above to check the box next to their name if they are satisfied with the review and have no more questions or concerns.
- [ ] (Optional) If it is later decided not to proceed with this proposal, add the ~"workflow-infra::Cancelled" label and close this issue.
- [ ] When all boxes have been checked in the "Reviewers" section, add the ~"workflow-infra::Done" label and close the issue.
## Readiness MR Template
Expand the section below to view the readiness template; it is the starting point for the readiness merge request.

Create `<name>/index.md` as a new merge request with the following content, where `<name>` is something short and descriptive for the change being proposed.

While it is encouraged for parts of this document to be filled out, not all of the items below will be relevant. Leave all non-applicable items intact and add the reason why in place of the response. This guide is just that, a guide: if something is not asked but should be, you are strongly encouraged to add it.
### Summary

- Provide a high-level summary of this new product feature. Explain how this change will benefit GitLab customers. Enumerate the customer use cases.
- What metrics, including business metrics, should be monitored to ensure this feature launch will be a success?
### Architecture

- Add architecture diagrams to this issue showing the feature components and how they interact with existing GitLab components. Make sure to include the following: internal dependencies, ports, encryption, protocols, security policies, etc.
- Describe each component of the new feature and enumerate what it does to support customer use cases.
- For each component and dependency, what is the blast radius of failures? Is there anything in the feature design that will reduce this risk?
- If applicable, explain how this new feature will scale and any potential single points of failure in the design.
### Operational Risk Assessment

- What are the potential scalability or performance issues that may result from this change?
- List the external and internal dependencies of the application (e.g. Redis, Postgres) for this feature and how the service will be impacted by a failure of each dependency.
- Were any features cut or compromises made to make the feature launch?
- List the top three operational risks when this feature goes live.
- What are a few operational concerns that will not be present at launch but may become a concern later?
- Can the new product feature be safely rolled back once it is live? Can it be disabled using a feature flag?
- Document every way the customer will interact with this new feature and how customers will be impacted by a failure of each interaction.
- As a thought experiment, think of worst-case failure scenarios for this product feature. How can the blast radius of the failure be isolated?
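On the rollback question above, one common pattern is to gate the new code path behind a feature flag that defaults to the old, safe behaviour, so rollback is a flag flip rather than a redeploy. A minimal sketch, with an invented in-memory flag store (not GitLab's actual feature-flag API, and flag/function names are illustrative):

```python
# Illustrative kill-switch: the new code path only runs when the flag is on,
# so disabling the flag instantly restores the legacy behaviour.
FLAGS = {"thanos_helm_chart": False}  # default off = old behaviour

def flag_enabled(name: str) -> bool:
    # Unknown flags are treated as off, which is the safe default.
    return FLAGS.get(name, False)

def query_backend() -> str:
    if flag_enabled("thanos_helm_chart"):
        return "helm-managed thanos"
    return "legacy thanos"
```

The important property is that the default (flag absent or off) is the pre-change behaviour, so a partial rollout can be reversed without touching the deployment.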
### Database

- If we use a database, is the data structure verified and vetted by the database team?
- Do we have an approximate growth rate of the stored data (for capacity planning)?
- Can we age data and delete data of a certain age?
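The data-aging question above usually reduces to a periodic sweep that selects records older than a retention window. A minimal sketch, assuming an example 90-day window (not a policy stated in this document):

```python
from datetime import datetime, timedelta, timezone

# Example retention window; the real value would come from the data
# classification and capacity-planning answers above.
RETENTION = timedelta(days=90)

def expired(records: dict[str, datetime], now: datetime) -> list[str]:
    """Return the keys of records whose timestamp is past the retention window."""
    cutoff = now - RETENTION
    return sorted(key for key, ts in records.items() if ts < cutoff)
```

In practice the same cutoff computation would drive a batched `DELETE` or an object-lifecycle rule rather than an in-memory scan.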
### Security and Compliance

- Are we adding any new resources of the following types? (If yes, please list them here or link to a place where they are listed.)
  - AWS accounts / GCP projects
  - New subnets
  - VPC / network peering
  - DNS names
  - Entry points exposed to the internet (public IPs, load balancers, buckets, etc.)
  - Other (anything relevant that might be worth mentioning)

#### Secure Software Development Life Cycle (SSDLC)

- Is the configuration following a security standard? (CIS is a good baseline, for example.)
- Are all cloud infrastructure resources labeled according to the Infrastructure Labels and Tags guidelines?
- Were the GitLab security development guidelines followed for this feature?
- Do we have an automatic procedure to update the infrastructure (OS, container images, packages, etc.)?
- Do we use IaC (Terraform) for all the infrastructure related to this feature? If not, what kind of resources are not covered?
- If there's a new Terraform state:
  - Where is the Terraform state stored, and who has access to it?
  - Does this feature add secrets to the Terraform state? If yes, can they be stored in a secrets manager?
- If we're creating new containers:
  - Are we using a distroless base image?
  - Do we have security scanners covering these containers? (kics or checkov for Dockerfiles, for example, and GitLab's container scanner for vulnerabilities.)
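The labelling question above is often enforced with a small policy check that fails CI when a resource is missing required label keys. A toy sketch; the required keys here are invented examples, not the actual Infrastructure Labels and Tags guidelines:

```python
# Hypothetical required label keys for illustration only.
REQUIRED_LABELS = {"team", "environment", "service"}

def missing_labels(resources: dict[str, dict[str, str]]) -> dict[str, set[str]]:
    """Map resource name -> required label keys it is missing.

    An empty result means every resource is compliant.
    """
    gaps = {}
    for name, labels in resources.items():
        absent = REQUIRED_LABELS - labels.keys()
        if absent:
            gaps[name] = absent
    return gaps
```

A real implementation would read the resource inventory from Terraform plan output or a cloud asset inventory rather than a literal dict.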
#### Identity and Access Management

- Are we adding any new forms of authentication (new service accounts, users/passwords for storage, OIDC, etc.)?
- Does it follow the least-privilege principle?
- If we are adding any new data storage (databases, buckets, etc.):
  - What kind of data is stored on each system? (secrets, customer data, audit, etc.)
  - How is the data rated according to our data classification standard? (Customer data is RED.)
  - Is the data encrypted at rest? (If the storage is provided by a GCP service, the answer is most likely yes.)
  - Do we have audit logs on data access?
- Network security (encryption and ports should be clear in the architecture diagram above):
  - Do firewalls follow the least-privilege principle (with network policies in Kubernetes or firewalls on the cloud provider)?
  - Is the service covered by any DDoS protection solution? (GCP/AWS load balancers or Cloudflare usually cover this.)
  - Is the service covered by a WAF (Web Application Firewall)?
#### Logging & Audit

- Has effort been made to obscure or elide sensitive customer data in logging?
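One way to answer the question above is to redact known-sensitive fields at the formatter, so secrets never reach the log stream. A minimal sketch using the standard library; the field names in `SENSITIVE` are illustrative:

```python
import json
import logging

# Example sensitive keys; a real deny-list would come from the data
# classification review above.
SENSITIVE = {"password", "token", "email"}

class RedactingJsonFormatter(logging.Formatter):
    """Emit one JSON object per record, masking sensitive structured fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"severity": record.levelname, "message": record.getMessage()}
        # Structured fields are attached to the record as a `fields` dict.
        for key, value in getattr(record, "fields", {}).items():
            payload[key] = "[REDACTED]" if key in SENSITIVE else value
        return json.dumps(payload)
```

Attaching the formatter to a handler (`handler.setFormatter(RedactingJsonFormatter())`) then gives Logstash-friendly JSON with secrets masked at the source.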
#### Compliance

- Is the service subject to any regulatory/compliance standards? If so, detail which, and provide details on applicable controls, management processes, additional monitoring, and mitigating factors.
### Performance

- Explain what validation was done following GitLab's performance guidelines. Please explain which tools were used and link to the results below.
- Are there any potential performance impacts on the database when this feature is enabled at GitLab.com scale?
- Are there any throttling limits imposed by this feature? If so, how are they managed?
- If there are throttling limits, what is the customer experience of hitting a limit?
- For all dependencies, external and internal to the application, are there retry and back-off strategies?
- Does the feature account for brief spikes in traffic, at least 2x above the expected TPS?
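For the retry question above, a common strategy is capped exponential back-off with full jitter, which spreads retries out and avoids thundering herds against a recovering dependency. A sketch with example delay parameters (not documented limits):

```python
import random
from typing import Optional

def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0,
                   rng: Optional[random.Random] = None) -> list[float]:
    """Return one delay (seconds) per retry attempt.

    Each delay is drawn uniformly from [0, min(cap, base * 2**i)]
    ("full jitter"), so concurrent clients do not retry in lock-step.
    """
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap, base * 2 ** i)) for i in range(attempts)]
```

A caller would sleep for `delays[i]` before attempt `i + 1`, and give up (or trip a circuit breaker) once the list is exhausted.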
### Backup and Restore

- Outside of existing backups, is there any other customer data that needs to be backed up for this product feature?
- Are backups monitored?
- Was a restore from backup tested?
### Monitoring and Alerts

- Is the service logging in JSON format, and are logs forwarded to Logstash?
- Is the service reporting metrics to Prometheus?
- How is the end-to-end customer experience measured?
- Do we have a target SLA in place for this service?
- Do we know which indicators (SLIs) map to the target SLA?
- Do we have alerts that are triggered when the SLIs (and thus the SLA) are not met?
- Do we have troubleshooting runbooks linked to these alerts?
- What are the thresholds for tweeting or issuing an official customer notification for an outage related to this feature?
- Do the on-call rotations responsible for this service have access to this service?
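To make the SLI/SLA questions above concrete, an availability SLI is typically a success ratio, and alerting is often framed in terms of the error budget it leaves. A minimal sketch; the 99.9% target is an example, not a commitment from this document:

```python
def availability(total: int, failed: int) -> float:
    """Success ratio over a window; vacuously 1.0 with no traffic."""
    return 1.0 if total == 0 else (total - failed) / total

def error_budget_remaining(total: int, failed: int, slo: float = 0.999) -> float:
    """Fraction of the error budget left in the window.

    A negative result means the SLO for the window is already blown,
    which is a typical paging condition.
    """
    allowed = (1.0 - slo) * total  # failures the SLO permits in this window
    return 1.0 if allowed == 0 else (allowed - failed) / allowed
```

Burn-rate alerts then compare how fast this remaining budget is being consumed against the window length, rather than paging on every individual failure.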
### Responsibility

- Which individuals are the subject matter experts and know the most about this feature?
- Which team or set of individuals will take responsibility for the reliability of the feature once it is in production?
- Is someone from the team who built the feature on call for the launch? If not, why not?
### Testing

- Describe the load test plan used for this feature. What breaking points were validated?
- For the component failures that were theorized for this feature, were they tested? If so, include the results of these failure tests.
- Give a brief overview of what tests are run automatically in GitLab's CI/CD pipeline for this feature.