starting updated Thanos Readiness

Dave Smith requested to merge reliability/update-thanos-readiness into master

What

Updating the Thanos readiness review per #64 (closed)

Why

With &813 (closed), the ~"team::Observability" has brought some changes to Thanos that warrant an updated readiness review.

Checklist from the MR template in issues/64:

While filling out this document is encouraged, not all of the items below will be relevant. Leave all non-applicable items intact and explain why they do not apply in place of the response. This Guide is just that, a Guide. If something is not asked but should be, you are strongly encouraged to add it.

Summary

  • Provide a high level summary of this new product feature. Explain how this change will benefit GitLab customers. Enumerate the customer use-cases.

  • What metrics, including business metrics, should be monitored to ensure this feature launch will be a success?

Architecture

  • Add architecture diagrams to this issue of feature components and how they interact with existing GitLab components. Make sure to include the following: Internal dependencies, ports, encryption, protocols, security policies, etc.

  • Describe each component of the new feature and enumerate what it does to support customer use cases.

  • For each component and dependency, what is the blast radius of failures? Is there anything in the feature design that will reduce this risk?

  • If applicable, explain how this new feature will scale and any potential single points of failure in the design.

Operational Risk Assessment

  • What are the potential scalability or performance issues that may result with this change?

  • List the external and internal dependencies of the application (e.g., Redis, Postgres) for this feature and how the service will be impacted by a failure of each dependency.

  • Were there any features cut or compromises made to make the feature launch?

  • List the top three operational risks when this feature goes live.

  • What are a few operational concerns that will not be present at launch, but may be a concern later?

  • Can the new product feature be safely rolled back once it is live? Can it be disabled using a feature flag?

  • Document every way the customer will interact with this new feature and how customers will be impacted by a failure of each interaction.

  • As a thought experiment, think of worst-case failure scenarios for this product feature. How can the blast radius of each failure be isolated?

Database

  • If we use a database, is the data structure verified and vetted by the database team?

  • Do we have an approximate growth rate of the stored data (for capacity planning)?

  • Can we age data and delete data of a certain age?

Security and Compliance

  • Are we adding any new resources of the following type? (If yes, please list them here or link to a place where they are listed)

    • AWS Accounts/GCP Projects

    • New Subnets

    • VPC/Network Peering

    • DNS names

    • Entry-points exposed to the internet (Public IPs, Load-Balancers, Buckets, etc...)

    • Other (anything relevant that might be worth mention)

  • Secure Software Development Life Cycle (SSDLC)

    • Is the configuration following a security standard? (CIS is a good baseline for example)

    • Are all cloud infrastructure resources labeled according to the Infrastructure Labels and Tags guidelines?

    • Were the GitLab security development guidelines followed for this feature?

    • Do we have an automatic procedure to update the infrastructure (OS, container images, packages, etc.)?

    • Do we use IaC (Terraform) for all the infrastructure related to this feature? If not, what kind of resources are not covered?

      • Do we have secure static code analysis tools (kics or checkov) covering this feature's Terraform?

    • If there's a new Terraform state:

      • Where is the Terraform state stored, and who has access to it?

    • Does this feature add secrets to the Terraform state? If yes, can they be stored in a secrets manager?

    • If we're creating new containers:

      • Are we using a distroless base image?

      • Do we have security scanners covering these containers?

        • kics or checkov for Dockerfiles for example

        • GitLab's container scanner for vulnerabilities

  • Identity and Access Management

    • Are we adding any new forms of Authentication (New service-accounts, users/password for storage, OIDC, etc...)?

    • Does it follow the least privilege principle?

  • If we are adding any new Data Storage (Databases, buckets, etc...)

    • What kind of data is stored on each system? (secrets, customer data, audit, etc...)

    • How is data rated according to our data classification standard? (Customer data is RED.)

    • Is the data encrypted at rest? (If the storage is provided by a GCP service, the answer is most likely yes)

    • Do we have audit logs on data access?

  • Network security (encryption and ports should be clear in the architecture diagram above)

    • Do firewalls follow the least privilege principle (with network policies in Kubernetes or firewalls on the cloud provider)?

    • Is the service covered by any DDoS protection solution? (GCP/AWS load-balancers or Cloudflare usually cover this)

    • Is the service covered by a WAF (Web Application Firewall)?

  • Logging & Audit

    • Has effort been made to obscure or elide sensitive customer data in logging?

  • Compliance

    • Is the service subject to any regulatory/compliance standards? If so, detail which and provide details on applicable controls, management processes, additional monitoring, and mitigating factors.

Performance

  • Explain what validation was done following GitLab's performance guidelines. Please explain which tools were used and link to the results below.

  • Are there any potential performance impacts on the database when this feature is enabled at GitLab.com scale?

  • Are there any throttling limits imposed by this feature? If so how are they managed?

  • If there are throttling limits, what is the customer experience of hitting a limit?

  • For all dependencies external and internal to the application, are there retry and back-off strategies for them?

  • Does the feature account for brief spikes in traffic, at least 2x above the expected TPS?

Backup and Restore

  • Outside of existing backups, is there any other customer data that needs to be backed up for this product feature?

  • Are backups monitored?

  • Was a restore from backup tested?

Monitoring and Alerts

  • Is the service logging in JSON format and are logs forwarded to logstash?

  • Is the service reporting metrics to Prometheus?

  • How is the end-to-end customer experience measured?

  • Do we have a target SLA in place for this service?

  • Do we know which indicators (SLIs) map to the target SLA?

  • Do we have alerts that are triggered when the SLIs (and thus the SLA) are not met?

  • Do we have troubleshooting runbooks linked to these alerts?

  • What are the thresholds for tweeting or issuing an official customer notification for an outage related to this feature?

  • Do the on-call rotations responsible for this service have access to this service?

Responsibility

  • Which individuals are the subject matter experts and know the most about this feature?

  • Which team or set of individuals will take responsibility for the reliability of the feature once it is in production?

  • Is someone from the team who built the feature on call for the launch? If not, why not?

Testing

  • Describe the load test plan used for this feature. What breaking points were validated?

  • For the component failures that were theorized for this feature, were they tested? If so include the results of these failure tests.

  • Give a brief overview of what tests are run automatically in GitLab's CI/CD pipeline for this feature.

Edited by Dave Smith
