
Cloud Seed Production Readiness Review

Sri Rang requested to merge sri19-master-patch-56601 into master

Cloud Seed Introduction

What is Cloud Seed?

Cloud Seed is an incubation project that allows GitLab.com customers to connect their own Google Cloud accounts and consume GCP services. As of now, after authorization, users can generate service accounts, enable Cloud Run, and configure deployment pipelines from GitLab to Cloud Run.

Impact on gitlab-rails

The following source code files were introduced/modified in this feature:

Frontend

  • app/assets/javascripts/google_cloud/*
  • app/views/projects/google_cloud/*

Backend

  • config/routes/project.rb Lines 317 to 326
  • lib/sidebars/projects/menus/infrastructure_menu.rb Lines 92 to 106
  • lib/google_api/cloud_platform/client.rb Lines 91 to 153
  • app/services/google_cloud/*
  • app/controllers/projects/google_cloud_controller.rb
  • app/controllers/projects/google_cloud/*

Specs and tests were added for these in the appropriate locations via the following MRs:

Related issues

Sisense Metrics Dashboard

https://app.periscopedata.com/app/gitlab/1021503/Cloud-Seed---Monthly-Usage

Summary

  • Provide a high level summary of this new product feature. Explain how this change will benefit GitLab customers. Enumerate the customer use-cases.

See above

  • What metrics, including business metrics, should be monitored to ensure that this feature launch will be a success?

See dashboard above

Architecture


Standard Rails application components:

  • Controllers
  • Services
  • Frontend Vue components

These are introduced at the GitLab -> Group -> Project level.

Zero new dependencies.

The feature set is captured in videos at https://about.gitlab.com/handbook/engineering/incubation/cloud-seed/, which explain how an end user may use GitLab.com to deploy to their Google Cloud account.

  • Add architecture diagrams to this issue of feature components and how they interact with existing GitLab components. Include internal dependencies, ports, security policies, etc.

Standard Rails components: Controllers, Services, Vue.js Frontend Components

  • Describe each component of the new feature and enumerate what it does to support customer use cases.

The components enable users to provision and deploy to their Google Cloud accounts via the GitLab -> Project -> Infrastructure section.

  • For each component and dependency, what is the blast radius of failures? Is there anything in the feature design that will reduce this risk?

These are standard Rails controllers, which could be targeted by DDoS or injection attacks: the typical web application risk profile.

  • If applicable, explain how this new feature will scale and any potential single points of failure in the design.

The risk is that Google APIs are down, which will result in errors being raised in our app and the appropriate Rails views being rendered.

Nothing special, standard web application scenarios.
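The failure path described above, where a Google API error surfaces as a rendered view rather than a crash, can be sketched in plain Ruby. This is an illustration only: `GoogleApiUnavailable`, `FakeGoogleClient`, and `render_service_accounts` are hypothetical names and are not the classes in app/controllers/projects/google_cloud/*.

```ruby
# Hypothetical error type standing in for whatever the Google API client raises.
class GoogleApiUnavailable < StandardError; end

# Hypothetical client stub; `healthy: false` simulates a Google outage.
class FakeGoogleClient
  def initialize(healthy:)
    @healthy = healthy
  end

  def list_service_accounts
    raise GoogleApiUnavailable, "Google Cloud API did not respond" unless @healthy

    ["deployer@example-project.iam.gserviceaccount.com"]
  end
end

# Returns a hash describing which view to render, mimicking a controller action.
def render_service_accounts(client)
  { view: :index, accounts: client.list_service_accounts }
rescue GoogleApiUnavailable => e
  # The failure stays scoped to this request: the user sees an error view.
  { view: :error, message: e.message }
end
```

In this sketch the outage is contained to the single request that touched Google, which matches the limited blast radius claimed above.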

Operational Risk Assessment

  • What are the potential scalability or performance issues that may result with this change?

Google Cloud APIs are being used, which could be slow or unresponsive when Google's services are degraded or down.

  • List the external and internal dependencies to the application (ex: redis, postgres, etc) for this feature and how it will be impacted by a failure of that dependency.

No external dependencies were added. The feature uses Google API clients that are already depended upon, and no new infra resources are used.

  • Were there any features cut or compromises made to make the feature launch?

No.

  • List the top three operational risks when this feature goes live.
  1. Google APIs are down.
  2. Users generate deployment credentials for their Google Cloud accounts (Service Accounts) but manage secrets poorly, leading to leaked credentials.

  • What are a few operational concerns that will not be present at launch, but may be a concern later?

None

  • Can the new product feature be safely rolled back once it is live, can it be disabled using a feature flag?

Yes. The feature flag is incubation_5mp_google_cloud.
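How a flag-gated rollback works can be sketched as follows. This is a minimal sketch under stated assumptions: `FlagStore` is a hypothetical in-memory stand-in, not GitLab's feature flag backend, and the CRC32 bucketing is just one common way to give each project a stable rollout bucket.

```ruby
require "zlib"

# Hypothetical in-memory flag store; GitLab's real feature flag backend differs.
class FlagStore
  def initialize
    @percentages = Hash.new(0) # flag name => rollout percentage (0..100)
  end

  def set_percentage(flag, percentage)
    @percentages[flag] = percentage.clamp(0, 100)
  end

  # An actor (here a project ID) maps to a stable bucket in 0..99, so the
  # same project gets the same answer until the percentage changes.
  def enabled?(flag, actor_id)
    bucket = Zlib.crc32("#{flag}:#{actor_id}") % 100
    bucket < @percentages[flag]
  end
end

flags = FlagStore.new
flags.set_percentage(:incubation_5mp_google_cloud, 100)
flags.enabled?(:incubation_5mp_google_cloud, 42) # enabled for every project

# Rolling back: a percentage of 0 disables the feature for every project.
flags.set_percentage(:incubation_5mp_google_cloud, 0)
flags.enabled?(:incubation_5mp_google_cloud, 42) # disabled for every project
```

Because the check is evaluated per request, a rollback in this model needs no deploy: lowering the percentage is enough.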

  • Document every way the customer will interact with this new feature and how customers will be impacted by a failure of each interaction.

Done, see Docs link above.

  • As a thought experiment, think of worst-case failure scenarios for this product feature, how can the blast-radius of the failure be isolated?

GitLab.com secret CI vars get leaked en masse, which leads to users' Google Cloud Service Accounts getting leaked.

Our only recourse is to invalidate those Service Accounts from our CI Vars and ask users to destroy their Service Accounts.

Typical leaked credentials scenario.

Database

  • If we use a database, is the data structure verified and vetted by the database team?
  • Do we have an approximate growth rate of the stored data (for capacity planning)?
  • Can we age data and delete data of a certain age?

Not applicable.

Security and Compliance

  • Were the GitLab security development guidelines followed for this feature? Yes
  • If this feature requires new infrastructure, will it be updated regularly with OS updates? No new infra used.
  • Has effort been made to obscure or elide sensitive customer data in logging? No sensitive PII being logged.
  • Is any potentially sensitive user-provided data persisted? If so is this data encrypted at rest? User-generated Google Service Accounts are stored in project CI vars. See description in the previous sections.
  • Is the service subject to any regulatory/compliance standards? If so, detail which and provide details on applicable controls, management processes, additional monitoring, and mitigating factors. Not to the best of my knowledge.

Performance

  • Explain what validation was done following GitLab's performance guidelines. Please explain or link to the results below. For a feature that needs no new infrastructure, does not alter the database schema, and introduces no new dependencies, the typical recurring performance tests by Quality are sufficient.
  • Are there any potential performance impacts on the database when this feature is enabled at GitLab.com scale? No, because the database is not altered. The only persistence that occurs uses the project CI vars APIs.
  • Are there any throttling limits imposed by this feature? If so how are they managed? No.
  • If there are throttling limits, what is the customer experience of hitting a limit? No.
  • For all dependencies external and internal to the application, are there retry and back-off strategies for them? Not yet. A worker is being developed that will retry calls to Google APIs for async operations (for example, provisioning a database instance takes ~5 minutes).
  • Does the feature account for brief spikes in traffic, at least 2x above the expected TPS? These are standard Rails controllers, so the built-in spike handling applies. Nothing specific was built here, but the rollout percentage of the feature flag incubation_5mp_google_cloud can be tweaked.
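The retry and back-off strategy mentioned in the answers above is not implemented yet, so the following is only a sketch of what such a strategy could look like, not the planned worker. `TransientApiError` and `with_backoff` are hypothetical names, and the delay schedule (1s, 2s, 4s, ...) is an assumption.

```ruby
# Hypothetical transient error; a real client would raise its own error classes.
class TransientApiError < StandardError; end

# Calls the block, retrying on TransientApiError with exponential back-off.
# `sleeper` is injectable so the example can run without actually waiting.
def with_backoff(max_attempts: 5, base_delay: 1.0, sleeper: ->(s) { sleep(s) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue TransientApiError
    # Give up and re-raise once the attempt budget is exhausted.
    raise if attempt >= max_attempts

    sleeper.call(base_delay * (2**(attempt - 1))) # 1s, 2s, 4s, ...
    retry
  end
end

# Usage: the operation fails twice and succeeds on the third attempt.
delays = []
result = with_backoff(sleeper: ->(s) { delays << s }) do |attempt|
  raise TransientApiError, "still provisioning" if attempt < 3

  "instance ready"
end
```

Capping the number of attempts keeps a prolonged Google outage from tying up a worker indefinitely; the final error is re-raised so the caller can report it.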

Backup and Restore

  • Outside of existing backups, are there any other customer data that needs to be backed up for this product feature? No
  • Are backups monitored? n/a
  • Was a restore from backup tested? n/a

Monitoring and Alerts

  • Is the service logging in JSON format and are logs forwarded to logstash? Snowplow
  • Is the service reporting metrics to Prometheus? Snowplow
  • How is the end-to-end customer experience measured? Snowplow
  • Do we have a target SLA in place for this service? n/a. This is an incubation / experimental feature, and the user is warned that it is experimental.
  • Do we know what the indicators (SLI) are that map to the target SLA? n/a.
  • Do we have alerts that are triggered when the SLI's (and thus the SLA) are not met? n/a.
  • Do we have troubleshooting runbooks linked to these alerts? n/a.
  • What are the thresholds for tweeting or issuing an official customer notification for an outage related to this feature? None. The user is constantly warned that this is an experimental / incubation feature.
  • Do the on-call rotations responsible for this service have access to this service? Yes

Responsibility

  • Which individuals are the subject matter experts and know the most about this feature? @sri19, @bmarnane
  • Which team or set of individuals will take responsibility for the reliability of the feature once it is in production? @sri19, @bmarnane
  • Is someone from the team who built the feature on call for the launch? If not, why not? @sri19, @bmarnane

Testing

  • Describe the load test plan used for this feature. What breaking points were validated? Standard Rails load tests. No extra infra, dependency, db migration or library introduced.
  • For the component failures that were theorized for this feature, were they tested? If so include the results of these failure tests. Yes, see links to specs above.
  • Give a brief overview of what tests are run automatically in GitLab's CI/CD pipeline for this feature? Thorough test suite included, see specs above.
