
Istio Ingress Production Readiness

Marcel Chacon requested to merge mchacon-istio-ingress-readiness into master

The Readiness Review document is designed to help you prepare your features and services for the GitLab Production Platforms. Please engage with the relevant teams as soon as possible to begin review even if there are incomplete items below. All sections should be completed up to the current maturity level. For example, if the target maturity is "Beta", then items under "Experiment" and "Beta" should be completed.

While you are encouraged to fill out as much of this document as possible, not all of the items below will be relevant. Leave all non-applicable items intact and add 'N/A' or the reason in place of the response. This Guide is just that, a Guide. If something is not asked, but should be, it is strongly encouraged to add it as necessary.

Experiment

Service Catalog

The items below will be reviewed by the Reliability team.

  • Link to the service catalog entry for the service. Ensure that the following items are present in the service catalog, or listed here:
    • Link to or provide a high-level summary of this new product feature.
    • Link to the Architecture Design Workflow for this feature. If no design was completed for this feature, please explain why.
    • List the feature group that created this feature/service and who are the current Engineering Managers, Product Managers and their Directors.
    • List the individuals who are the subject matter experts and know the most about this feature.
    • List the team or set of individuals who will take responsibility for the reliability of the feature once it is in production.
    • List the member(s) of the team who built the feature and will be on call for the launch.
    • List the external and internal dependencies of the application (e.g., Redis, Postgres) for this feature and how the service will be impacted by a failure of each dependency.

Infrastructure

The items below will be reviewed by the Reliability team.

  • Do we use IaC (e.g., Terraform) for all the infrastructure related to this feature? If not, what kind of resources are not covered?
  • Is the service covered by any DDoS protection solution (GCP/AWS load-balancers or Cloudflare usually cover this)?
  • Are all cloud infrastructure resources labeled according to the Infrastructure Labels and Tags guidelines?

Operational Risk

The items below will be reviewed by the Reliability team.

  • List the top three operational risks when this feature goes live.
  • For each component and dependency, what is the blast radius of failures? Is there anything in the feature design that will reduce this risk?

Monitoring and Alerting

The items below will be reviewed by the Reliability team.

Deployment

The items below will be reviewed by the Delivery team.

  • Will a change management issue be used for rollout? If so, link to it here.
  • Can the new product feature be safely rolled back once it is live? Can it be disabled using a feature flag?
  • How are the artifacts being built for this feature (e.g., using the CNG or another image building pipeline)?

Security Considerations

The items below will be reviewed by the Infrasec team.

  • Link or list information for new resources of the following type:
    • AWS Accounts/GCP Projects:
    • New Subnets:
    • VPC/Network Peering:
    • DNS names:
    • Entry-points exposed to the internet (Public IPs, Load-Balancers, Buckets, etc...):
    • Other (anything relevant that might be worth mentioning):
  • Were the GitLab security development guidelines followed for this feature?
  • Was an Application Security Review requested, if appropriate? Link it here.
  • Do we have an automatic procedure to update the infrastructure (OS, container images, packages, etc.), for example unattended upgrades or Renovate bot to keep dependencies up to date?
  • For IaC (e.g., Terraform), are any secure static code analysis tools (such as kics or checkov) in use? If not and new IaC is being introduced, please explain why.
  • If we're creating new containers (e.g., a Dockerfile with an image build pipeline), are we using kics or checkov to scan Dockerfiles or GitLab's container scanner for vulnerabilities?

Identity and Access Management

The items below will be reviewed by the Infrasec team.

  • Are we adding any new forms of Authentication (New service-accounts, users/password for storage, OIDC, etc...)?
  • Was effort put in to ensure that the new service follows the least privilege principle, so that permissions are reduced as much as possible?
  • Do firewalls follow the least privilege principle (w/ network policies in Kubernetes or firewalls on cloud provider)?
  • Is the service covered by a WAF (Web Application Firewall) in Cloudflare?

Logging, Audit and Data Access

The items below will be reviewed by the Infrasec team.

  • Did we make an effort to redact customer data from logs?
  • What kind of data is stored on each system (secrets, customer data, audit, etc...)?
  • How is data rated according to our data classification standard (customer data is RED)?
  • Do we have audit logs for when data is accessed? If you are unsure or if using Reliability's central logging and a new pubsub topic was created, create an issue in the Security Logging Project using the add-remove-change-log-source template.
  • Ensure appropriate logs are being kept for compliance and requirements for retention are met.
  • If the data classification = Red for the new environment, please create a Security Compliance Intake issue. Note this is not necessary if the service is deployed in existing Production infrastructure.

Beta

Monitoring and Alerting

The items below will be reviewed by the Reliability team.

Backup, Restore, DR and Retention

The items below will be reviewed by the Reliability team.

  • Are there custom backup/restore requirements?
  • Are backups monitored?
  • Was a restore from backup tested?
  • Link to information about growth rate of stored data.

Deployment

The items below will be reviewed by the Delivery team.

  • Will a change management issue be used for rollout? If so, link to it here.
  • Does this feature have any version compatibility requirements with other components (e.g., Gitaly, Sidekiq, Rails) that will require a specific order of deployments?
  • Is this feature validated by our QA blackbox tests?
  • Will it be possible to roll back this feature? If so, explain how.

Security

The items below will be reviewed by the InfraSec team.

  • Put yourself in an attacker's shoes and list some examples of "What could possibly go wrong?". Are you OK going into Beta knowing that?
  • Link to any outstanding security-related epics & issues for this feature. Are you OK going into Beta with those still on the TODO list?

General Availability

Monitoring and Alerting

The items below will be reviewed by the Reliability team.

  • Link to the troubleshooting runbooks.
  • Link to an example of an alert and a corresponding runbook.
  • Confirm that on-call engineers have access to this service.

Operational Risk

The items below will be reviewed by the Reliability team.

  • Link to notes or testing results for assessing the outcome of failures of individual components.
  • What are the potential scalability or performance issues that may result from this change?
  • What are a few operational concerns that will not be present at launch, but may be a concern later?
  • Are there any single points of failure in the design? If so, list them here.
  • As a thought experiment, think of worst-case failure scenarios for this product feature. How can the blast radius of each failure be isolated?

Backup, Restore, DR and Retention

The items below will be reviewed by the Reliability team.

  • Are there any special requirements for Disaster Recovery for both Regional and Zone failures beyond our current Disaster Recovery processes that are in place?
  • How does data age? Can data over a certain age be deleted?

Performance, Scalability and Capacity Planning

The items below will be reviewed by the Reliability team.

  • Link to any performance validation that was done according to performance guidelines.
  • Link to any load testing plans and results.
  • Are there any potential performance impacts on the Postgres database or Redis when this feature is enabled at GitLab.com scale?
  • Explain how this feature uses our rate limiting features.
  • Are there retry and back-off strategies for external dependencies?
  • Does the feature account for brief spikes in traffic, at least 2x above the expected rate?
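
When answering the retry and back-off question above, it can help to state the strategy concretely. The following is only an illustrative sketch of capped exponential back-off with full jitter; it is not taken from this service's code. The names `backoff_delays` and `call_with_retry`, and the default values for `base`, `cap` and `max_attempts`, are hypothetical and chosen purely for illustration.

```python
import random
import time


def backoff_delays(max_attempts, base=0.5, cap=30.0, rng=random.random):
    """Return the wait (in seconds) before each retry.

    Capped exponential back-off with full jitter: the wait before retry n
    is a random fraction of min(cap, base * 2**n). All defaults are
    illustrative assumptions, not recommended production values.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    return [rng() * min(cap, base * (2 ** n)) for n in range(max_attempts - 1)]


def call_with_retry(fn, max_attempts=5, base=0.5, cap=30.0,
                    sleep=time.sleep, rng=random.random):
    """Call fn(); on any exception, wait and retry up to max_attempts times.

    The exception from the final attempt is re-raised to the caller.
    """
    delays = backoff_delays(max_attempts, base, cap, rng)
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(delays[attempt])
```

The `sleep` and `rng` parameters are injectable so the behaviour can be checked in tests without real waiting. In practice the caught exception type would be narrowed to errors that are actually safe to retry.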

Deployment

The items below will be reviewed by the Delivery team.

  • Will a change management issue be used for rollout? If so, link to it here.
  • Are there healthchecks or SLIs that can be relied on for deployment/rollbacks?
  • Does building artifacts or deployment depend at all on gitlab.com?

Related Issue: #87 (closed)

Edited by Marcel Chacon
