Dedicated service for Internal API

Production Readiness

For any new feature or service in production, or any change to an existing one, the questions in this guide will help to make the change more robust when it is enabled on GitLab.com.

Before starting, please review the Production Readiness Review document in the handbook.

This issue serves as a tracking issue to guide you through the readiness review. It is not the production readiness document itself.

The readiness documentation will be added to the project with a merge request, where different interested parties can collaborate.

Readiness MR

Dedicated service for Internal API (!149 - merged)

Reviewers

The reviewers should be filled in as one of the steps of the checklist below. If a reviewer in the "Mandatory" section is not allocated, please add the reason why next to the name.

If you are unsure who should be assigned as a reviewer, please reach out to any Infrastructure Engineering Manager for assistance.

To have a reviewer assigned from the InfraSec team, please create an issue in the issue tracker dedicated to Business as Usual (BAU). This will help the team to triage the review and start working on it. More information is available on the team's handbook page. After the issue is created, put a link to it next to the InfraSec reviewer item below, and add the reviewer's name once one has been assigned.

The reviewer will check the box next to their name when the review is complete.

Mandatory

Reliability: @jarv
Delivery: @skarbek
InfraSec: @mattmorrison

Readiness Checklist

The following items should be completed by the person initiating the readiness review:

Create this issue and assign it to yourself. Set a due-date for when you believe the readiness will be completed (this can be updated later if necessary).
Review the Production Readiness Review handbook page.
In the "Reviewers" section above, add the reviewer names. Names will be assigned by reaching out to the engineering manager of the corresponding team.
Create the first draft of the readiness review by copying the template below and submitting an MR, then add the label workflow-infra::In Progress to this issue.
Add a link to the MR in the "Readiness MR" section at the top of this issue.
Assign the initial set of reviewers to the MR. There can be multiple iterations of the MR if needed; it is often helpful to have the first draft reviewed by members of your own team. Approval of the MR does not mean the readiness document is approved. Approvals will be given later on this issue.
When the last review of the MR is complete, ask the reviewers in the "Reviewers" section above to check the box next to their name if they are satisfied with the review and have no more questions or concerns.
(Optional) If it is later decided not to proceed with this proposal, add the workflow-infra::Cancelled label and close this issue.

When all boxes have been checked in the "Reviewers" section, add the workflow-infra::Done label and close the issue.

Readiness MR Template

Expand the section below to view the readiness template. This will be the starting point for the readiness merge request.

Create <name>/index.md as a new merge request with the following content, where <name> is something short and descriptive for the change being proposed.

The Readiness Review document is designed to help you prepare your features and services for the GitLab Production Platforms. Please engage with the relevant teams as soon as possible to begin review even if there are incomplete items below. All sections should be completed up to the current maturity level. For example, if the target maturity is "Beta", then items under "Experiment" and "Beta" should be completed.

While it is encouraged for parts of this document to be filled out, not all of the items below will be relevant. Leave all non-applicable items intact and add 'N/A', or the reason why they do not apply, in place of the response. This Guide is just that, a Guide. If something is not asked, but should be, it is strongly encouraged to add it as necessary.

Experiment

Service Catalog

The items below will be reviewed by the Reliability team.

Link to the service catalog entry for the service. Ensure that the following items are present in the service catalog, or listed here:
- Link to or provide a high-level summary of this new product feature.
- Link to the Architecture Design Workflow for this feature. If no design was completed for this feature, please explain why.
- List the feature group that created this feature/service and the current Engineering Managers, Product Managers and their Directors.
- List the individuals who are the subject matter experts and know the most about this feature.
- List the team or set of individuals that will take responsibility for the reliability of the feature once it is in production.
- List the member(s) of the team who built the feature and will be on-call for the launch.
- List the external and internal dependencies of the application (e.g., Redis, Postgres) for this feature and how the service will be impacted by a failure of each dependency.

Infrastructure

The items below will be reviewed by the Reliability team.

Do we use IaC (e.g., Terraform) for all the infrastructure related to this feature? If not, what kind of resources are not covered?
Is the service covered by any DDoS protection solution (GCP/AWS load-balancers or Cloudflare usually cover this)?
Are all cloud infrastructure resources labeled according to the Infrastructure Labels and Tags guidelines?

Operational Risk

The items below will be reviewed by the Reliability team.

List the top three operational risks when this feature goes live.
For each component and dependency, what is the blast radius of failures? Is there anything in the feature design that will reduce this risk?

Monitoring and Alerting

The items below will be reviewed by the Reliability team.

Link to the metrics catalog for the service.

Deployment

The items below will be reviewed by the Delivery team.

Will a change management issue be used for rollout? If so, link to it here.
Can the new product feature be safely rolled back once it is live? Can it be disabled using a feature flag?
How are the artifacts being built for this feature (e.g., using the CNG or another image building pipeline)?

Security Considerations

The items below will be reviewed by the InfraSec team.

Link or list information for new resources of the following type:
- AWS Accounts/GCP Projects:
- New Subnets:
- VPC/Network Peering:
- DNS names:
- Entry-points exposed to the internet (Public IPs, Load-Balancers, Buckets, etc...):
- Other (anything relevant that might be worth mentioning):
Were the GitLab security development guidelines followed for this feature?
Was an Application Security Review requested, if appropriate? Link it here.
Do we have an automatic procedure to update the infrastructure (OS, container images, packages, etc.)? For example, using unattended upgrades or Renovate bot to keep dependencies up to date.
For IaC (e.g., Terraform), are any secure static code analysis tools (such as kics or checkov) in use? If not and new IaC is being introduced, please explain why.
If we're creating new containers (e.g., a Dockerfile with an image build pipeline), are we using kics or checkov to scan Dockerfiles or GitLab's container scanner for vulnerabilities?

Identity and Access Management

The items below will be reviewed by the InfraSec team.

Are we adding any new forms of Authentication (New service-accounts, users/password for storage, OIDC, etc...)?
Was effort put in to ensure that the new service follows the least privilege principle, so that permissions are reduced as much as possible?
Do firewalls follow the least privilege principle (with network policies in Kubernetes or firewalls on the cloud provider)?
Is the service covered by a WAF (Web Application Firewall) in Cloudflare?

Logging, Audit and Data Access

The items below will be reviewed by the InfraSec team.

Did we make an effort to redact customer data from logs?
What kind of data is stored on each system (secrets, customer data, audit, etc...)?
How is data rated according to our data classification standard (customer data is RED)?
Do we have audit logs for when data is accessed? If you are unsure or if using Reliability's central logging and a new pubsub topic was created, create an issue in the Security Logging Project using the add-remove-change-log-source template.
Ensure appropriate logs are being kept for compliance and requirements for retention are met.
If the data classification is RED for the new environment, please create a Security Compliance Intake issue. Note that this is not necessary if the service is deployed in existing Production infrastructure.
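
As an illustration of the log redaction question above, the following is a minimal sketch in Python, not a description of how GitLab redacts logs. The function name `redact`, the two patterns, and the placeholder strings are all assumptions made for this example; a real service would need patterns that match its own data classification.

```python
import re

# Hypothetical patterns for illustration only: e-mail addresses and
# key=value pairs whose key suggests a credential.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SECRET_RE = re.compile(r"(?i)\b(token|password|secret)=\S+")


def redact(message: str) -> str:
    """Return `message` with e-mail addresses and credential values masked."""
    message = EMAIL_RE.sub("[REDACTED_EMAIL]", message)
    # Keep the key so the log line stays readable; drop only the value.
    return SECRET_RE.sub(lambda m: f"{m.group(1)}=[REDACTED]", message)
```

A function like this would typically be applied at the point where log lines are formatted, so that unredacted values never reach the log pipeline.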

Beta

Monitoring and Alerting

The items below will be reviewed by the Reliability team.

Link to examples of logs on https://logs.gitlab.net.
Link to the Grafana dashboard for this service.

Backup, Restore, DR and Retention

The items below will be reviewed by the Reliability team.

Are there custom backup/restore requirements?
Are backups monitored?
Was a restore from backup tested?
Link to information about growth rate of stored data.

Deployment

The items below will be reviewed by the Delivery team.

Will a change management issue be used for rollout? If so, link to it here.
Does this feature have any version compatibility requirements with other components (e.g., Gitaly, Sidekiq, Rails) that will require a specific order of deployments?
Is this feature validated by our QA blackbox tests?
Will it be possible to roll back this feature? If so, explain how.

Security

The items below will be reviewed by the InfraSec team.

Put yourself in an attacker's shoes and list some examples of "What could possibly go wrong?". Are you OK going into Beta knowing that?
Link to any outstanding security-related epics & issues for this feature. Are you OK going into Beta with those still on the TODO list?

General Availability

Monitoring and Alerting

The items below will be reviewed by the Reliability team.

Link to the troubleshooting runbooks.
Link to an example of an alert and a corresponding runbook.
Confirm that on-call Reliability SREs have access to this service and will be on-call. If this is not the case, please add an explanation here.

Operational Risk

The items below will be reviewed by the Reliability team.

Link to notes or testing results for assessing the outcome of failures of individual components.
What are the potential scalability or performance issues that may result from this change?
What are a few operational concerns that will not be present at launch, but may become a concern later?
Are there any single points of failure in the design? If so, list them here.
As a thought experiment, think of worst-case failure scenarios for this product feature. How can the blast radius of the failure be isolated?

Backup, Restore, DR and Retention

The items below will be reviewed by the Reliability team.

Are there any special requirements for Disaster Recovery for both Regional and Zone failures beyond our current Disaster Recovery processes that are in place?
How does data age? Can data over a certain age be deleted?

Performance, Scalability and Capacity Planning

The items below will be reviewed by the Reliability team.

Link to any performance validation that was done according to performance guidelines.
Link to any load testing plans and results.
Are there any potential performance impacts on the Postgres database or Redis when this feature is enabled at GitLab.com scale?
Explain how this feature uses our rate limiting features.
Are there retry and back-off strategies for external dependencies?
Does the feature account for brief spikes in traffic, at least 2x above the expected rate?
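
For the retry and back-off question above, the following Python sketch shows one common approach: exponential back-off with a delay cap and full jitter. It is an assumption-laden example, not the strategy used by any GitLab component. The function name `call_with_backoff` and all parameter names and defaults are hypothetical.

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call `operation` until it succeeds or `max_attempts` is reached.

    `sleep` and `rng` are injectable so the behaviour can be tested
    without real waiting or randomness.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            # Delay doubles on every failed attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: wait a random fraction of the computed delay so
            # many clients do not retry in lockstep.
            sleep(delay * rng())
```

In practice the `except` clause would be narrowed to the errors that are actually transient for the dependency in question, since retrying on every exception can hide real bugs.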

Deployment

The items below will be reviewed by the Delivery team.

Will a change management issue be used for rollout? If so, link to it here.
Are there healthchecks or SLIs that can be relied on for deployment/rollbacks?
Does building artifacts or deployment depend at all on gitlab.com?

Edited Dec 07, 2023 by Vladimir Glafirov