# Dependency Proxy Readiness for GitLab.com
## Problem to solve
The GitLab Dependency Proxy allows users to cache frequently used images from Docker Hub, helping to speed up their pipelines and reduce reliance on external sources.
Until recently, it has only been available on self-managed instances that have enabled the Puma web server. That work is nearing completion (gitlab-com/gl-infra&78, closed), and soon the Dependency Proxy will be widely available to GitLab users.
However, the feature was not built by anyone in the Package Group, and we must ensure that we are ready for the feature to be in production. We need time to better document the architecture, conduct an operational risk assessment, run performance testing, and complete other production readiness tasks.
## Intended users
- Delaney (Development Team Lead)
- Sasha (Software Developer)
- Devon (DevOps Engineer)
- Sidney (Systems Administrator)
## Proposal
The GitLab Infrastructure team opened gitlab-com/gl-infra/readiness!16 (merged) to help prepare for the rollout by:
- Documenting the architecture of the Dependency Proxy
- Identifying any operational or security risks
- Documenting performance tests
- Identifying subject matter experts and support plans
I propose that we fill in the requested information and prepare for the general availability of the GitLab Dependency Proxy.
## Further details
### Architecture
- [ ] Add architecture diagrams to this issue showing the feature's components and how they interact with existing GitLab components. Include internal dependencies, ports, security policies, etc.
- [x] Describe each component of the new feature and enumerate what it does to support customer use cases. - The proxy is essentially a localized cache of Docker images, held to speed up Docker image pulls in CI. The major component is the storage mechanism (object storage for GitLab.com). When GitLab receives a request from the Docker client to download an image, it first downloads the image into object storage, then returns it in the response. On subsequent requests, it only has to fetch the data from object storage rather than making external calls (see the sketch after this list).
- [x] For each component and dependency, what is the blast radius of failures? Is there anything in the feature design that will reduce this risk? - If the dependency proxy fails for some reason (loss of storage or connection), it will affect any CI users utilizing this feature. It can be turned off to return to the standard behavior of pulling images directly from external sources.
- [x] If applicable, explain how this new feature will scale and any potential single points of failure in the design. - The single point of failure is the storage of the proxied images (blobs and manifests). If that storage connection is lost, the dependency proxy will return errors.
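To make the read-through flow above concrete, here is a minimal sketch of the caching logic in Ruby. The class and method names are illustrative assumptions, not the actual GitLab implementation.

```ruby
# A minimal sketch of the read-through cache described above; class and
# method names are illustrative, not the actual GitLab implementation.
module DependencyProxy
  class BlobCache
    # Serve a blob from object storage when cached; otherwise fetch it
    # from Docker Hub once, persist it, and serve the stored copy.
    def fetch(image, digest)
      cached = find_blob(image, digest)
      return cached if cached # cache hit: no external call needed

      data = download_from_docker_hub(image, digest) # external call, made once
      store_blob(image, digest, data)                # subsequent pulls hit this copy
    end

    private

    # Placeholders: in the real feature these would map to Rails models
    # backed by object storage and HTTP calls against Docker Hub.
    def find_blob(image, digest); end
    def download_from_docker_hub(image, digest); end
    def store_blob(image, digest, data); end
  end
end
```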
### Operational Risk Assessment
- [x] What are the potential scalability or performance issues that may result from this change? - The only major risk is to CI image pulls: if the dependency proxy gets overloaded and slows down, image pulls could slow down as well. This is why we require Puma, to allow for more workers. The next step to promote higher scalability and performance is to move the image pull/download logic to Workhorse, as specified in #11548 (closed).
- [x] List the external and internal dependencies of the application (e.g. Redis, Postgres) for this feature and how it will be impacted by a failure of that dependency.
  - Registry storage: if the connection to object storage fails, the dependency proxy will not function, and when CI tries to pull images, the pulls will error out.
  - GitLab Rails: this feature utilizes Rails routes and controllers; if the main Rails app is not serving requests, the feature will fail.
- [x] Were there any features cut or compromises made to make the feature launch? - None apparent; this feature has not changed since it was originally released.
- [x] List the top three operational risks when this feature goes live. - Only one: a script pulls a lot of images through the dependency proxy, and response time for other web requests increases. With Puma we can simply spawn a lot of workers, so it does not matter if some of them are stuck while Docker images are downloaded from Docker Hub and cached in GitLab before being served to the Docker client. (A hypothetical load script illustrating this risk is sketched after this list.)
- [x] What are a few operational concerns that will not be present at launch, but may be a concern later? - At launch, no data will exist within the proxy, but over time the dependency proxy will build up a large number of Docker images (not unlike the container registry), which could lead to excess disk usage.
- [x] Can the new product feature be safely rolled back once it is live? Can it be disabled using a feature flag? - It can be safely turned off via config settings (see the `gitlab.rb` sketch after this list), but not with a feature flag. It is completely safe to turn off the dependency proxy at any time while it is in use. The only effect is that pipelines previously sped up by the proxy will return to normal speeds.
- [x] Document every way the customer will interact with this new feature and how customers will be impacted by a failure of each interaction. - Customers will be able to enable the dependency proxy at the group level. It will then effectively cache the images they use in CI to speed up subsequent CI runs (no longer having to pull from external APIs). In case of an extreme failure (the connection to object storage is lost), their pipelines may fail; however, this can be remedied by turning the dependency proxy off.
- [x] As a thought experiment, think of worst-case failure scenarios for this product feature; how can the blast radius of the failure be isolated? - Use of Puma could cause a complete backup of workers. A possible solution would be to restrict some web machines from serving dependency proxy requests.
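The "lots of image pulls" risk above could be reproduced with a small load script. The sketch below is hypothetical: the host, group path, and manifest URL layout are assumptions for illustration, not a documented load-test plan.

```ruby
# Hypothetical load generator: issue many concurrent pulls through the
# dependency proxy and watch response times of unrelated web requests.
require 'net/http'
require 'uri'

HOST   = 'https://gitlab.example.com' # assumed instance URL
GROUP  = 'my-group'                   # assumed group with the proxy enabled
IMAGES = %w[alpine ruby node].freeze

threads = 30.times.map do |i|
  Thread.new do
    image = IMAGES[i % IMAGES.size]
    # Assumed dependency proxy manifest URL layout.
    uri = URI("#{HOST}/v2/#{GROUP}/dependency_proxy/containers/#{image}/manifests/latest")
    response = Net::HTTP.get_response(uri)
    puts "#{image}: #{response.code}"
  end
end
threads.each(&:join)
```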
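For the rollback path mentioned above, here is a minimal sketch of the Omnibus configuration, assuming the `dependency_proxy_enabled` setting name from the GitLab administration docs. With the feature off, pipelines fall back to pulling images directly from external sources.

```ruby
# /etc/gitlab/gitlab.rb — disable the Dependency Proxy instance-wide,
# then apply with `sudo gitlab-ctl reconfigure`.
gitlab_rails['dependency_proxy_enabled'] = false
```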
### Security
- [x] Were the GitLab security development guidelines followed for this feature? - Not 100% positive, but this is a fairly confident "yes".
- [ ] If this feature requires new infrastructure, will it be updated regularly with OS updates?
- [ ] Has effort been made to obscure or elide sensitive customer data in logging?
- [x] Is any potentially sensitive user-provided data persisted? If so, is this data encrypted at rest? - No; in its current state, the feature only proxies requests and stores images from the public Docker registry.
### Performance
- [ ] Explain what validation was done following GitLab's performance guidelines. Please explain or link to the results below.
- [x] Are there any potential performance impacts on the database when this feature is enabled at GitLab.com scale? - Unlikely; this feature does not touch the database.
- [x] Are there any throttling limits imposed by this feature? If so, how are they managed? - No.
- [ ] If there are throttling limits, what is the customer experience of hitting a limit?
- [ ] For all dependencies external and internal to the application, are there retry and back-off strategies for them?
- [x] Does the feature account for brief spikes in traffic, at least 2x above the expected TPS? - Unknown. We have never measured what the impact could be once users start to use it, though I don't think we would reach 2x.
### Backup and Restore
This is currently only a caching service and does not need backups.
### Monitoring and Alerts
- [x] Is the service logging in JSON format and are logs forwarded to Logstash? - Logging goes to the Rails logs.
- [x] Is the service reporting metrics to Prometheus? - No.
- [ ] How is the end-to-end customer experience measured? - Open question for Tim.
- [ ] Do we have a target SLA in place for this service? - Open question for Tim.
- [ ] Do we know what the indicators (SLIs) are that map to the target SLA? - Open question for Tim.
- [x] Do we have alerts that are triggered when the SLIs (and thus the SLA) are not met? - No.
- [x] Do we have troubleshooting runbooks linked to these alerts? - No.
- [x] What are the thresholds for tweeting or issuing an official customer notification for an outage related to this feature? - Nothing is in place.
### Responsibility
- [x] Which individuals are the subject matter experts and know the most about this feature? - The backend Package team is responsible for this feature, with #s_package as the Slack point of contact, Dan Croft as the engineering manager point of contact, and the backend engineers as the subject matter experts. The MVC (as it currently exists) was coded by Dmitriy Zaporozhets, so he would be the premier expert initially.
- [x] Which team or set of individuals will take responsibility for the reliability of the feature once it is in production? - The infrastructure team is responsible for the config, which consists of the object storage setup and the `gitlab.rb` config that turns the feature on and off for the entire instance. The Package team is responsible for the reliability of the feature's functionality.
- [x] Is someone from the team who built the feature on call for the launch? If not, why not? - I'm sure someone from the Package team will be available to be on call; however, Dmitriy Zaporozhets built the feature and would likely be the fastest to answer any major concerns initially.
### Testing
- [x] Describe the load test plan used for this feature. What breaking points were validated? - There was no load test plan.
- [x] For the component failures that were theorized for this feature, were they tested? If so, include the results of those failure tests. - This was not done.
- [x] Give a brief overview of what tests are run automatically in GitLab's CI/CD pipeline for this feature. - The feature is supported by RSpec unit tests (models/services/routes) and integration tests (controllers).
## What does success look like, and how can we measure that?
- We understand what work is required to make the feature widely available.