Improve the reliability and observability of the container registry client
🍋 Context
Container Registry objects are not all present in the rails backend. For objects that are not in the rails database (such as image tags), the backend has to query the container registry api.
For those interactions with the container registry api, a single class is used: https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/container_registry/client.rb. This class is essentially a DSL to access the container registry. It's using a Faraday client with different options to access the container registry API.
The problem we have with the current state is:
- We do observe some intermittent network timeouts(internal).
- This brings some issues on operations that uses this client such as cleanup policies execution.
See #280545 (closed) for more details.
This MR tries to tackle two aspects of this client:
- Reliability
- Be more resilient against network timeout errors.
- Observability
- Properly log the network errors and the associated context (eg. which url in the container registry api was accessed)
- This will help us to provide some traces to the infrastructure team so that they can investigate those errors.
🔭 What does this MR do?
- Add network timeouts
- To keep things simple here, we're going to re use https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/http.rb#L21-25
- Add the faraday retry middleware.
- We're going to start simple: retry only once.
- Add a custom faraday middleware to be able to observe network errors.
- This middleware will have the option of setting a
callbackthat is called when an error is triggered. - The client will use a callback where the error is logged with the usual utilities.
- This middleware will have the option of setting a
🖥 Screenshots (strongly suggested)
Example of an timeout error logged
{"severity":"ERROR","time":"2020-12-31T10:54:22.465Z","correlation_id":"01ETW6XQ7WHYB9P7GZE0KZSZWG","tags.correlation_id":"01ETW6XQ7WHYB9P7GZE0KZSZWG","tags.locale":"en","user.id":1,"user.email":"admin@example.com","user.username":"root","extra.class":"ContainerRegistry::Client","extra.url":"http://gdk.test:5050/v2/gitlab.org/ci-cd/package-stage/cicd_102/story-of-beans/tags/list","exception.class":"Faraday::TimeoutError","exception.message":"Net::ReadTimeout with #\u003cTCPSocket:(closed)\u003e","exception.backtrace":["lib/gitlab/faraday/error_callback.rb:12:in `call'","lib/container_registry/client.rb:72:in `repository_tags'","app/models/container_repository.rb:68:in `manifest'","lib/gitlab/metrics/instrumentation.rb:160:in `block in manifest'","lib/gitlab/metrics/method_call.rb:27:in `measure'","lib/gitlab/metrics/instrumentation.rb:160:in `manifest'","app/models/container_repository.rb:82:in `tags_count'","lib/gitlab/metrics/instrumentation.rb:160:in `block in tags_count'","lib/gitlab/metrics/method_call.rb:27:in `measure'","lib/gitlab/metrics/instrumentation.rb:160:in `tags_count'","app/graphql/types/base_field.rb:46:in `block (2 levels) in resolve_field'","app/graphql/types/base_field.rb:35:in `block in resolve_field'","app/graphql/types/base_field.rb:31:in `resolve_field'","lib/gitlab/graphql/calls_gitaly/instrumentation.rb:18:in `block in instrument'","lib/gitlab/graphql/generic_tracing.rb:40:in `with_labkit_tracing'","lib/gitlab/graphql/generic_tracing.rb:30:in `platform_trace'","lib/gitlab/graphql/generic_tracing.rb:40:in `with_labkit_tracing'","lib/gitlab/graphql/generic_tracing.rb:30:in `platform_trace'","app/graphql/gitlab_schema.rb:43:in `multiplex'","app/controllers/graphql_controller.rb:68:in `execute_multiplex'","app/controllers/graphql_controller.rb:32:in `execute'","ee/lib/gitlab/ip_address_state.rb:10:in `with'","ee/app/controllers/ee/application_controller.rb:44:in `set_current_ip_address'","app/controllers/application_controller.rb:495:in `set_current_admin'","lib/gitlab/session.rb:11:in `with_session'","app/controllers/application_controller.rb:486:in `set_session_storage'","lib/gitlab/i18n.rb:73:in `with_locale'","lib/gitlab/i18n.rb:79:in `with_user_locale'","app/controllers/application_controller.rb:480:in `set_locale'","lib/gitlab/error_tracking.rb:52:in `with_context'","app/controllers/application_controller.rb:545:in `sentry_context'","app/controllers/application_controller.rb:473:in `block in set_current_context'","lib/gitlab/application_context.rb:56:in `block in use'","lib/gitlab/application_context.rb:56:in `use'","lib/gitlab/application_context.rb:22:in `with_context'","app/controllers/application_controller.rb:464:in `set_current_context'","lib/gitlab/metrics/elasticsearch_rack_middleware.rb:16:in `call'","lib/gitlab/middleware/rails_queue_duration.rb:33:in `call'","lib/gitlab/metrics/rack_middleware.rb:16:in `block in call'","lib/gitlab/metrics/transaction.rb:56:in `run'","lib/gitlab/metrics/rack_middleware.rb:16:in `call'","lib/gitlab/request_profiler/middleware.rb:17:in `call'","lib/gitlab/query_limiting/middleware.rb:17:in `block in call'","lib/gitlab/query_limiting/transaction.rb:39:in `run'","lib/gitlab/query_limiting/middleware.rb:16:in `call'","lib/gitlab/jira/middleware.rb:19:in `call'","lib/gitlab/middleware/go.rb:20:in `call'","lib/gitlab/etag_caching/middleware.rb:21:in `call'","lib/gitlab/middleware/multipart.rb:234:in `call'","lib/gitlab/middleware/read_only/controller.rb:50:in `call'","lib/gitlab/middleware/read_only.rb:18:in `call'","lib/gitlab/middleware/same_site_cookies.rb:27:in `call'","lib/gitlab/middleware/handle_malformed_strings.rb:21:in `call'","lib/gitlab/middleware/basic_health_check.rb:25:in `call'","lib/gitlab/middleware/handle_ip_spoof_attack_error.rb:25:in `call'","lib/gitlab/middleware/request_context.rb:23:in `call'","config/initializers/fix_local_cache_middleware.rb:9:in `call'","lib/gitlab/middleware/static.rb:11:in `call'","lib/gitlab/webpack/dev_server_middleware.rb:34:in `perform_request'","lib/gitlab/metrics/requests_rack_middleware.rb:76:in `call'","lib/gitlab/middleware/release_env.rb:12:in `call'"]}
Notice the presence of exception.class, extra.class and extra.url with the proper values.
🛃 Does this MR meet the acceptance criteria?
Conformity
-
Changelog entry - [-] Documentation (if required)
- [-] Code review guidelines
- [-] Merge request performance guidelines
- [-] Style guides
- [-] Database guides
- [-] Separation of EE specific content
Availability and Testing
- [-] Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
- [-] Tested in all supported browsers
- [-] Informed Infrastructure department of a default or new setting change, if applicable per definition of done
Security
If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:
- [-] Label as security and @ mention
@gitlab-com/gl-security/appsec - [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
- [-] Security reports checked/validated by a reviewer from the AppSec team