Improve visibility of time spent with external IO
## Problem
It's reasonable to say that time spent on external requests could be derived as `total - (cpu + redis + postgres + gitaly)`, but there is currently no easy way to filter by that amount in Kibana (or Prometheus dashboards).
## Proposal
- Similar to how we keep track of the number of Gitaly, Redis, and database calls, we could add the number of external calls and their total duration to the logs. For example:

  ```json
  {"severity":"INFO","time":"2020-12-22T07:28:10.505Z","queue":"cronjob:elastic_index_initial_bulk_cron","args":[],"class":"ElasticIndexInitialBulkCronWorker","duration_s":0.727714,"cpu_s":0.198394,"completed_at":"2020-12-22T07:28:10.505Z","db_duration_s":0.000356,"external_io":0.23}
  ```
- In Prometheus, we could add a counter with the same labels as the ones coming from `Gitlab::Metrics::Transaction`, plus a label for success or failure. This would give us error rates for these calls per transaction (a rough sketch follows this list).
- In the performance bar, there should be one more section for external IO. Each item in this section should include the HTTP method, path, hostname, request headers, response headers, request body (?), response body (?), and latency (from the caller's viewpoint).
- Similarly, the same information should be available in tracing records.
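As a rough illustration of the Prometheus idea, a counter could be registered through GitLab's existing metrics helpers. This is a minimal sketch: the metric name, label set, and exact helper signatures are assumptions for illustration, not a final API.

```ruby
# Hypothetical counter for external HTTP calls. It would carry the same
# labels as Gitlab::Metrics::Transaction, plus a success/failure label,
# so error rates can be computed per transaction.
EXTERNAL_HTTP_COUNTER = Gitlab::Metrics.counter(
  :gitlab_external_http_total,
  'Total number of external HTTP calls'
)

# Incremented once per external request; `status` distinguishes
# successful responses from failed ones. Label names are illustrative.
EXTERNAL_HTTP_COUNTER.increment(
  controller: 'ProjectsController', action: 'show', status: 'success'
)
```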
The main challenge of this task is that requests to third-party services are handled by many different HTTP libraries:
- A big portion of third-party requests made by GitLab are currently proxied through `Gitlab::HTTP` (from both Web and Sidekiq), which could be intercepted and instrumented. Under the hood, GitLab uses HTTParty.
- In some gems (`hipchat` and `flowdock`), HTTParty is used directly.
- Many gems use Faraday, including `ms_rest`, `ms_rest_azure`, `oauth2`, `octokit`, `sawyer`, `sentry`, `signet`, `acme-client`, `asana`, `azure`, `googleauth`, and `graphclient`.
- Other less common HTTP clients:
  - `rest-client`: used by `discordrb` and `kubeclient`
  - `excon`: used by `fog-*`
- An unknown number of gems and lines of code use `Net::HTTP` directly.
To address this problem, there are two possible solutions:
- Implement instrumentation middleware/patches for each of the HTTP clients above, and skip any client that is hard or impossible to cover. This seems error-prone.
- Most of the above HTTP clients (`Gitlab::HTTP`, HTTParty, RestClient, and Faraday) depend on Ruby's `Net::HTTP`. The other notable backend is `excon`, which implements the HTTP protocol directly on top of Ruby's TCP sockets. We can instrument a layer on top of the core methods of these two backends; by doing so, most external IO should be captured. This approach is common among Ruby instrumentation frameworks (Datadog, New Relic, Instana). A sketch follows this list.
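To make the second option concrete, here is a minimal sketch of what instrumenting `Net::HTTP` could look like, assuming a prepended module around `Net::HTTP#request`; the event name `external_http.request` is invented for illustration.

```ruby
require 'net/http'
require 'active_support/notifications'

# Sketch only: prepend a module so Net::HTTP#request is wrapped in an
# ActiveSupport::Notifications event. Every client built on Net::HTTP
# (HTTParty, RestClient, Faraday's default adapter) goes through here.
module ExternalHttpInstrumentation
  def request(req, body = nil, &block)
    ActiveSupport::Notifications.instrument(
      'external_http.request',
      method: req.method, host: address, port: port, path: req.path
    ) do |payload|
      response = super
      payload[:code] = response.code
      response
    end
  end
end

Net::HTTP.prepend(ExternalHttpInstrumentation)
```

An equivalent patch would be needed for `excon`, which bypasses `Net::HTTP` entirely.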
In either solution, the instrumentation broadcasts an `ActiveSupport::Notifications` event; subscribers then pick up these events and dispatch them to the desired destinations (instrumentation logs, Prometheus metrics, tracing, the performance bar, etc.).
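On the subscriber side, the fan-out could look roughly like the following sketch. `EXTERNAL_HTTP_COUNTER` refers to the counter sketched earlier, and the structured log call stands in for the Lograge integration; field names are illustrative.

```ruby
# Sketch: receive every external HTTP event and dispatch it to the
# destinations listed above (metrics, structured logs, ...).
ActiveSupport::Notifications.subscribe('external_http.request') do |_name, start, finish, _id, payload|
  duration_s = finish - start

  # Metrics: count the call and whether it succeeded.
  EXTERNAL_HTTP_COUNTER.increment(status: payload[:code] ? 'success' : 'error')

  # Logs: one structured entry per external request.
  Gitlab::AppLogger.info(
    message: 'External HTTP request',
    method: payload[:method],
    host: payload[:host],
    duration_s: duration_s
  )
end
```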
The work can be broken down as follows:

- Implement instrumenters for external HTTP requests in labkit.
  - Not sure whether more information should be added, such as request headers, response headers, and so on.
  - Requests to Jaeger endpoints are captured. They are eligible, but noisy; there should be a blocklist option for the instrumenter (a sketch follows this list).
  - Implement a publisher for `httpclient`. It turns out some gems (Google API Client and Elasticsearch) detect the HTTP client automatically and fall back to HTTPClient. I haven't implemented it yet because this MR is quite big.
- Integrate external HTTP notifications into Lograge.
- Integrate external HTTP notifications into `Gitlab::Metrics::Transaction`.
- Evaluate security risks, as the external URI (host, path, or query) may contain sensitive information.
- Update the Performance Bar documentation.
- Update the GitLab Prometheus metrics documentation.
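For the Jaeger noise mentioned above, the blocklist could be as simple as the following sketch; the constant name, patterns, and hook point are all hypothetical.

```ruby
# Hypothetical blocklist: hosts matching any pattern are skipped by the
# instrumenter, so requests to the tracing backend itself are not
# reported as external IO.
BLOCKED_HOSTS = [/\Ajaeger/i, 'localhost'].freeze

def blocked_host?(host)
  BLOCKED_HOSTS.any? { |pattern| pattern === host }
end

# Inside the instrumenter, before broadcasting the event:
#   return super if blocked_host?(address)
```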