
Experiment with pushing internal job metrics to InfluxDB

David Dieulivol requested to merge 416597-job_profiling_to_influxDB into master

Context

Related to #416597. In particular, have a look at the Technical Implementation.

What does this MR do and why?

Disabled when CI_JOB_METRICS_ENABLED is not set to true

See https://gitlab.com/gitlab-org/gitlab/-/jobs/4921091933#L158:

$ tooling/bin/push_job_metrics || true
[job-metrics] Feature disabled because CI_JOB_METRICS_ENABLED is not set to true.
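
The guard behind this log line could look roughly like the sketch below. It only illustrates the behaviour shown in the job log: the method names `metrics_enabled?` and `push_job_metrics` are hypothetical and not taken from this MR, and the actual push is left out.

```ruby
# Hypothetical helper (name is an assumption): the feature is only on when the
# variable is exactly the string "true".
def metrics_enabled?(env = ENV)
  env['CI_JOB_METRICS_ENABLED'] == 'true'
end

# Hypothetical entry point: returns false without pushing when the feature is off.
def push_job_metrics(env = ENV)
  unless metrics_enabled?(env)
    puts '[job-metrics] Feature disabled because CI_JOB_METRICS_ENABLED is not set to true.'
    return false
  end

  # The real push to InfluxDB would happen here (omitted in this sketch).
  true
end
```

Passing the environment as an argument keeps the check easy to exercise with a plain hash instead of the real `ENV`.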

General

Add CI job metrics to InfluxDB. Specifically, we add the rspec_retried_in_new_process metric, to track when a job triggered a new RSpec process.

Results

In !125546 (83f7d9f3), we're displaying the metrics instead of pushing them. This change additionally makes some specs fail, so that we can test whether we'll push metrics in all scenarios 😄. Below is the metrics output from one of the failed jobs:

{:name=>"job-metrics", :time=>2023-07-27 13:50:34 +0000, :tags=>{:job_finished_at=>"2023-07-27T14:07:32+00:00", :job_name=>"rspec fail-fast", :job_stage=>"test", :job_started_at=>"2023-07-27T14:02:40Z", :job_status=>"running", :project_id=>"278964", :rspec_retried_in_new_process=>"true", :server_host=>"gitlab.com"}, :fields=>{:job_id=>"4752968967", :merge_request_iid=>"125546", :pipeline_id=>"947415645"}}

Note the rspec_retried_in_new_process key set to true 🎉
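
As a rough illustration of how a data point with this shape could be assembled, here is a minimal sketch. The builder name `build_job_metrics_point` and the mapping from GitLab predefined CI variables to tags and fields are assumptions made for this example, not the code in this MR.

```ruby
# Hypothetical builder: the variable-to-key mapping below is an assumption
# for illustration only.
def build_job_metrics_point(env, extra_tags: {}, now: Time.now.utc)
  tags = {
    job_name: env['CI_JOB_NAME'],
    job_stage: env['CI_JOB_STAGE'],
    job_started_at: env['CI_JOB_STARTED_AT'],
    job_status: env['CI_JOB_STATUS'],
    project_id: env['CI_PROJECT_ID'],
    server_host: env['CI_SERVER_HOST']
  }.merge(extra_tags)

  fields = {
    job_id: env['CI_JOB_ID'],
    merge_request_iid: env['CI_MERGE_REQUEST_IID'],
    pipeline_id: env['CI_PIPELINE_ID']
  }

  # Drop keys whose variable is not set, so the point only carries known values.
  { name: 'job-metrics', time: now, tags: tags.compact, fields: fields.compact }
end

point = build_job_metrics_point(
  { 'CI_JOB_NAME' => 'rspec fail-fast', 'CI_JOB_ID' => '4752968967' },
  extra_tags: { rspec_retried_in_new_process: 'true' }
)
puts point[:tags].inspect
```

In InfluxDB, tags are indexed and fields are not, which is why values we expect to filter or group by (such as `rspec_retried_in_new_process`) sit under `tags`.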

How to set up and validate locally

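# Should fail because `JOB_METRICS_FILE_PATH` isn't set.
# Note: `|| true` masks the exit status, so `echo $?` prints 0; drop `|| true` to see the real exit code.
tooling/bin/create_job_metrics_file || true
echo $?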

export JOB_METRICS_FILE_PATH=tmp/job-metrics.json
rm -rf $JOB_METRICS_FILE_PATH
tooling/bin/create_job_metrics_file || true # Should succeed

# Check the metrics file
cat $JOB_METRICS_FILE_PATH | jq .

# Update the metrics file
tooling/bin/update_job_metrics_tag rspec_retried_in_new_process 1 || true

# Check the metrics file
cat $JOB_METRICS_FILE_PATH | jq .

# Push the metrics to InfluxDB - should fail since you don't have the env variables configured
tooling/bin/push_job_metrics || true
echo $?
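
For readers who want to see the create/update flow without the scripts, the sketch below mimics it in plain Ruby. The helpers `create_metrics_file` and `update_metrics_tag` are hypothetical stand-ins for the `tooling/bin` scripts, and the JSON layout (a top-level `tags` object) is an assumption based on the output shown in Results.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical stand-in for tooling/bin/create_job_metrics_file.
# Returns false when no path is configured, mirroring the "should fail" case.
def create_metrics_file(path, point)
  return false if path.nil? || path.empty?

  File.write(path, JSON.generate(point))
  true
end

# Hypothetical stand-in for tooling/bin/update_job_metrics_tag.
# Tag values are stored as strings.
def update_metrics_tag(path, tag_name, tag_value)
  return false unless path && File.exist?(path)

  point = JSON.parse(File.read(path))
  point['tags'] ||= {}
  point['tags'][tag_name] = tag_value.to_s
  File.write(path, JSON.generate(point))
  true
end

# Usage: create the file, set a tag, and print the result.
Dir.mktmpdir do |dir|
  path = File.join(dir, 'job-metrics.json')
  create_metrics_file(path, { 'name' => 'job-metrics', 'tags' => {} })
  update_metrics_tag(path, 'rspec_retried_in_new_process', 1)
  puts File.read(path)
end
```

Both helpers return a boolean rather than raising, so a caller can decide whether a failure should break the job or be ignored.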

🆕 In case we need to push less often to InfluxDB 🆕

Push only half of the time to InfluxDB.

# We push metrics 50% of the time.
if rand < 0.5
  puts "[job-metrics] Will not push to influxDB (we only push in 50% of the cases)."
  exit(1)
end
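
If the 50% figure ever needs tuning, the same idea can be written with a configurable rate. This is only a sketch: `push_sampled?` and its `rate` parameter are hypothetical names, and the random source is injectable purely so the behaviour can be checked deterministically.

```ruby
# Hypothetical helper: returns true when this run should push metrics.
# `rate` is the fraction of runs that push (0.0 = never, 1.0 = always).
def push_sampled?(rate, random: Random.new)
  # Random#rand with no argument returns a float in [0.0, 1.0).
  random.rand < rate
end

unless push_sampled?(0.5)
  puts '[job-metrics] Will not push to InfluxDB (sampled out).'
end
```

Unlike the snippet above, this sketch does not call `exit`; whether a sampled-out run should exit non-zero is left to the caller.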

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Nao Hashizume
