Spike: Allow exporting job execution metrics to ClickHouse

We should create a spike for capturing job execution metrics for storage/analysis in ClickHouse, as part of the MVC proposal/Proposal 1 in gitlab-org/quality/analytics&22.

Proposal

The runner captures metrics for the standard internal stages of job execution and reports them back to GitLab. It could do so either in metrics.txt, so we can continue leveraging the MR widget integration, or, if necessary, as a JSON payload sent when the runner requests changing the job status to success/failed. For example:

git_clone_duration_s 13
cache_download_duration_s 21
cache_upload_duration_s 3
artifact_upload_duration_s 17

The values would then be stored in ClickHouse's ci_finished_builds table.
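
As a rough illustration of the metrics.txt option, the sample lines above could be read back into a name/value mapping as sketched below. This assumes the simplest "name value" form of the file, without labels or timestamps; the function name `parse_metrics` is hypothetical and not part of GitLab or the runner.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse simple 'name value' lines into a dict, skipping blanks and comments."""
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last space so the value is always the final token.
        name, _, value = line.rpartition(" ")
        if not name:
            raise ValueError(f"malformed metric line: {line!r}")
        metrics[name.strip()] = float(value)
    return metrics


# The sample metrics from the proposal, in metrics.txt form.
sample = """\
git_clone_duration_s 13
cache_download_duration_s 21
cache_upload_duration_s 3
artifact_upload_duration_s 17
"""

runner_metrics = parse_metrics(sample)
```

A real metrics.txt may also contain user-defined metrics with labels, which this sketch does not handle; that overlap is part of what the spike needs to evaluate.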

Questions

  • metrics.txt is typically for custom metrics from user jobs. Using it for runner-internal metrics might be confusing. The spike should evaluate if this is the right mechanism.

CI Functions (future direction)

With CI Functions, things become more generic: the step runner would simply capture the name and execution time of every function executed and report them back to GitLab, so that the user can examine the timings of any CI function call in the pipeline. I'm not yet sure how we'd get the step runner to report back the execution times. One possibility is for it to simply append to the metrics.txt file. cc @josephburnett
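
To make the "append to metrics.txt" idea concrete, here is a minimal sketch of timing a function call and appending one line per call. It is written in Python for brevity only and does not reflect how the step runner is implemented; `timed_function`, the `_duration_s` suffix for function names, and the three-decimal formatting are all assumptions made for illustration.

```python
import time
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def timed_function(name: str, metrics_path: Path, clock=time.monotonic):
    """Time the wrapped block and append '<name>_duration_s <seconds>' to metrics_path."""
    start = clock()
    try:
        yield
    finally:
        # Record the timing even if the function raised, so failed calls are visible too.
        elapsed = clock() - start
        with metrics_path.open("a", encoding="utf-8") as fh:
            fh.write(f"{name}_duration_s {elapsed:.3f}\n")
```

A caller would wrap each function invocation in `with timed_function("some_function", Path("metrics.txt")):`. The `clock` parameter exists only so the timing source can be swapped out, for example in tests.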

Storage

The captured data is stored in the ci_finished_builds table, ideally by the existing ClickHouse::DataIngestion::CiFinishedBuildsSyncService. The choice of format rests on the following considerations:

  • adding new columns is not an issue in a columnar database such as ClickHouse;
  • given the dynamic nature of CI Functions, we'll probably want to store their timings as a JSON field.

If the goal is to perform special querying on the standard job stages, we might choose to have a mix of both: discrete columns for standard job stages and a JSON column for everything else.
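
The mixed layout could look roughly like the sketch below, which splits a set of captured metrics into discrete columns for the standard stages and a single JSON value for everything else. The column name `function_timings`, the `build_row` helper, and the use of `id` as the key are hypothetical and only illustrate the shape of the row; the actual schema is a deliverable of the spike.

```python
import json

# Assumed set of standard stage metrics that would get their own columns.
STANDARD_COLUMNS = (
    "git_clone_duration_s",
    "cache_download_duration_s",
    "cache_upload_duration_s",
    "artifact_upload_duration_s",
)


def build_row(build_id: int, metrics: dict[str, float]) -> dict:
    """Split metrics into discrete standard columns plus one JSON column for the rest."""
    row = {"id": build_id}
    for column in STANDARD_COLUMNS:
        # None when the stage did not run or reported no value.
        row[column] = metrics.get(column)
    extra = {k: v for k, v in metrics.items() if k not in STANDARD_COLUMNS}
    # Hypothetical JSON column holding timings for arbitrary CI Functions.
    row["function_timings"] = json.dumps(extra, sort_keys=True)
    return row
```

With this split, queries over the standard stages can use plain columns, while timings for arbitrary functions stay queryable through ClickHouse's JSON functions without requiring a schema change per function.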

Deliverables

  • Extensible architecture for collecting metrics from runner/step runner
  • ClickHouse data structure
  • Estimate of how much data this will generate at scale
  • Job ingestion performance impact
