
Expose queue duration related metrics in job payload sent to the runner

Merged Tomasz Maczukin requested to merge add-queued-for-to-job-payload into master

What does this MR do and why?

This MR adds queued_for and project_jobs_running_on_instance_runners_count fields to the job_info section of the job payload sent to GitLab Runner.

queued_for is the difference between the time the response is generated and the time the job was scheduled for queueing (which is recorded in the queued_at field of the database record). With it, the Runner knows how long the job was waiting in the queue.

This value will next be used on the GitLab Runner side to generate a histogram metric representing queueing times for each specific runner and [[runners]] worker that asks for jobs. This will allow runner administrators to track queueing times of jobs targeting their runners.

Similarly, project_jobs_running_on_instance_runners_count, sent for jobs targeting instance runners (and only instance-type runners!), will allow runner administrators to see how scheduled jobs fit into the fair scheduling algorithm.
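
For illustration, here is a minimal Go sketch of how the extended job_info section could be modeled on the Runner side. The struct and JSON layout are assumptions based on the field names described above, not the actual GitLab Runner types:

```go
package job

// JobInfo is a hypothetical model of the extended job_info section; the field
// names follow this MR's description, everything else is illustrative.
type JobInfo struct {
	Name      string `json:"name"`
	Stage     string `json:"stage"`
	ProjectID int64  `json:"project_id"`

	// Seconds the job spent in the queue before being picked up, derived
	// from queued_at on the GitLab side.
	QueuedFor *float64 `json:"queued_for,omitempty"`

	// How many of the project's jobs are currently running on instance
	// runners; sent only for jobs targeting instance-type runners.
	ProjectJobsRunningOnInstanceRunnersCount *int64 `json:"project_jobs_running_on_instance_runners_count,omitempty"`
}
```

Both fields are modeled as pointers in this sketch so the Runner could distinguish "not sent by an older GitLab version" from a genuine zero value.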

We will use this information, for example, to improve our SLI definitions for SaaS runners on GitLab.com. Currently our apdex is the same for all of the different runner types that we manage, as it's calculated from a general metric exposed by GitLab, which by design doesn't partition this information per runner.

With this change and the planned GitLab Runner change, each runner owner will be able to track this data alongside other runner metrics. For us this means that we will be able to define different SLIs for each SaaS runner shard that we maintain.

The biggest value of this change, however, is that the metric becomes usable for self-hosted GitLab Runner instances. On GitLab installations like GitLab.com, individual users who self-host their runners and would like to track queueing performance are unable to do so, as GitLab's internal metrics are... well... internal :wink: They are available only to GitLab instance administrators (so us, in the case of GitLab.com). And a global metric like job_queue_duration_seconds exposed by GitLab can't be partitioned by individual Runner ID, as such data cardinality would quickly kill any Prometheus server.

By passing this data to the Runner and exposing it there, each Runner owner can track the queueing times of their own runner instances, without GitLab administrators needing to expose GitLab's metrics, and with the data partitioned by each individual runner.
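
As a rough sketch of what such a Runner-side metric could look like, assuming the prometheus/client_golang library and illustrative metric, label, and bucket names (the actual implementation lives in the Runner MR linked below and may differ):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// jobQueueDuration is an illustrative histogram partitioned per runner and
// per [[runners]] worker; the metric name and buckets are assumptions, not
// the names used by GitLab Runner.
var jobQueueDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "job_queue_duration_seconds",
		Help:    "Time jobs spent in the CI queue before this runner picked them up.",
		Buckets: prometheus.ExponentialBuckets(1, 2, 12), // 1s, 2s, 4s, ... ~34m
	},
	[]string{"runner", "worker"},
)

func init() {
	prometheus.MustRegister(jobQueueDuration)
}

// ObserveQueuedFor records the queued_for value received in job_info for a
// given runner and worker, so queueing times show up in the runner's own
// Prometheus endpoint instead of GitLab's internal metrics.
func ObserveQueuedFor(runner, worker string, queuedForSeconds float64) {
	jobQueueDuration.WithLabelValues(runner, worker).Observe(queuedForSeconds)
}
```

With something like this in place, queueing times are scraped from each runner's own /metrics endpoint, keeping the data partitioned per runner without touching GitLab's internal metrics.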

The corresponding GitLab Runner update is at gitlab-runner!3499 (merged).

Screenshots or screen recordings

N/A

How to set up and validate locally

Numbered steps to set up and validate the change are strongly suggested.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Grzegorz Bizon

Merge request reports


Activity

  • Grzegorz Bizon removed review request for @grzesiek

  • Nice work @tmaczukin! I left a question about where to implement this logic.

  • Tomasz Maczukin added 5588 commits

    • 3d6de1a8...0ea8b489 - 5587 commits from branch master
    • 5f97f00c - Expose queue related metrics job payload sent to the runner

    Compare with previous version

  • Tomasz Maczukin changed the description

  • Tomasz Maczukin requested review from @grzesiek

  • Grzegorz Bizon: Thanks @tmaczukin! We are a bit closer to a good solution here! I left one question / suggestion about making one metric a bit less confusing.

  • Grzegorz Bizon removed review request for @grzesiek

  • Tomasz Maczukin added 58117 commits

    Compare with previous version

  • 🤖 GitLab Bot 🤖 added section::ci label and removed section::ops label

  • A deleted user added database::review pending label
  • Ghost User
  • Contributor
    3 Warnings

    :warning: feature::addition and feature::enhancement merge requests normally have a documentation change. Consider adding a documentation update or confirming the documentation plan with the Technical Writer counterpart.

    :warning: There were no new or modified feature flag YAML files detected in this MR.

    If the changes here are already controlled under an existing feature flag, please add the feature flag::exists label. Otherwise, if you think the changes here don't need to be under a feature flag, please add the feature flag::skipped label, and add a short comment about why we skipped the feature flag.

    For guidance on when to use a feature flag, please see the documentation.

    :warning: This merge request does not refer to an existing milestone.
    2 Messages
    :book: CHANGELOG missing:

    If you want to create a changelog entry for GitLab FOSS, add the Changelog trailer to the commit message you want to add to the changelog.

    If you want to create a changelog entry for GitLab EE, also add the EE: true trailer to your commit message.

    If this merge request doesn't need a CHANGELOG entry, feel free to ignore this message.

    :book: This merge request adds or changes files that require a review from the Database team.

    This merge request requires a database review. To make sure these changes are reviewed, take the following steps:

    1. Ensure the merge request has database and database::review pending labels. If the merge request modifies database files, Danger will do this for you.
    2. Prepare your MR for database review according to the docs.
    3. Assign and mention the database reviewer suggested by Reviewer Roulette.

    If you no longer require a database review, you can remove this suggestion by removing the database label and re-running the danger-review job.

    Reviewer roulette

    Changes that require review have been detected!

    Please refer to the table below for assigning reviewers and maintainers suggested by Danger in the specified category:

    Category | Reviewer | Maintainer
    backend | Aaron Huntsman (@ahuntsman) (UTC-5, 7 hours behind @tmaczukin) | Nicolas Dular (@nicolasdular) (UTC+2, same timezone as @tmaczukin)
    ~"Verify" | Reviewer review is optional for ~"Verify" | Fabio Pitino (@fabiopitino) (UTC+1, 1 hour behind @tmaczukin)
    ~"Verify backend" | Reviewer review is optional for ~"Verify backend" | Marius Bobin (@mbobin) (UTC+3, 1 hour ahead of @tmaczukin)

    To spread load more evenly across eligible reviewers, Danger has picked a candidate for each review slot, based on their timezone. Feel free to override these selections if you think someone else would be better-suited or use the GitLab Review Workload Dashboard to find other available reviewers.

    To read more on how to use the reviewer roulette, please take a look at the Engineering workflow and code review guidelines. Please consider assigning a reviewer or maintainer who is a domain expert in the area of the merge request.

    Once you've decided who will review this merge request, assign them as a reviewer! Danger does not automatically notify them for you.

    If needed, you can retry the :repeat: danger-review job that generated this comment.

    Generated by :no_entry_sign: Danger

  • Tomasz Maczukin changed the description

  • Author Maintainer

    @grzesiek Finally, I had some time to get back to this MR. I've applied some of the changes we've discussed and resolved the conflicts.

    What do you think about this MR now?

  • Tomasz Maczukin requested review from @grzesiek

  • Contributor

    :warning: @tmaczukin Some end-to-end (E2E) tests have been selected based on the stage label on this MR.

    Please start the trigger-omnibus-and-follow-up-e2e job in the qa stage and ensure the tests in follow-up-e2e:package-and-test-ee pipeline are passing before this MR is merged. (The E2E test pipeline is computationally intensive and we cannot afford running it automatically for all pushes/rebases. Therefore, this job must be triggered manually after significant changes at least once.)

    If you would like to run all E2E tests, please apply the pipeline:run-all-e2e label and trigger a new pipeline. This will run all tests in e2e:package-and-test pipeline.

    The E2E test jobs are allowed to fail due to flakiness. For the list of known failures please refer to the latest pipeline triage issue.

    Once done, please apply the :white_check_mark: emoji on this comment.

    For any questions or help in reviewing the E2E test results, please reach out on the internal #quality Slack channel.

  • I left two questions @tmaczukin!

  • Grzegorz Bizon removed review request for @grzesiek
