Summary of (Waiting) Jobs and Monitoring for Runners

Problem to solve

When an on-premises GitLab installation manages many runners at once, they are hard to monitor. Moreover, it is even harder to decide when to add runner capacity or when to shift capacity from smaller to bigger machines (or vice versa) in a virtualized environment. GitLab should provide monitoring and statistics for runners to support administrative and business decisions.

Intended users

Further details

Currently, GitLab has no tool to observe:

1. inactive (no heartbeat), active and idle (waiting for job) runners
2. average job queue time
   - per instance
   - per project (or group)
   - per runner tag
3. runner utilization per instance
   - max/average CPU per run
   - max/average RAM per run
4. runner utilization per project
   - max/average CPU per run for one project (or group)
   - max/average RAM per run for one project (or group)
5. runner utilization per runner tag
   - max/average CPU per run
   - max/average RAM per run
6. statistics per tag
   - usage over time (per day, per week, per month)
   - used together with other tags
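
As an illustration of the queue-time figures listed above, here is a minimal sketch of how they could be derived once per-job data is collected. The record layout and the field names (`project`, `tags`, `queued_seconds`) are assumptions made for this example, not an existing GitLab data model.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical job records; field names are assumptions for this sketch.
jobs = [
    {"project": "firmware", "tags": ["big"], "queued_seconds": 120.0},
    {"project": "firmware", "tags": ["small"], "queued_seconds": 10.0},
    {"project": "website", "tags": ["small"], "queued_seconds": 30.0},
    {"project": "website", "tags": ["big", "gcc-9"], "queued_seconds": 240.0},
]


def average_queue_time(jobs, key):
    """Return the mean queue time in seconds for each value of `key`.

    With key="tags", a job counts once for every tag it carries.
    """
    buckets = defaultdict(list)
    for job in jobs:
        value = job[key]
        values = value if isinstance(value, list) else [value]
        for item in values:
            buckets[item].append(job["queued_seconds"])
    return {item: mean(times) for item, times in buckets.items()}


per_tag = average_queue_time(jobs, "tags")
per_project = average_queue_time(jobs, "project")
instance_average = mean(job["queued_seconds"] for job in jobs)
```

With this sample data, a high average for `big` next to a low one for `small` would hint that the big runners are the bottleneck rather than runner capacity in general.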

Let's assume a system of 30 runners tagged as big (5), medium (10) and small (15), each installed with several compiler tools. They run on 8 big servers that use VMware ESXi to virtualize the hardware. The VMs range from 2 vCPUs with 4 GB RAM to 16 vCPUs with 64 GB RAM.

These statistics could help to answer the following questions:

Do we need more runner capacity (in general)?
Do we need more RAM in runners?
Do we need more RAM in small runners?
Does a certain project need an exclusive big runner to reduce job queue times?
Do we need more runners because queue time is high?
Does a project or group of projects use big jobs, which cause swapping?
What tags are triggered most?
Should another runner be installed offering this tag?
Are some tags rarely used (e.g. older compiler versions)?
Which tags have long queue times?
Can a pipeline with e.g. 3 jobs be specified to run on big -> small -> medium?
Which runners support a certain tag?

Regarding the first item in the list above (runner states): inactive runners are already shown, but to see running jobs, each runner must be opened individually in the admin area. Moreover, this view is visible only to admins and not to group owners, who therefore cannot see how many jobs in a project group are pending.
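
A summary view of runner states could be as simple as the following sketch. The five-minute heartbeat timeout, the record fields (`name`, `last_contact`, `running_jobs`) and the state names are assumptions chosen for illustration; they do not describe how GitLab tracks runners internally.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Assumed threshold: a runner without contact for this long counts as inactive.
HEARTBEAT_TIMEOUT = timedelta(minutes=5)


def runner_state(last_contact, running_jobs, now):
    """Classify a runner as 'inactive', 'active' (running a job) or 'idle'."""
    if last_contact is None or now - last_contact > HEARTBEAT_TIMEOUT:
        return "inactive"
    return "active" if running_jobs > 0 else "idle"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

# Hypothetical runner records for the sketch.
runners = [
    {"name": "big-01", "last_contact": now - timedelta(seconds=30), "running_jobs": 2},
    {"name": "small-07", "last_contact": now - timedelta(minutes=1), "running_jobs": 0},
    {"name": "medium-03", "last_contact": now - timedelta(hours=3), "running_jobs": 0},
    {"name": "small-11", "last_contact": None, "running_jobs": 0},
]

summary = Counter(
    runner_state(r["last_contact"], r["running_jobs"], now) for r in runners
)
```

Here `summary` gives one count per state, which is the kind of at-a-glance figure that currently requires opening each runner individually.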

Proposal

Collect statistics from each runner, each job (and pipeline)
Display values in an overview page:
- per instance (admins)
- per group (owners)
- per project (maintainers)
Add graphs for usage over time to identify trends
Allow filtering
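
To make the "usage over time" graphs more concrete, the following sketch buckets tag usage by day, ISO week or month. The job records and the field names (`started_on`, `tags`) are hypothetical; real data would come from whatever statistics the runners and jobs end up reporting.

```python
from collections import Counter
from datetime import date


def usage_per_period(jobs, period):
    """Count jobs per (tag, time bucket) for period 'day', 'week' or 'month'."""
    counts = Counter()
    for job in jobs:
        day = job["started_on"]
        if period == "day":
            bucket = day.isoformat()
        elif period == "week":
            iso = day.isocalendar()  # (ISO year, ISO week, ISO weekday)
            bucket = f"{iso[0]}-W{iso[1]:02d}"
        elif period == "month":
            bucket = f"{day.year}-{day.month:02d}"
        else:
            raise ValueError(f"unknown period: {period}")
        for tag in job["tags"]:
            counts[(tag, bucket)] += 1
    return counts


# Hypothetical job records for the sketch.
jobs = [
    {"started_on": date(2024, 1, 1), "tags": ["big"]},
    {"started_on": date(2024, 1, 3), "tags": ["big", "gcc-9"]},
    {"started_on": date(2024, 1, 8), "tags": ["small"]},
    {"started_on": date(2024, 2, 1), "tags": ["big"]},
]

weekly = usage_per_period(jobs, "week")
monthly = usage_per_period(jobs, "month")
```

Filtering by instance, group or project could then be applied to the job list before it is passed to `usage_per_period`.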

Permissions and Security

Documentation

Availability & Testing

What does success look like, and how can we measure that?

What is the type of buyer?

Is this a cross-stage feature?

Links / references