Summary of (Waiting) Jobs and Monitoring for Runners
Problem to solve
When an on-premises GitLab installation manages multiple runners at once, it is hard to monitor them. Moreover, it is even harder to decide when to add more runner capacity or when to shift runner capacity from smaller to bigger machines (or vice versa) in a virtual environment. GitLab should provide monitoring and statistics for runners to ease administrative and business decisions.
Intended users
- Rachel (Release Manager)
- Delaney (Development Team Lead)
- Devon (DevOps Engineer)
- Sidney (Systems Administrator)
Further details
Currently, GitLab has no tool to observe:
- inactive (no heartbeat), active and idle (waiting for job) runners
- average job queue time
  - per instance
  - per project (or group)
  - per runner tag
- runner utilization per instance
  - max/average CPU per run
  - max/average RAM per run
- runner utilization per project
  - max/average CPU per run for one project (or group)
  - max/average RAM per run for one project (or group)
- runner utilization per runner tag
  - max/average CPU per run
  - max/average RAM per run
- statistics per tag
  - usage over time (per day, per week, per month)
  - used together with other tags
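To make the "average job queue time" item concrete, here is a minimal sketch of how it could be grouped per instance, per project and per runner tag. The job records, their field names (`project`, `tags`, `created_at`, `started_at`) and the sample values are assumptions made for illustration; they are not taken from an existing GitLab data model.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical job records; field names and values are illustrative assumptions.
JOBS = [
    {"project": "app", "tags": ["big"],
     "created_at": "2024-01-08T10:00:00", "started_at": "2024-01-08T10:05:00"},
    {"project": "app", "tags": ["small"],
     "created_at": "2024-01-08T10:00:00", "started_at": "2024-01-08T10:01:00"},
    {"project": "lib", "tags": ["big", "gcc"],
     "created_at": "2024-01-08T11:00:00", "started_at": "2024-01-08T11:15:00"},
]


def queue_seconds(job):
    """Seconds a job waited between creation and start."""
    created = datetime.fromisoformat(job["created_at"])
    started = datetime.fromisoformat(job["started_at"])
    return (started - created).total_seconds()


def average_queue_time(jobs, key):
    """Average queue time in seconds per group.

    `key` maps a job to the list of groups it belongs to, so a job with
    several tags counts once for each of its tags.
    """
    buckets = defaultdict(list)
    for job in jobs:
        for group in key(job):
            buckets[group].append(queue_seconds(job))
    return {group: mean(values) for group, values in buckets.items()}


per_instance = average_queue_time(JOBS, lambda job: ["instance"])
per_project = average_queue_time(JOBS, lambda job: [job["project"]])
per_tag = average_queue_time(JOBS, lambda job: job["tags"])
```

The same grouping function could serve the CPU and RAM items by swapping `queue_seconds` for a per-run resource sample, assuming runners report such values.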
Let's assume a system of 30 runners tagged by size (5x big, 10x medium, 15x small), each installed with several compiler tools. These run on 8 big servers that use VMware ESXi to virtualize the hardware. The VMs range from 2 vCPU, 4 GB RAM to 16 vCPU, 64 GB RAM.
These statistics could help to answer the following questions:
- Do we need more runner capacity (in general)?
- Do we need more RAM in runners?
- Do we need more RAM in small runners?
- Does a certain project need an exclusive big runner to reduce job queue times?
- Do we need more runners because queue time is high?
- Does a project or group of projects use big jobs, which cause swapping?
- What tags are triggered most? Should another runner be installed offering this tag?
- Are some tags rarely used (e.g. older compiler versions)?
- Which tags have long queue times?
- Can a pipeline with e.g. 3 jobs be specified to run on big -> small -> medium?
- Which runners support a certain tag?
Regarding the first point (runner states): yes, inactive runners are shown, but to see running jobs, each runner must be opened individually in the admin menu. Moreover, this view is visible only to admins and not to group owners, who therefore cannot see how many jobs in a project group are pending.
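Several of the tag questions listed above ("What tags are triggered most?", "Which runners support a certain tag?", tags "used together with other tags") reduce to simple counting once runner and job tag data is available. The following is a rough sketch using the 30-runner example; the runner names, the compiler tags `gcc-12` and `gcc-9`, and the job tag lists are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical inventory mirroring the example: 5 big, 10 medium, 15 small.
# Runner names and compiler tags are made up for illustration.
RUNNERS = (
    [{"name": f"big-{i}", "tags": {"big", "gcc-12"}} for i in range(1, 6)]
    + [{"name": f"medium-{i}", "tags": {"medium", "gcc-12"}} for i in range(1, 11)]
    + [{"name": f"small-{i}", "tags": {"small", "gcc-9"}} for i in range(1, 16)]
)

# Hypothetical tag lists of recently triggered jobs.
JOB_TAGS = [["big", "gcc-12"], ["small"], ["small", "gcc-9"], ["medium"], ["small"]]


def runners_with_tag(runners, tag):
    """Names of all runners that offer the given tag."""
    return sorted(r["name"] for r in runners if tag in r["tags"])


def tag_trigger_counts(job_tags):
    """How often each tag was requested by a job."""
    return Counter(tag for tags in job_tags for tag in tags)


def tag_pairs(job_tags):
    """How often two tags were requested together by the same job."""
    return Counter(
        pair for tags in job_tags for pair in combinations(sorted(tags), 2)
    )


counts = tag_trigger_counts(JOB_TAGS)
most_common_tag, hits = counts.most_common(1)[0]
big_runners = runners_with_tag(RUNNERS, "big")
pairs = tag_pairs(JOB_TAGS)
```

Comparing `counts` against `len(runners_with_tag(...))` per tag would give a first hint whether another runner offering a heavily requested tag is worth installing.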
Proposal
- Collect statistics from each runner, each job (and pipeline)
- Display values in an overview page:
  - per instance (admins)
  - per group (owners)
  - per project (maintainers)
- Add graphs for usage over time to identify trends
- Allow filtering
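As an illustration of the "usage over time" and "filtering" points in the proposal, the sketch below buckets jobs per ISO calendar week and optionally filters by tag. The sample data and the function name `jobs_per_week` are assumptions for this example only; a real implementation would read from whatever statistics GitLab ends up collecting.

```python
from collections import Counter
from datetime import date

# Hypothetical (job date, tag) samples.
SAMPLES = [
    (date(2024, 1, 8), "big"),
    (date(2024, 1, 9), "big"),
    (date(2024, 1, 15), "big"),
    (date(2024, 1, 16), "small"),
]


def jobs_per_week(samples, tag=None):
    """Count jobs per ISO (year, week), optionally restricted to one tag."""
    counts = Counter()
    for day, job_tag in samples:
        if tag is None or job_tag == tag:
            iso = day.isocalendar()  # (ISO year, ISO week, ISO weekday)
            counts[(iso[0], iso[1])] += 1
    return dict(sorted(counts.items()))


all_jobs = jobs_per_week(SAMPLES)
big_jobs = jobs_per_week(SAMPLES, tag="big")
```

The resulting week-to-count mapping is the kind of series a trend graph could plot; per-day or per-month buckets would follow the same pattern with a different key.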
Permissions and Security
Documentation
Availability & Testing
What does success look like, and how can we measure that?
What is the type of buyer?
Is this a cross-stage feature?
Links / references
/cc @jyavorska, @dimitrieh, @DarrenEastman, @kbychu