Improve monitoring of specific & group runners on GitLab.com
Problem to solve
Currently it's very difficult to get performance and queue information for specific and group runners on GitLab.com. When you're exclusively using specific runners (like we are) due to the high demands of your build jobs, this presents a problem. The monitoring and administration of specific & group runners need to be improved.
Further details
The CI pipelines for our Embedded Linux projects build lots of large packages for multiple target machines. These builds require at least 16GB RAM and 50GB free disk space, so they aren't suitable for running on the GitLab.com shared runners. We therefore have our own bare-metal dedicated runner (1 at the minute, likely to scale up soon) for these tasks. On this runner the builds take anywhere from 20 minutes to a couple of hours each. Due to the way CPU and disk I/O contention slows down the builds, it's actually counterproductive to run more than 2 CI jobs in parallel on our runner, so we sometimes have a queue of jobs waiting to run.
We have performance monitoring in place for the runner machine itself (using InfluxDB+Grafana), so we don't need to capture load information. The questions we're looking to answer are more to do with how many jobs are queued, how many have run recently, etc.
These are the only features I miss from TeamCity. It may be useful to look at their feature documentation for comparison: https://confluence.jetbrains.com/display/TCD18/Viewing+Agents+Workload
We're using GitLab.com at the Bronze subscription level. We don't currently have any reason to self-host GitLab, so we'd like to continue using GitLab.com. I'd therefore like to see these features available for regular users of GitLab.com, i.e. where users don't have admin access to the GitLab instance.
Proposal
- Have "owner" and "maintainer" roles for each specific or group runner. This allows us to share the administrative workload whilst controlling access.
- Provide a monitoring page for each specific or group runner which the owners + maintainers can access.
- Allow a runner to be paused/unpaused from the monitoring page to aid in maintenance and debugging of the underlying machine.
- Show the current queue on the monitoring page if possible.
- Show the current status of the runner on the monitoring page: either idle or a link to each running job.
- Show recently finished jobs on this runner on the monitoring page.
- Show usage statistics for the runner on the monitoring page. How long was it active over the last day/week/month? How many jobs have run/passed/failed? This is useful to inform when we need to scale up and deploy more runners. It's also useful to investigate failures caused by issues with a runner itself and not with the software under test.
- Allow alerts to be configured if a runner goes offline unexpectedly.
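To illustrate the kind of usage statistics meant above, here is a minimal Python sketch. It assumes each job record is a dict with `status`, `started_at` and `finished_at` fields holding ISO 8601 timestamps; how those records are obtained is left open. The names `parse_ts` and `runner_usage` are hypothetical and only serve this example.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def parse_ts(value):
    """Parse an ISO 8601 timestamp such as '2019-03-01T10:00:00Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def runner_usage(jobs, now, window):
    """Summarise the jobs that finished within `window` before `now`.

    `jobs` is a list of dicts with the assumed keys `status`,
    `started_at` and `finished_at`.
    """
    since = now - window
    counts = Counter()
    job_time = timedelta()
    for job in jobs:
        if not job.get("finished_at"):
            continue  # still queued or running, so not counted here
        finished = parse_ts(job["finished_at"])
        if not (since <= finished <= now):
            continue  # finished outside the reporting window
        counts[job["status"]] += 1
        started = parse_ts(job["started_at"]) if job.get("started_at") else finished
        # Only count the part of the job that falls inside the window.
        job_time += finished - max(started, since)
    return {
        "total": sum(counts.values()),
        "passed": counts["success"],
        "failed": counts["failed"],
        # Summed per-job time; with 2 jobs in parallel this can exceed
        # the wall-clock time the runner was busy.
        "job_seconds": job_time.total_seconds(),
    }


if __name__ == "__main__":
    example_jobs = [
        {"status": "success", "started_at": "2019-03-01T08:00:00Z",
         "finished_at": "2019-03-01T09:00:00Z"},
        {"status": "failed", "started_at": "2019-03-01T09:30:00Z",
         "finished_at": "2019-03-01T10:00:00Z"},
    ]
    now = datetime(2019, 3, 1, 12, 0, tzinfo=timezone.utc)
    print(runner_usage(example_jobs, now, timedelta(days=1)))
```

Calling `runner_usage` with `timedelta(days=1)`, `timedelta(weeks=1)` or `timedelta(days=30)` would give the day/week/month figures asked for above. Ideally the monitoring page would show these directly so that users don't have to script it themselves.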