Runner Fleet Parent Epic (FY22 - FY23)
## Status update (2023-06-08)

These are the new epics to follow for the Runner Fleet roadmap:

- https://gitlab.com/groups/gitlab-org/-/epics/9708+
- https://gitlab.com/groups/gitlab-org/-/epics/10797+

This parent epic collates all of the sub-epics that comprise the work effort for the Runner Fleet team.

## Vision

The vision is to provide administrators with a bird's-eye view and the configuration management capabilities they need to easily administer a fleet of **tens of thousands** of GitLab Runners.

**Vision walkthrough for search and filtering:** [GitLab Unfiltered - Vision for Runner Fleet Enterprise Management: Search and filter edition](https://www.youtube.com/watch?v=8nSos4pbMng)

## Guiding Principles

We will emphasize actionable insights, simplicity, ease of use, and engaging, modern tooling.

## Business outcomes

A first-class enterprise management solution for managing runners at scale will positively impact IACV. This capability could be a competitive differentiator by providing the tooling DevOps teams need to reduce the maintenance and operational costs of CI build machines.

**Note:** A fundamental assumption is that competing solutions do not enter the market with a 100% serverless, zero-friction CI build platform.

## Strategy and Alignment with Runner Core

One of the key strategic focus areas for the core Runner is continued investment in the Kubernetes executor. It enables customers to efficiently and massively scale their build agent environment on demand by running GitLab CI/CD jobs in Pods on Kubernetes. With recent feature updates, the GitLab Runner Kubernetes executor now supports Windows containers in addition to Linux.

However, due to several factors, there will continue to be demand from our users and customers for other executors. This means that the Runner Enterprise Fleet Management product development strategy must be flexible enough to accommodate both of the following:
- A Runner fleet based on Kubernetes (containerized CI/CD execution).
- Runner fleets with a mix of multiple executors, compute platforms, and architectures.

## Target user personas

- Priyanka (Platform Engineer)
- Devon (DevOps Engineer)
- Cameron (Compliance Manager)
- Sidney (Systems Administrator)

## Job to be done

- The main job we need to consider in evolving the vision is that Priyanka is responsible for administering a DevOps platform for her organization. This means ensuring that developers can build, test, and deploy software without being concerned about setting up the build or remote code execution environments. In other words, this should just work for the development teams, and work consistently and reliably.

  Even though jobs to be done theory (a framework for understanding customer behavior) is meant to [focus](https://about.gitlab.com/handbook/engineering/ux/jobs-to-be-done/deep-dive/#what-isnt-a-jtbd) on the job rather than on a specific solution, we have to narrow the aperture in the context of GitLab Runner Fleet Enterprise Management. The rationale is that the immediate goal is to address the shortcomings in the functionality and user experience of an in-market solution that is already adopted at scale.
- In addition to the main job, Priyanka is also concerned about the security and compliance aspects of administering GitLab for her organization. Even though Priyanka may not be the security or compliance lead, she will need to respond to security and compliance-related requests and may be called on to provide proof of compliance.
- Finally, in her day-to-day work, Priyanka needs to be confident that she has the tools to administer the GitLab build environment efficiently.

### User stories

The following user stories are the result of numerous customer interviews and async discussions in issues submitted by customers.
As we iterate on solutions to address these stories, some of the requirements below may be implemented in other areas of the UI.

1. When responding to a security incident or notification of a compromise related to the insecure use of an instance, group, or project runner registration token, I want to quickly reset the registration token so that the old token can no longer be used to register a runner.
1. When helping a developer or team troubleshoot issues with a CI job that a "failing" runner could cause, I need to determine runner association and ownership. This includes quickly answering these questions so that I can troubleshoot and resolve the issue: Who registered the runner? Who manages the runner? Is the runner an instance-level shared runner, or a group or project runner?
1. When viewing runners associated with a GitLab instance or group, I want to be able to make configuration changes to one or more runners, so that I can complete administrative tasks as promptly as possible.
1. When administering runners for a GitLab instance or group, I need an easy way to determine how many runners are out of date by x versions so that I can help with compliance enforcement. Note: the first step is identifying a runner that is out of date and then locating the runner.
1. When administering runners for a GitLab instance or group, I want to be able to specify that runner registration tokens be reset at a specified interval to reduce the likelihood of a compromised registration token being used to register a rogue runner.
1. When administering runners for a GitLab instance or group, I want to monitor the performance of runners and see which teams are using shared runners and group or project runners, so that I can proactively anticipate when I will need more runner capacity.
1. When checking on the performance of CI jobs in a GitLab instance, I want to see pending and running jobs in the runner job queue(s) so that I can quickly determine how long jobs may wait before a runner picks them up, and as a result, determine whether runner fleet configuration changes are immediately required to improve performance.
1. When administering runners for a GitLab instance or group, I want to test the connection for a given runner so that I can validate that the runner can request and process CI jobs. (The solution for this JTBD is likely to be available only to instance-level admins.)
1. When administering runners for a GitLab instance or group, I want to delete inactive runners in bulk.

## Additional Context

The current interface and features in the GitLab UI for administering runners have caused significant problems for customers who self-host large fleets of runners. These problems include the following:

- Individuals responsible for administering a GitLab instance or supporting development teams can spend a significant amount of time simply locating a runner that may be the cause of a failing CI job.
- **Runner sprawl** has left administrators unable to efficiently answer basic administrative and enterprise security questions: How many runners are associated with a GitLab instance? Who owns or is responsible for each runner? Are the runner versions up to date?
- Lack of clear insight into build performance, and into the factors that can influence it, means that customers have to spend a significant amount of time simply gathering data. Is pipeline performance the same, better, or worse? In the case of slower build times, is the issue related to the runner? For example, the type of instance hosting the runner could be a factor in performance for some use cases. The challenge is that GitLab does not give customers a single-pane-of-glass answer to these questions.

Also, in interviews with several customers managing GitLab or runners at enterprise scale, a common theme is that they need help simplifying the management and operation of runners. These conversations have also highlighted that we rely on additional tooling to manage the runner fleet for GitLab SaaS at the instance level. So, for enterprise management of runners at scale, we aren't [dogfooding](https://about.gitlab.com/handbook/values/#dogfooding) our own product, which is a core principle here at GitLab.

## Story map of current experience

![2021_Q3_Runner_UX_Scorecard](/uploads/3525626c87edbe59675112288f846918/2021_Q3_Runner_UX_Scorecard.png)

[View Mural](https://app.mural.co/t/gitlab0631/m/gitlab0631/1629293415498/8b4f7b95d4852709b4b35d884d33b6bc17198c42?sender=u8acaa5f2dd7f92154f687467)

This Mural will be updated with the design outcome of this epic, so we will have a visual SSOT of before/after workflows based on the issues in this epic.

## Themes

### Enhanced Search and Filtering

- **Vision walkthrough for search and filtering:** [GitLab Unfiltered - Vision for Runner Fleet Enterprise Management: Search and filter edition](https://www.youtube.com/watch?v=8nSos4pbMng)