Document complex runner scenarios
@tmaczukin recently wrote https://docs.google.com/document/d/1WYmN5oukY3DK2hPFLPkxwnuyfxES8nNPeDLMTN_KhVM/edit#heading=h.x90mgc8nj9ty (internal only) which documents the runner architecture for GitLab.com and dev.gitlab.org. It reminded me that we (probably) don't have a complex scenario like this documented. In particular, we have multiple runner managers that each autoscale hundreds of individual runners.
- One reason is that a manager can only handle so many runners before it starts to fall over. The limit is probably a function of the VM size chosen for the managers, but customers may face the same challenge.
- Another reason is redundancy. If you only have one manager, then it's a single point of failure. Having two or more managers managing a fleet of runners can be more reliable.
- For even more reliability, having managers on different cloud providers, or at least different regions, lets you survive outages.
- You could also keep a manager on a more expensive cloud as a backup, disabled or limited unless/until there's a problem with the primary, cheaper cloud.
- We also have multiple managers because we have fleets of runners with different characteristics, e.g. we have the shared runner fleet with generic capabilities, and we also have a runner fleet specifically tuned for `gitlab-ce` and `gitlab-ee`.
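
As a starting point for the docs, one manager's `config.toml` could look roughly like the sketch below. It assumes the `docker+machine` executor; the name, URL, token, limits, and machine options are all placeholders I made up for illustration, not the values used on GitLab.com.

```toml
# Sketch only: every value here is a placeholder, not a real production setting.
concurrent = 200                      # upper bound on jobs this manager runs at once

[[runners]]
  name = "shared-manager-1"           # hypothetical manager name
  url = "https://gitlab.example.com/" # placeholder GitLab instance URL
  token = "RUNNER_TOKEN"              # placeholder registration result
  executor = "docker+machine"         # manager autoscales machines for jobs
  limit = 200                         # cap on concurrent jobs for this entry
  [runners.docker]
    image = "alpine:latest"           # default image when a job sets none
  [runners.machine]
    IdleCount = 10                    # machines kept warm while idle
    IdleTime = 1800                   # seconds before an idle machine is removed
    MaxBuilds = 1                     # recreate the machine after each job
    MachineDriver = "google"          # assumed cloud driver
    MachineName = "shared-manager-1-%s"
    MachineOptions = [
      "google-project=example-project", # placeholder project
      "google-zone=us-east1-c",         # placeholder zone
    ]
```

A second manager for redundancy would presumably be a separate VM with a similar file, pointing at a different zone, region, or driver, and a specialized fleet would use its own entry with different machine options.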
Perhaps there are other reasons. Let's document them!
Also, we seem to be using the same fleet of runners across two GitLab installations. That deserves documentation as well, since many customers have several GitLab instances and might want a shared fleet.
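
One way this might be set up (an assumption on my part, not confirmed by the doc above) is a single manager with one `[[runners]]` entry per GitLab instance, each with its own `url` and `token`. The names, URLs, and tokens below are placeholders.

```toml
# Sketch only: one manager process serving two GitLab instances.
concurrent = 100                       # shared job limit across both entries

[[runners]]
  name = "fleet-for-instance-a"        # hypothetical name
  url = "https://gitlab-a.example.com/" # placeholder URL of the first instance
  token = "TOKEN_FOR_INSTANCE_A"       # placeholder
  executor = "docker+machine"

[[runners]]
  name = "fleet-for-instance-b"        # hypothetical name
  url = "https://gitlab-b.example.com/" # placeholder URL of the second instance
  token = "TOKEN_FOR_INSTANCE_B"       # placeholder
  executor = "docker+machine"
```

Each entry would still need its own `[runners.machine]` settings, and whether the autoscaled machines themselves are shared between the entries is something the documentation should spell out.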
/cc @bikebilly @axil