[meta] Steps towards CI production readiness
This is a meta issue describing the mid-term plan for making CI production-ready on GitLab.com.
The steps needed to make a full switch are:
Done in 9.3:
- Improve handling of API failures: gitlab-org/gitlab-ci-multi-runner#2098,
For 9.4:
- Move DO auth tokens to one master account: https://gitlab.com/gitlab-com/infrastructure/issues/1941,
- Create alert based on Runner's error rate and machines in removing state count: #1942 (closed),
- Set up monitoring for CI: #1266 (closed),
- Configure Digital Ocean exporter: https://gitlab.com/gitlab-com/infrastructure/issues/1658,
For 9.5:
- Automatic clean-up of unmanaged droplets on DO: #921 (closed),
- Use build image as a base image for all CI jobs: https://gitlab.com/gitlab-com/infrastructure/issues/1980 (moved from 9.4; needs to be finished),
- Introduce IDS into images: #1277,
For 10.0, 10.1, 10.2:
- Create consul server for Prometheus service discovery: #1639 (closed) (moved from 9.4),
- Create Prometheus server for monitoring: #1640 (closed),
- Prepare a Prometheus exporter for GCE machines usage: #2585 (closed),
- Resolve logging problems on GCE runners: #1931 (closed),
Next releases:
- Modify Runner's upgrade process: #1811 (closed),
- Create HA for Runner's cache/registry-mirror servers: #1982 (closed),
- Automate creation and provisioning of runners-cache-X machines: #1995 (closed),
- Finish cache servers monitoring and alerting improvements: #2116 (closed)
Once these steps are done, we will be able to:
- Use Digital Ocean and Google Compute Engine to run CI jobs,
- Use optimised base image with IDS and monitoring configured,
- Monitor CI and gracefully handle the failures,
- Use GCE
Edited by Kamil Trzciński