Implications on cost and stability of GitLab.com when introducing implicit Auto DevOps
We are currently working on adding a CI configuration that is used by everyone. You can read more about this in https://gitlab.com/gitlab-org/gitlab-ce/issues/34777. This feature allows us to bundle a single CI configuration that will be used for every project that has CI/CD Pipelines enabled and does not have its own .gitlab-ci.yml.
I checked the implications of enabling Auto DevOps implicitly on GitLab.com.
Looking at GitLab.com data, on average about 15% of pushes result in a pipeline being created. Reference data is below.
If we were to enable Auto DevOps on GitLab.com, we would start processing around 5.6 times more pipelines than we process today.
Auto DevOps would run 1 job in the minimal case (build phase only) and 6 in the maximal case (build, test, code quality, staging, canary, production). If we assume that 50% of pipelines would fail at the first job (build phase) because they lack the Dockerfile needed to build and push to the Container Registry, that 45% would stop after the test phase (build, then test and code quality, so 3 jobs), and that only 5% would actually use the DevOps features (staging, canary, production, so all 6 jobs), then on average we would run 2.15 jobs per pipeline. This means we would effectively process 12.04 times more jobs than we process today. The number of jobs processed on GitLab.com should rise by that factor, and our billing is very unfavorable: as of today we pay per hour for every job run on Shared Runners.
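The estimate above can be reproduced with a short calculation. This is only a sketch: the method names are made up for illustration, and it assumes that the 45% group runs three jobs (build, test, code quality), which is the reading that yields the 2.15 figure.

```ruby
# Scenario shares and job counts from the estimate above.
# Assumption: the "test phase" group runs 3 jobs (build, test, code quality).
def auto_devops_scenarios
  [
    { name: 'fails at build (no Dockerfile)', share: 0.50, jobs: 1 },
    { name: 'build + test + code quality',    share: 0.45, jobs: 3 },
    { name: 'full Auto DevOps pipeline',      share: 0.05, jobs: 6 }
  ]
end

# Weighted average number of jobs that one implicit pipeline would run.
def expected_jobs_per_pipeline(scenarios)
  scenarios.sum { |s| s[:share] * s[:jobs] }
end

# Combines the pipeline growth factor with the average jobs per pipeline.
def job_multiplier(scenarios, pipeline_multiplier)
  expected_jobs_per_pipeline(scenarios) * pipeline_multiplier
end

scenarios = auto_devops_scenarios
puts format('%.2f jobs per pipeline', expected_jobs_per_pipeline(scenarios))
puts format('%.2f times more jobs', job_multiplier(scenarios, 5.6))
```

Changing the shares in `auto_devops_scenarios` shows how sensitive the result is to the 50/45/5 split, which is itself a guess.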
This also introduces the need to scale the Runners infrastructure, as the current configuration of the Shared Runners Manager would be able to process at most 6x more jobs than we have today.
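To make the gap explicit, the projected load can be compared with that ceiling. This is a rough sketch with a hypothetical helper name; it treats both figures as multiples of today's job volume.

```ruby
# Ratio of projected load to the most the current setup can absorb.
# Both arguments are expressed as multiples of today's job volume.
def load_vs_capacity(projected_multiplier, capacity_multiplier)
  projected_multiplier.to_f / capacity_multiplier
end

ratio = load_vs_capacity(12.04, 6)
puts format('projected load is about %.1fx the current ceiling', ratio)
```

A ratio above 1 means the current Shared Runners Manager configuration would not keep up without scaling.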
This leads me to the conclusion that, due to the significant cost of the compute needed to support Auto DevOps when it is implicitly enabled for everyone, we need to implement additional measures to prevent overspending.
The proposed approaches:
1. Disable implicit Auto DevOps on GitLab.com.
2. Enable Auto DevOps for everyone, but disable it for a project when the first pipeline that runs fails.
The per-project configuration proposal (https://gitlab.com/gitlab-org/gitlab-ce/issues/34777#note_38530157) allows us to implement proposal 2. However, we need to be aware of a possible deficiency: if someone pushes a new repository with a lot of branches (say 100), we would run a pipeline for each branch. This would happen only for new projects. It means a yet-to-be-defined bump in the processing cost of CI jobs, because at some undefined point in the future we would have to run at least 2.15 jobs for each project on GitLab.com. With proposal 2 we may also need to be prepared for extra compute at the beginning of the 10.0 deployment: at least in the first days we would run a significant number of extra pipelines, because failing projects would not yet have been disabled.
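The behaviour of proposal 2, including the many-branches case, can be sketched in plain Ruby. Everything here is hypothetical: the hash stands in for a per-project setting and is not GitLab's actual model, and the sketch assumes that only the first finished pipeline decides whether the project gets disabled.

```ruby
# Hypothetical per-project state, not GitLab's schema.
# enabled: nil   => no explicit setting, the instance default applies
# enabled: false => implicit Auto DevOps was switched off for the project
def new_project_state
  { enabled: nil, pipelines_finished: 0 }
end

def auto_devops_active?(state, instance_default)
  state[:enabled].nil? ? instance_default : state[:enabled]
end

# Pipelines are created at push time, before any of them has finished,
# so pushing many branches at once is not limited by the rule below.
def pipelines_for_push(state, branch_count, instance_default)
  auto_devops_active?(state, instance_default) ? branch_count : 0
end

# Assumption: only the first finished pipeline can disable the project.
def record_finished_pipeline(state, succeeded)
  state[:pipelines_finished] += 1
  first = state[:pipelines_finished] == 1
  state[:enabled] = false if first && state[:enabled].nil? && !succeeded
  state
end

state = new_project_state
puts pipelines_for_push(state, 100, true) # all 100 branches get a pipeline
record_finished_pipeline(state, false)    # first pipeline fails
puts pipelines_for_push(state, 1, true)   # later pushes create none
```

The sketch shows why the rule caps the long-term cost per project but not the initial burst.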
Given that, even if we implement proposal 2 (implicit configuration enabled on GitLab.com), this option should be disabled at the beginning. We should first enable it for a test period to ensure that we don't hit infrastructure scaling problems caused by the extra CI jobs we will process after enabling the feature. If everything goes well, we could then safely enable it permanently.
This raises a good question: should this configuration option be not only an Application Setting, but also a Feature Setting stored in Flipper? @zj mentions that Flipper gives us the advantage of doing a "roll out" of the feature, enabling it for only 5-10% at the beginning.
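A percentage rollout of this kind can be illustrated with deterministic bucketing. The sketch below is not Flipper's API or its exact algorithm. It only shows the general idea, with a made-up feature name and helper names, using CRC32 from Ruby's standard library.

```ruby
require 'zlib'

# Deterministically maps a project to a bucket in 0...100.
# The feature name is mixed in so different features get different buckets.
def rollout_bucket(feature_name, project_id)
  Zlib.crc32("#{feature_name}:#{project_id}") % 100
end

# A project is included when its bucket falls below the rollout percentage,
# so raising the percentage only ever adds projects, never removes them.
def enabled_for_project?(feature_name, project_id, percentage)
  rollout_bucket(feature_name, project_id) < percentage
end

# 'implicit_auto_devops' is a hypothetical feature name.
sample = (1..10_000).count do |id|
  enabled_for_project?('implicit_auto_devops', id, 10)
end
puts "#{sample} of 10000 sample projects fall into a 10% rollout"
```

Because the bucket depends only on the feature name and project id, a project that is included at 5% stays included at 10%, which keeps the observed load growing monotonically as the rollout widens.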
Since this has significant implications for the cost of running the infrastructure, and potential implications for the performance and stability of GitLab.com, I need your opinion: @sytses @markpundsack @bikebilly @sitschner @stanhu. Please involve more people if you feel that we need them to decide which approach should be followed when introducing this feature.
Reference metrics:
> Ci::Pipeline.where('created_at > ?', 12.hour.ago).count
=> 17356
> Event.code_push.where('created_at > ?', 12.hour.ago).count
=> 109138
> Ci::Pipeline.where('created_at > ?', 1.day.ago).count
=> 32178
> Event.code_push.where('created_at > ?', 24.hour.ago).count
=> 192747
> Ci::Pipeline.where('created_at > ?', 7.day.ago).count
=> 194808
> Event.code_push.where('created_at > ?', 7.day.ago).count
=> 1305249
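The share of pushes that create a pipeline can be derived from the counts above. This is a small sketch with hypothetical method names. The second helper computes how many pushes without a pipeline exist per existing pipeline, which is one possible way to read the growth estimate and not necessarily how it was derived.

```ruby
# Counts copied from the console output above.
def reference_windows
  [
    { window: '12 hours', pipelines: 17_356,  pushes: 109_138 },
    { window: '1 day',    pipelines: 32_178,  pushes: 192_747 },
    { window: '7 days',   pipelines: 194_808, pushes: 1_305_249 }
  ]
end

# Percentage of pushes that resulted in a pipeline.
def pipeline_share(row)
  (100.0 * row[:pipelines] / row[:pushes]).round(1)
end

# Pushes that created no pipeline, per pipeline that exists today.
def additional_pipelines_factor(row)
  ((row[:pushes] - row[:pipelines]).to_f / row[:pipelines]).round(2)
end

reference_windows.each do |row|
  puts format('%-8s %4.1f%% of pushes, %.2f extra pipelines per existing one',
              row[:window], pipeline_share(row), additional_pipelines_factor(row))
end
# Shares come out as 15.9%, 16.7% and 14.9% for the three windows.
```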
Monitor Closely During Feature Flag Rollout
- total bucket size for container registry
- CPU load or response times from registry
- CI load (like we normally monitor anyway)
- time series:
SELECT COUNT(*), enabled FROM project_auto_devops GROUP BY enabled
- number of pipelines triggered by Auto DevOps. (There's a special repository_source for that, stored in the pipeline row.)