Enable Cloud Native Job Logs on production for a short period and measure the performance impacts
Currently our Cloud Native Job Logs CI Incremental Logging feature allows users to store live job logs directly in object storage, without needing an intermediate shared disk: https://docs.gitlab.com/ee/administration/job_logs.html#new-incremental-logging-architecture.
Unfortunately, due to performance problems, this is not yet enabled on GitLab.com.
We should work to resolve this so that we can remove NFS from GitLab.com and ensure this feature works at scale, as it is necessary for our cloud native architecture.
The next step is to re-enable incremental logging and determine what is causing the slowdowns: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/4667
All blocking issues must be resolved before we move forward with the rollout plan detailed below. These issues are listed in the "Is blocked by" section.
History
- 2019-10-23 13:04 UTC: This feature has been disabled because of a Redis OOM incident.
- 2019-08-16 13:14 UTC: This feature has been enabled on gitlab-org/gitlab-ce, gitlab-org/gitlab-ee and gitlab-com/www-gitlab-com to evaluate a patch.
- 2019-07-16 13:24 UTC: This feature has been disabled to investigate trace loss.
- 2019-07-12 17:15 UTC: This feature has been enabled for the third evaluation.
- 2019-04-18 07:13 UTC: This feature has been disabled to investigate trace loss.
- 2019-01-14 04:51 UTC: This feature has been enabled for the second evaluation.
History taken from https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/4667
Metrics
- Sentry errors containing "trace" keyword -> https://sentry.gitlab.net/gitlab/gitlabcom/?query=trace
- API for a build trace append endpoint -> PATCH /api/jobs/:id/trace
- API for updating a build details / status -> PUT /api/jobs/:id
- Combined dashboard for /api/jobs/* -> PUT / PATCH /api/jobs/:id
- Build details page -> GET trace.json
- Build details page -> GET raw trace
- Redis memory -> Redis Overview Dashboard
Issues
- Production change issue for test project rollout ➡ gitlab-com/gl-infra/production#2497 (closed)
- Infra issue ➡ https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11043
Rollout plan
- Deploy new code to dev.gitlab.org and gitlab.com
- Collect links to metrics that we should monitor, update this issue description with them
- Create an issue in Infra issue tracker to involve SRE that will assist with the rollout
- Enable ci_enable_live_trace feature flag on dev.gitlab.org in sandbox project
- Design a pipeline with three builds, generating 32, 128, 192, 1024 kilobytes of build logs
- Create sandbox project dev.gitlab.org/grzegorz/live-traces-sandbox
- Verify that the feature works in the sandbox project on dev.gitlab.org
- Check metrics and exceptions on dev.gitlab.org
- Create sandbox project gitlab.com/grzesiek/live-traces-sandbox
- Run pipeline in the sandbox project on gitlab.com without live traces enabled
- Enable ci_enable_live_trace for the sandbox project on gitlab.com
- Run pipeline in the sandbox project on gitlab.com with live traces enabled
- Check build details page and raw job log for each test job
- Monitor exceptions and logs for this project, see if build logs work correctly there
- Ensure that ci_enable_live_trace feature flag is enabled on dev.gitlab.org
- Design a load testing pipeline for staging.gitlab.com
- Run load tests on staging.gitlab.com and post results here
- Redesign the test and run a new one on staging.gitlab.com
- Post results from second test into this issue
- Create a production change request to replicate the same test on gitlab.com
- Run the test on gitlab.com with SRE assistance
- Post results about the test into this issue
- Post a message about the feature rollout in #whats-happening-at-gitlab Slack channel
- Enable ci_enable_live_trace for gitlab-org/gitlab
- Monitor metrics and exceptions for around 3 days
- Disable the FF globally, leave it enabled for gitlab-org/gitlab
- Post results to this issue.
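
The test pipeline in the plan needs jobs that each emit a known amount of log output. Below is a minimal sketch of such a generator, assuming Python is available in the job image. The function name, line format, and line width are hypothetical choices for illustration, not the script actually used in the sandbox projects.

```python
LINE_WIDTH = 64  # bytes per log line, including the trailing newline (arbitrary choice)


def generate_log(kilobytes: int, line_width: int = LINE_WIDTH) -> str:
    """Return exactly kilobytes * 1024 bytes of numbered ASCII log lines."""
    target = kilobytes * 1024
    lines = []
    size = 0
    counter = 0
    while size < target:
        prefix = f"line {counter:08d} "
        # Pad each line with filler so that it is line_width bytes long.
        line = prefix + "x" * (line_width - len(prefix) - 1) + "\n"
        # Truncate the final line if it would overshoot the target size.
        line = line[: target - size]
        lines.append(line)
        size += len(line)
        counter += 1
    return "".join(lines)
```

A job script could then call `print(generate_log(128), end="")` to produce a 128-kilobyte log. Numbering the lines makes it easier to spot missing or reordered chunks when comparing the build details page with the raw job log.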
Feature Toggles
Get status on gitlab.com: /chatops run feature get ci_enable_live_trace
Get status on dev.gitlab.org: /chatops run feature get ci_enable_live_trace --dev
Enable in sandbox on dev: /chatops run feature set --project=grzegorz/live-traces-sandbox ci_enable_live_trace true --dev
Enable on dev.gitlab.org: /chatops run feature set ci_enable_live_trace true --dev
Enable in sandbox on com: /chatops run feature set --project=grzesiek/live-traces-sandbox ci_enable_live_trace true
Disable in sandbox on com: /chatops run feature set --project=grzesiek/live-traces-sandbox ci_enable_live_trace false
Enable in test project on staging: /chatops run feature set --project=grzesiek/live-traces-tests ci_enable_live_trace true --staging
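
The `feature set` commands above all follow the same pattern: an optional `--project=` scope, the flag name, the value, and an optional `--dev` or `--staging` switch. The sketch below composes those strings so they can be pasted into Slack. The helper and its parameter names are hypothetical. It only builds text and does not call ChatOps or any GitLab API.

```python
from typing import Optional

FLAG = "ci_enable_live_trace"

# Environment switches as they appear in the commands listed above.
ENV_SWITCHES = {"com": "", "dev": "--dev", "staging": "--staging"}


def feature_set_command(enabled: bool, project: Optional[str] = None, env: str = "com") -> str:
    """Build the ChatOps command text that sets the flag, optionally scoped to a project."""
    parts = ["/chatops run feature set"]
    if project:
        parts.append(f"--project={project}")
    parts.append(FLAG)
    parts.append("true" if enabled else "false")
    if ENV_SWITCHES[env]:
        parts.append(ENV_SWITCHES[env])
    return " ".join(parts)
```

For example, `feature_set_command(False, "grzesiek/live-traces-sandbox")` yields the "disable in sandbox on com" command from the list above.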