
Enable Cloud Native Job Logs on production for a short period and measure the performance impacts

Our Cloud Native Job Logs feature (CI incremental logging) lets users store live job logs directly in object storage, without an intermediate shared disk: https://docs.gitlab.com/ee/administration/job_logs.html#new-incremental-logging-architecture.

Unfortunately, due to performance problems, this feature is not yet enabled on GitLab.com.

We should resolve these problems so that we can remove NFS from GitLab.com and ensure the feature works at scale, since it is necessary for our cloud native architecture.

The next step is to re-enable incremental logging and determine what is causing the slowdowns: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/4667

The rollout plan detailed below can proceed only once all blocking issues are resolved. These issues are listed in the "Is blocked by" section.

History

History taken from https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/4667

Metrics

Issues

Rollout plan

  1. Deploy new code to dev.gitlab.org and gitlab.com
  2. Collect links to metrics that we should monitor, update this issue description with them
  3. Create an issue in Infra issue tracker to involve SRE that will assist with the rollout
  4. Enable ci_enable_live_trace feature flag on dev.gitlab.org in sandbox project
  5. Design a pipeline with four builds, generating 32, 128, 192, and 1024 kilobytes of build logs
  6. Create sandbox project dev.gitlab.org/grzegorz/live-traces-sandbox
  7. Verify that the feature works in the sandbox project on dev.gitlab.org
  8. Check metrics and exceptions on dev.gitlab.org
  9. Create sandbox project gitlab.com/grzesiek/live-traces-sandbox
  10. Run pipeline in the sandbox project on gitlab.com without live traces enabled
  11. Enable ci_enable_live_trace for the sandbox project on gitlab.com
  12. Run pipeline in the sandbox project on gitlab.com with live traces enabled
  13. Check build details page and raw job log for each test job
  14. Monitor exceptions and logs for this project, see if build logs work correctly there
  15. Ensure that ci_enable_live_trace feature flag is enabled on dev.gitlab.org
  16. Design a load testing pipeline for staging.gitlab.com
  17. Run load tests on staging.gitlab.com and post results here
  18. Redesign the test and run a new one on staging.gitlab.com
  19. Post results from second test into this issue
  20. Create a production change request to replicate the same test on gitlab.com
  21. Run the test on gitlab.com with SRE assistance
  22. Post results about the test into this issue
  23. Post a message about the feature rollout in #whats-happening-at-gitlab Slack channel
  24. Enable ci_enable_live_trace for gitlab-org/gitlab
  25. Monitor metrics and exceptions for around 3 days
  26. Disable the feature flag globally, but leave it enabled for gitlab-org/gitlab
  27. Post results to this issue
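
The test pipeline in step 5 needs jobs that produce a known amount of log output. A minimal sketch of such a generator follows, assuming each job simply writes padded ASCII lines to standard output. The names `log_lines` and `emit` and the 64-byte line width are hypothetical choices for illustration and do not come from this issue.

```python
import sys

# Bytes per emitted line, including the trailing newline (arbitrary assumption).
LINE_WIDTH = 64


def log_lines(total_kib, line_width=LINE_WIDTH):
    """Yield newline-terminated ASCII lines totalling exactly total_kib * 1024 bytes."""
    remaining = total_kib * 1024
    counter = 0
    while remaining > 0:
        size = min(line_width, remaining)
        prefix = f"line {counter:08d} "
        # Pad or truncate to size - 1 characters, then append the newline.
        body = (prefix + "x" * size)[: size - 1]
        yield body + "\n"
        remaining -= size
        counter += 1


def emit(total_kib, stream=sys.stdout):
    """Write the generated lines to stream and return the number of bytes written."""
    written = 0
    for line in log_lines(total_kib):
        stream.write(line)
        written += len(line)  # ASCII only, so characters equal bytes
    return written
```

A job in the sandbox pipeline could call, for example, `emit(128)` to produce 128 kilobytes of log output. The resulting raw job log can then be compared against the expected size when checking the build details page in step 13.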

Feature Toggles

Get status on gitlab.com: /chatops run feature get ci_enable_live_trace
Get status on dev.gitlab.org: /chatops run feature get ci_enable_live_trace --dev
Enable in sandbox on dev.gitlab.org: /chatops run feature set --project=grzegorz/live-traces-sandbox ci_enable_live_trace true --dev
Enable on dev.gitlab.org: /chatops run feature set ci_enable_live_trace true --dev
Enable in sandbox on gitlab.com: /chatops run feature set --project=grzesiek/live-traces-sandbox ci_enable_live_trace true
Disable in sandbox on gitlab.com: /chatops run feature set --project=grzesiek/live-traces-sandbox ci_enable_live_trace false
Enable in test project on staging.gitlab.com: /chatops run feature set --project=grzesiek/live-traces-tests ci_enable_live_trace true --staging
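
The commands above all follow one pattern: an optional `--project=` scope, the flag name, a value, and an optional environment switch. The sketch below composes them as strings, which may help avoid typos when pasting into Slack. It only builds text and uses only the options that appear in this issue. The helper names `feature_get` and `feature_set` are hypothetical.

```python
FLAG = "ci_enable_live_trace"

# Environment switches as they appear in the commands listed in this issue.
ENV_SWITCH = {
    "gitlab.com": "",
    "dev.gitlab.org": "--dev",
    "staging.gitlab.com": "--staging",
}


def feature_get(flag=FLAG, env="gitlab.com"):
    """Build the ChatOps command that reports the flag status."""
    parts = ["/chatops run feature get", flag, ENV_SWITCH[env]]
    return " ".join(p for p in parts if p)


def feature_set(value, flag=FLAG, env="gitlab.com", project=None):
    """Build the ChatOps command that sets the flag, optionally for one project."""
    parts = ["/chatops run feature set"]
    if project:
        parts.append(f"--project={project}")
    parts += [flag, "true" if value else "false", ENV_SWITCH[env]]
    return " ".join(p for p in parts if p)


# Example: the "Enable in sandbox on dev.gitlab.org" command.
print(feature_set(True, env="dev.gitlab.org", project="grzegorz/live-traces-sandbox"))
```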
Edited by Grzegorz Bizon