[FF] `ci_ansi2json_v2` -- V2 of the CI job log ANSI->JSON parser
## Summary
This issue is to roll out [the V2 ANSI->JSON parser](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/234882) on
production, which is currently behind the `ci_ansi2json_v2` feature flag.
V2 is a Ruby-idiomatic reimplementation of `Gitlab::Ci::Ansi2json` that's
~2.3-2.5x faster and allocates ~60% less memory than V1, with byte-identical
output (verified by running the V1 spec corpus against both implementations
via shared examples).
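
The parity claim can be pictured as a differential check: feed the same inputs to both converters and collect any input whose output differs. The sketch below is illustrative only. `V1_CONVERTER`, `V2_CONVERTER`, `CORPUS`, and `diverging_inputs` are hypothetical stand-ins, not the real `Gitlab::Ci::Ansi2json` code or its spec corpus.

```ruby
require 'json'

# Hypothetical stand-ins for the two converters; the real ones live in
# Gitlab::Ci::Ansi2json and are not reproduced here.
V1_CONVERTER = ->(raw) { JSON.generate(lines: raw.split("\n")) }
V2_CONVERTER = ->(raw) { JSON.generate(lines: raw.lines.map(&:chomp)) }

# Hypothetical sample inputs; the real check uses the V1 spec corpus.
CORPUS = [
  "plain text\n",
  "\e[32mgreen\e[0m\n",
  "section_start:1:name\r\e[0Kheader\n"
].freeze

# Returns the inputs for which the two converters disagree byte-for-byte.
def diverging_inputs(corpus, v1, v2)
  corpus.reject { |raw| v1.call(raw) == v2.call(raw) }
end

divergences = diverging_inputs(CORPUS, V1_CONVERTER, V2_CONVERTER)
puts "#{divergences.size} diverging input(s)"
```

An empty result means the outputs matched for every input in the corpus; it says nothing about inputs the corpus does not cover, which is why the canary steps below still include a visual check.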
## Owners
- Most appropriate Slack channel to reach out to: `#g_pipeline-execution`
- Best individual to reach out to: @stanhu, @ajwalker
## Expectations
### What are we expecting to happen?
The job log viewer (`/<project>/-/jobs/<id>/trace.json`) returns identical
output to before, but with reduced parser CPU time and memory allocation.
The only production call site is `Ci::BuildTrace`, which the `trace.json`
controller action constructs, so the rollout's surface area is limited to that
endpoint:
- Initial loads of the legacy job log viewer
- Live trace polling while a CI job runs (each poll re-invokes the parser
on the new bytes)
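
Because there is a single call site, the flag check can sit in one place. A minimal sketch of that dispatch, assuming a per-project flag check: `FeatureStub`, `V1Converter`, `V2Converter`, and `converter_for` are hypothetical names for illustration and do not reproduce the code in `app/models/ci/build_trace.rb`.

```ruby
# Hypothetical stub of a feature-flag check, standing in for the real
# flag backend; it only tracks per-project enablement.
module FeatureStub
  @enabled = Hash.new { |hash, flag| hash[flag] = [] }

  def self.enable(flag, project)
    @enabled[flag] << project
  end

  def self.enabled?(flag, project)
    @enabled[flag].include?(project)
  end
end

# Hypothetical converter classes; only the class that gets chosen matters here.
class V1Converter; end
class V2Converter; end

# Picks the converter for a project, mirroring a flag check at the single call site.
def converter_for(project)
  FeatureStub.enabled?(:ci_ansi2json_v2, project) ? V2Converter : V1Converter
end
```

With this shape, turning the flag off sends every subsequent request back to V1 without a deploy.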
The HMAC-signed state blob is interchangeable across V1 and V2, so a
polling client whose previous request was served by V1 can resume parsing
under V2 (and vice-versa) without any client-side change.
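
To illustrate why a shared state encoder makes the blob interchangeable, here is a minimal sketch of an HMAC-signed state round trip. It is not the actual `State` class: the key, the `--` separator, the payload fields, and the helper names `sign`, `encode_state`, and `decode_state` are all assumptions made for the example.

```ruby
require 'json'
require 'openssl'

# Hypothetical signing key; a real deployment would use an application secret.
SIGNING_KEY = 'example-signing-key'

# Hex-encoded HMAC-SHA256 of the payload.
def sign(payload)
  OpenSSL::HMAC.hexdigest('SHA256', SIGNING_KEY, payload)
end

# Serializes parser state and appends an HMAC so tampering can be detected.
def encode_state(state)
  payload = [JSON.generate(state)].pack('m0') # strict Base64, no newlines
  "#{payload}--#{sign(payload)}"
end

# Returns the decoded state, or nil when the blob is malformed or the
# signature does not match.
def decode_state(blob)
  payload, signature = blob.to_s.split('--', 2)
  return nil if payload.nil? || signature.nil?
  return nil unless OpenSSL.secure_compare(sign(payload), signature)

  JSON.parse(payload.unpack1('m0'))
end
```

In this sketch, any parser that calls the same `encode_state` and `decode_state` pair accepts blobs produced by the other, which is the property the rollout relies on.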
### What can go wrong and how would we detect it?
- **Output divergence from V1**: would surface as visual differences in the
job log viewer (missing/extra styling, broken section folding, mis-rendered
text). Unlikely given byte-identical spec coverage but worth watching for
during canary.
- **HMAC state failures**: would surface as 500s on `/trace.json` polls.
V2 reuses V1's `State` class so the encoding format is unchanged, but a
regression here would be immediately visible in error rates.
- **Performance regression**: would surface as elevated `cpu_s` / `duration_s`
for `Projects::JobsController#trace`. Unexpected, but worth confirming the
benchmarked wins translate to production.
Dashboards to watch:
- [Web - Endpoint Detail](https://dashboards.gitlab.net) filtered to
`controller=Projects::JobsController, action=trace`
- Endpoint p50/p95 `duration_s`, `cpu_s`, `mem_bytes`, error rate
- Kibana: `json.controller: "Projects::JobsController" and json.action: "trace"`
for per-request inspection
## Rollout Steps
Note: Please make sure to run the chatops commands in the Slack channel that gets impacted by the command.
### Rollout on non-production environments
- Verify the MR with the feature flag is merged to `master` and has been deployed to non-production environments with `/chatops gitlab run auto_deploy status <merge-commit-of-your-feature>`
- [x] Deploy the feature flag at a percentage (recommended percentage: 50%) with `/chatops gitlab run feature set ci_ansi2json_v2 50 --actors --dev --pre --staging --staging-ref`
- [x] Monitor that the error rates did not increase (repeat with a different percentage as necessary).
- [x] Enable the feature globally on non-production environments with `/chatops gitlab run feature set ci_ansi2json_v2 true --dev --pre --staging --staging-ref`
- [x] Verify that the feature works as expected.
The best environment to validate the feature in is [`staging-canary`](https://about.gitlab.com/handbook/engineering/infrastructure/environments/#staging-canary) as this is the first environment deployed to. Make sure you are [configured to use canary](https://next.gitlab.com/).
- [ ] If the feature flag causes end-to-end tests to fail, disable the feature flag on staging to avoid blocking [deployments](https://about.gitlab.com/handbook/engineering/deployments-and-releases/deployments/).
### Specific rollout on production
For visibility, all `/chatops` commands that target production must be executed in the [`#production` Slack channel](https://gitlab.slack.com/archives/C101F3796)
and cross-posted (with the command results) to the responsible team's Slack channel.
- Ensure that the feature MRs have been deployed to both production and canary with `/chatops gitlab run auto_deploy status <merge-commit-of-your-feature>`
- [x] Enable for `gitlab-org/gitlab-runner` first as a small canary (the project the real-world benchmark fixture came from):
`/chatops gitlab run feature set --project=gitlab-org/gitlab-runner ci_ansi2json_v2 true`
- [ ] Verify that job logs render identically under V2 by browsing recent jobs in that project.
- [ ] Expand to `gitlab-org/gitlab`, `gitlab-org/gitlab-foss`, and `gitlab-com/www-gitlab-com`:
`/chatops gitlab run feature set --project=gitlab-org/gitlab,gitlab-org/gitlab-foss,gitlab-com/www-gitlab-com ci_ansi2json_v2 true`
- [ ] Monitor `Projects::JobsController#trace` endpoint metrics for ~24 hours.
### Preparation before global rollout
- [ ] Set a milestone on this rollout issue to signal when the feature flag should be enabled and removed once it is stable.
- [ ] Check if the feature flag change needs to be accompanied with a
[change management issue](https://about.gitlab.com/handbook/engineering/infrastructure-platforms/change-management/#feature-flags-and-the-change-management-process).
Cross link the issue here if it does.
- [ ] Ensure that you or a representative in development can be available for at least 2 hours after feature flag updates in production.
### Global rollout on production
- [ ] [Incrementally roll out](https://docs.gitlab.com/development/feature_flags/controls/#process) the feature on production:
  - `/chatops gitlab run feature set ci_ansi2json_v2 1 --actors`
  - `/chatops gitlab run feature set ci_ansi2json_v2 10 --actors`
  - `/chatops gitlab run feature set ci_ansi2json_v2 50 --actors`
  - `/chatops gitlab run feature set ci_ansi2json_v2 100 --actors`
  - Between every step wait for at least 15 minutes and monitor the appropriate graphs on https://dashboards.gitlab.net.
- [ ] After the feature has been 100% enabled, wait for [at least one day before releasing the feature](#release-the-feature).
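
The incremental steps rely on actor-based percentages being sticky: an actor enabled at 10% stays enabled at 50%. A minimal sketch of one way such bucketing can work, assuming a hash of the flag name and actor ID; the flag backend's actual hashing may differ, and `actor_bucket` and `enabled_for_actor?` are hypothetical helpers.

```ruby
require 'zlib'

# Deterministically maps a flag/actor pair to a bucket in 0...100.
def actor_bucket(flag_name, actor_id)
  Zlib.crc32("#{flag_name}#{actor_id}") % 100
end

# An actor is enabled when its bucket falls below the rollout percentage.
def enabled_for_actor?(flag_name, actor_id, percentage)
  actor_bucket(flag_name, actor_id) < percentage
end
```

Since each actor's bucket is fixed, raising the percentage only adds actors and never removes any, so a project already served by V2 keeps being served by V2 as the rollout widens.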
### Release the feature
After the feature has been [deemed stable](https://about.gitlab.com/handbook/product-development-flow/feature-flag-lifecycle/#including-a-feature-behind-feature-flag-in-the-final-release),
the [clean up](https://docs.gitlab.com/development/feature_flags/controls/#cleaning-up)
should be done as soon as possible to permanently enable the feature and reduce
complexity in the codebase.
- [ ] Create a merge request to clean up the V1 implementation. The MR should:
  - Move the contents of `lib/gitlab/ci/ansi2json/v2/` up to `lib/gitlab/ci/ansi2json/`, replacing the V1 files.
  - Drop the `V2` namespace: `V2::Converter` -> `Converter`, `V2::AnsiEvaluator` -> `AnsiEvaluator` (replacing `Parser` + `Style`), `V2::Line` -> `Line`, `V2::State` -> `State`.
  - Drop the `it_behaves_like 'an ansi2json converter'` describe for `V2` in `spec/lib/gitlab/ci/ansi2json_spec.rb` (only one describe is needed).
  - Remove the FF check in `app/models/ci/build_trace.rb#converter` and call `Gitlab::Ci::Ansi2json.convert` directly.
  - Delete `config/feature_flags/development/ci_ansi2json_v2.yml`.
- [ ] Ensure that the cleanup MR has been included in the release package.
- [ ] Once the cleanup MR has been deployed to production, clean up the feature flag from all environments by running this chatops command in the `#production` channel: `/chatops gitlab run feature delete ci_ansi2json_v2 --dev --pre --staging --staging-ref --production`
- [ ] Close this rollout issue.
## Rollback Steps
- [ ] This feature can be disabled on production by running the following Chatops command:
```
/chatops gitlab run feature set ci_ansi2json_v2 false
```
- [ ] Disable the feature flag on non-production environments:
```
/chatops gitlab run feature set ci_ansi2json_v2 false --dev --pre --staging --staging-ref
```
- [ ] Delete feature flag from all environments:
```
/chatops gitlab run feature delete ci_ansi2json_v2 --dev --pre --staging --staging-ref --production
```