[FF] `pipeline_analytics_siphon` -- Route PipelineAnalytics reads through siphon_p_ci_pipelines
## Summary
This issue tracks the production rollout of [the feature](https://gitlab.com/gitlab-org/gitlab/-/issues/598440),
which is currently behind the `pipeline_analytics_siphon` feature flag.
When enabled, the `PipelineAnalytics` GraphQL field (Project + Group) reads from
the Siphon-replicated `siphon_p_ci_pipelines` table via
`ClickHouse::Finders::Ci::SiphonPipelinesFinder` (argMax dedup pattern) instead
of the `ci_finished_pipelines_{hourly,daily}` materialized views.
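
As a rough illustration of the argMax dedup pattern mentioned above, the following Python sketch mimics the idea in memory: keep only the most recently replicated version of each pipeline row, then drop rows whose latest version is soft-deleted. The row shape and the `latest_live_rows` helper are hypothetical, and applying the delete filter after picking the latest version is an assumption. This is not the finder's actual SQL.

```python
from datetime import datetime


def latest_live_rows(rows):
    """Keep the newest replicated version per pipeline id, then drop soft-deleted rows."""
    latest = {}
    for row in rows:
        current = latest.get(row["id"])
        # Same idea as argMax over _siphon_replicated_at: the newest version wins.
        if current is None or row["_siphon_replicated_at"] > current["_siphon_replicated_at"]:
            latest[row["id"]] = row
    # Assumption: the delete filter applies to the latest version of each row.
    return [row for row in latest.values() if not row["_siphon_deleted"]]


# Hypothetical sample data: two versions each of pipelines 1 and 2.
rows = [
    {"id": 1, "status": "running", "_siphon_replicated_at": datetime(2024, 1, 1, 10, 0), "_siphon_deleted": False},
    {"id": 1, "status": "success", "_siphon_replicated_at": datetime(2024, 1, 1, 10, 5), "_siphon_deleted": False},
    {"id": 2, "status": "failed", "_siphon_replicated_at": datetime(2024, 1, 1, 10, 0), "_siphon_deleted": False},
    {"id": 2, "status": "failed", "_siphon_replicated_at": datetime(2024, 1, 1, 10, 10), "_siphon_deleted": True},
]

live = latest_live_rows(rows)
```

In this sample, pipeline 1 is counted once with its latest status, and pipeline 2 is excluded because its latest version is marked deleted.
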
## Owners
- Most appropriate Slack channel to reach out to: `#g_ci-platform`
- Best individual to reach out to: @narendran-kannan
## Expectations
### What are we expecting to happen?
- `/-/pipelines/charts` (Project and Group) continues to render aggregate counts,
per-status counts, and p50/p95/p99 duration statistics with parity to the MV
path within a small expected delta (raw `started_at` filtering vs hour-bucketed
`started_at_bucket`).
- One bespoke sync pipeline (`Ci::ClickHouse::DataIngestion::FinishedPipelinesSyncService`)
becomes redundant once the rollout is complete, freeing up Sidekiq capacity and
removing a CSV-based ingestion path.
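
The small expected delta between the two paths can be pictured with a minimal Python sketch. It assumes both paths apply a half-open `[from_time, to_time)` window, one against the raw `started_at` and the other against the hour-truncated bucket. The function names and sample timestamps are hypothetical.

```python
from datetime import datetime


def hour_bucket(ts):
    """Truncate a timestamp to the start of its hour, like an hourly bucket column."""
    return ts.replace(minute=0, second=0, microsecond=0)


def select_raw(started_ats, from_time, to_time):
    """Filter on the raw timestamp (the Siphon path, as described in this issue)."""
    return [ts for ts in started_ats if from_time <= ts < to_time]


def select_bucketed(started_ats, from_time, to_time):
    """Filter on the hour-truncated timestamp (the MV path, as described in this issue)."""
    return [ts for ts in started_ats if from_time <= hour_bucket(ts) < to_time]


# Hypothetical pipelines and a window that does not start or end on the hour.
started_ats = [
    datetime(2024, 1, 1, 10, 45),
    datetime(2024, 1, 1, 11, 15),
    datetime(2024, 1, 1, 12, 40),
]
from_time = datetime(2024, 1, 1, 10, 30)
to_time = datetime(2024, 1, 1, 12, 30)

raw = select_raw(started_ats, from_time, to_time)
bucketed = select_bucketed(started_ats, from_time, to_time)
```

Under these assumptions the raw filter keeps the 10:45 and 11:15 pipelines, while the bucketed filter keeps 11:15 and 12:40. Only rows near the window edges differ, which is why the delta should stay small.
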
### What can go wrong and how would we detect it?
- **Siphon query latency on large groups.** `siphon_p_ci_pipelines` is row-level
(no pre-aggregation), so 90+ day group queries read significantly more bytes
than the daily MV. Detection: PipelineAnalytics request latency dashboards;
GraphQL P95 latency for `pipelineAnalytics` field.
- **`source` enum translation bug.** Siphon stores `source` as a `Nullable(Int64)`
enum value; the finder translates the Ruby symbol via `Ci::Pipeline.sources`. Drift
between the two would silently return zero rows for source-filtered queries. Detection:
comparing aggregate counts between the two paths on a sample container.
- **Boundary semantic difference at hour boundaries.** Siphon filters on raw
`started_at`; the MV uses hour-truncated `started_at_bucket`. This is intentional
(Siphon is more accurate), but customer-facing numbers may shift by a small
fraction at the window boundaries.
- **ReplacingMergeTree dedup correctness.** Finder uses
`argMax(_siphon_replicated_at)` to pick the latest version of each row, and
filters `_siphon_deleted = false`. A regression here would inflate counts.
Detection: spot-check known soft-deleted pipelines.
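
The count comparison used for detection above could be scripted along these lines. This is a minimal sketch, assuming the per-status counts from each path have already been collected by hand for the same container and window. The `count_mismatches` helper, the 2% tolerance, and the sample numbers are hypothetical.

```python
def count_mismatches(mv_counts, siphon_counts, tolerance=0.02):
    """Return {key: (mv, siphon)} where the relative difference exceeds the tolerance."""
    mismatches = {}
    for key in sorted(set(mv_counts) | set(siphon_counts)):
        mv = mv_counts.get(key, 0)
        siphon = siphon_counts.get(key, 0)
        baseline = max(mv, siphon)
        if baseline == 0:
            continue  # Both paths report zero, nothing to compare.
        if abs(mv - siphon) / baseline > tolerance:
            mismatches[key] = (mv, siphon)
    return mismatches


# Hypothetical counts copied from the two paths for the same window.
mv_counts = {"success": 1000, "failed": 200, "canceled": 50}
siphon_counts = {"success": 1004, "failed": 200, "canceled": 0}

mismatches = count_mismatches(mv_counts, siphon_counts)
```

A small boundary-driven shift such as `success` stays within tolerance, while a count that collapses to zero on one path (the symptom a broken `source` translation would produce) is flagged.
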
Relevant dashboards:
- GraphQL field latency: TBD
- ClickHouse query latency: console.clickhouse.com
## Rollout Steps
Note: Please make sure to run the chatops commands in the Slack channel that gets impacted by the command.
### Rollout on non-production environments
- Verify the MR with the feature flag is merged to `master` and has been deployed to non-production environments with `/chatops gitlab run auto_deploy status <merge-commit-of-your-feature>`
- [x] Deploy the feature flag at a percentage (recommended percentage: 50%) with `/chatops gitlab run feature set pipeline_analytics_siphon 50 --actors --dev --pre --staging --staging-ref`
- [x] Monitor that the error rates did not increase (repeat with a different percentage as necessary).
- [x] Enable the feature globally on non-production environments with `/chatops gitlab run feature set pipeline_analytics_siphon true --dev --pre --staging --staging-ref`
- [x] Verify that the feature works as expected.
The best environment to validate the feature in is [`staging-canary`](https://about.gitlab.com/handbook/engineering/infrastructure/environments/#staging-canary) as this is the first environment deployed to. Make sure you are [configured to use canary](https://next.gitlab.com/).
- [ ] If the feature flag causes end-to-end tests to fail, disable the feature flag on staging to avoid blocking [deployments](https://about.gitlab.com/handbook/engineering/deployments-and-releases/deployments/).
  - See [`#e2e-run-staging` Slack channel](https://gitlab.enterprise.slack.com/archives/CBS3YKMGD) and look for the following messages:
    - test kicked off: `Feature flag pipeline_analytics_siphon has been set to true on **gstg**`
    - test result: `This pipeline was triggered due to toggling of pipeline_analytics_siphon feature flag`
If you encounter end-to-end test failures and are unable to diagnose them, you may reach out to the [`#s_developer_experience` Slack channel](https://gitlab.enterprise.slack.com/archives/C07TWBRER7H) for assistance. Note that end-to-end test failures on `staging-ref` [don't block deployments](https://about.gitlab.com/handbook/engineering/infrastructure/environments/staging-ref/#how-to-use-staging-ref).
### Before production rollout
- [ ] If the change is significant and you want to announce it in [#whats-happening-at-gitlab](https://gitlab.enterprise.slack.com/archives/C0259241C), it is best to do so before the rollout to `gitlab-org/gitlab-com`.
### Specific rollout on production
For visibility, all `/chatops` commands that target production must be executed in the [`#production` Slack channel](https://gitlab.slack.com/archives/C101F3796)
and cross-posted (with the command results) to the responsible team's Slack channel.
The flag uses a [container actor](https://docs.gitlab.com/development/feature_flags/#feature-actors) (Project or Group), so the most natural enablement granularity is by group.
- Ensure that the feature MRs have been deployed to both production and canary with `/chatops gitlab run auto_deploy status <merge-commit-of-your-feature>`
- [x] Enable for `gitlab-org/gitlab` and dogfood: `/chatops gitlab run feature set --project=gitlab-org/gitlab pipeline_analytics_siphon true`
- [x] Verify on `https://gitlab.com/gitlab-org/gitlab/-/pipelines/charts` that the numbers match what the MV path would return for the same window.
- [ ] Expand to `gitlab-org`: `/chatops gitlab run feature set --group=gitlab-org pipeline_analytics_siphon true`
- [ ] Expand to `gitlab-com`: `/chatops gitlab run feature set --group=gitlab-com pipeline_analytics_siphon true`
### Preparation before global rollout
- [ ] Set a milestone on this rollout issue to signal when the feature flag should be enabled and removed once it is stable.
- [ ] Check if the feature flag change needs to be accompanied with a
[change management issue](https://about.gitlab.com/handbook/engineering/infrastructure-platforms/change-management/#feature-flags-and-the-change-management-process).
Cross link the issue here if it does.
- [ ] Ensure that you or a representative in development can be available for at least 2 hours after feature flag updates in production.
If a different developer will be covering, or an exception is needed, please inform the oncall SRE by using the `@sre-oncall` Slack alias.
- [ ] Ensure that documentation exists for the feature, and the [version history text](https://docs.gitlab.com/development/documentation/feature_flags/#add-history-text) has been updated.
- [ ] Ensure that any breaking changes have been announced following the [release post process](https://about.gitlab.com/handbook/marketing/blog/release-posts/#deprecations-removals-and-breaking-changes) to ensure GitLab customers are aware.
- [ ] Notify the [`#support_gitlab-com` Slack channel](https://gitlab.slack.com/archives/C4XFU81LG) and your team channel ([more guidance when this is necessary in the dev docs](https://docs.gitlab.com/development/feature_flags/controls/#communicate-the-change)).
- [ ] If this flag is or may be queried by external API consumers (for example, IDE extensions, Duo CLI, or CI integrations), follow the [external API consumer guidance](https://docs.gitlab.com/development/feature_flags/#do-not-use-feature-flags-in-external-api-consumers) and ensure a fail-open mechanism is in place before the rollout milestone is finalised.
### Global rollout on production
For visibility, all `/chatops` commands that target production must be executed in the [`#production` Slack channel](https://gitlab.slack.com/archives/C101F3796)
and cross-posted (with the command results) to the responsible team's Slack channel.
- [x] [Incrementally roll out](https://docs.gitlab.com/development/feature_flags/controls/#process) the feature on production.
  - Example: `/chatops gitlab run feature set pipeline_analytics_siphon <rollout-percentage> --actors`.
  - Between every step wait for at least 15 minutes and monitor the appropriate graphs on https://dashboards.gitlab.net.
- [ ] After the feature has been 100% enabled, wait for [at least one day before releasing the feature](#release-the-feature).
### Release the feature
After the feature has been [deemed stable](https://about.gitlab.com/handbook/product-development-flow/feature-flag-lifecycle/#including-a-feature-behind-feature-flag-in-the-final-release),
the [clean up](https://docs.gitlab.com/development/feature_flags/controls/#cleaning-up)
should be done as soon as possible to permanently enable the feature and reduce
complexity in the codebase.
You can either [create a follow-up issue for Feature Flag Cleanup](https://gitlab.com/gitlab-org/gitlab/-/issues/new?description_template=Feature%20Flag%20Cleanup)
or use the checklist below in this same issue.
- [ ] Create a merge request to remove the `pipeline_analytics_siphon` feature flag. Ask for review/approval/merge as usual. The MR should include the following changes:
  - Remove all references to the feature flag from the codebase.
  - Remove the YAML definitions for the feature from the repository.
  - Drop the dual-path `clickhouse_model` branch in `CollectPipelineAnalyticsServiceBase`.
  - Remove the per-path `let` overrides and the `[true, false].each` loops in:
    - `spec/services/ci/collect_aggregate_pipeline_analytics_service_spec.rb`
    - `spec/services/ci/collect_time_series_pipeline_analytics_service_spec.rb`
    - `spec/requests/api/graphql/project/project_pipeline_analytics_spec.rb`
    - `spec/requests/api/graphql/group/group_pipeline_analytics_spec.rb`
  - Drop the `ci_finished_pipelines_{hourly,daily}` models (or schedule their removal in a follow-up).
  - File a follow-up to retire `Ci::ClickHouse::DataIngestion::FinishedPipelinesSyncService` and the two `ci_finished_pipelines_sync_*_workers` ops flags.
- [ ] Ensure that the cleanup MR has been included in the release package.
If the merge request was deployed before [the monthly release was tagged](https://about.gitlab.com/handbook/engineering/releases/#self-managed-releases-1),
the feature can be officially announced in a release blog post: `/chatops gitlab run release check <merge-request-url> <milestone>`
- [ ] Close [the feature issue](https://gitlab.com/gitlab-org/gitlab/-/issues/598440) to indicate the feature will be released in the current milestone.
- [ ] Once the cleanup MR has been deployed to production, clean up the feature flag from all environments by running this chatops command in the `#production` channel: `/chatops gitlab run feature delete pipeline_analytics_siphon --dev --pre --staging --staging-ref --production`
- [ ] Close this rollout issue.
## Rollback Steps
- [ ] This feature can be disabled on production by running the following Chatops command:
```
/chatops gitlab run feature set pipeline_analytics_siphon false
```
- [ ] Disable the feature flag on non-production environments:
```
/chatops gitlab run feature set pipeline_analytics_siphon false --dev --pre --staging --staging-ref
```
- [ ] Delete the feature flag from all environments:
```
/chatops gitlab run feature delete pipeline_analytics_siphon --dev --pre --staging --staging-ref --production
```