Fix Failure rate and Success rate denominator on CI/CD analytics page

What does this MR do and why?

Fixes the Failure rate and Success rate on the project CI/CD analytics page (<project>/-/pipelines/charts). Previously, both rates used a denominator that included canceled and skipped runs, which silently diluted them. A real example from gitlab-org/gitlab showed Failure rate 7% + Success rate 91% = 98%; the missing 2% was the canceled/skipped slice absorbed into the denominator.

After this MR:

  • SUCCESS and FAILED rates are computed against success + failed (conclusive outcomes only). The two rates now sum to ~100% whenever at least one conclusive run exists, regardless of how many OTHER jobs/pipelines there are.
  • OTHER rate (canceled/skipped) keeps the total denominator — "% of all runs that were canceled" only makes sense relative to the total. This preserves the CANCELED_RATE_* sort order semantics for external API consumers.
  • A "How this is calculated?" info-o popover is added next to the Failure rate KPI and the Failure rate (%) column header in the Jobs panel so users see the formula in context. UX approved here.
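
The denominator rule in the bullets above can be sketched in plain Ruby. This is an illustration only, not the ClickHouse aggregation from this MR: the `rate_for` helper, the `:other` key, and the two-decimal rounding are assumptions made for the example.

```ruby
# Hypothetical helper illustrating the per-status denominator rule.
# counts: Hash with :success, :failed, and :other (canceled/skipped) counts.
def rate_for(status, counts)
  conclusive = counts[:success] + counts[:failed]
  total      = conclusive + counts[:other]

  # SUCCESS and FAILED use conclusive outcomes only; OTHER keeps the total.
  denominator = status == :other ? total : conclusive
  return nil if denominator.zero? # no data for this rate (rendered as "-")

  (counts[status] * 100.0 / denominator).round(2)
end

# Counts taken from the "rate-balanced" case in the validation script.
counts = { success: 5, failed: 5, other: 2 }
puts rate_for(:success, counts) # 50.0
puts rate_for(:failed, counts)  # 50.0
puts rate_for(:other, counts)   # 16.67
```

With these counts the success and failed rates sum to 100 even though two runs were canceled, while the OTHER rate is still expressed against all 12 runs.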

The fix lives in three layers:

  1. Backend (Jobs panel rates): lib/click_house/finders/ci/concerns/finished_builds_aggregations.rb: build_rate_aggregate now picks the right denominator per status. When success + failed = 0, ClickHouse returns NaN, which serializes to JSON null, and the frontend renders -.
  2. Frontend (Pipelines KPI strip): pipelines_stats.vue: computes successCount + failedCount client-side as the rate denominator (there is no rate GraphQL field for pipeline analytics).
  3. Frontend (Jobs panel tooltip): job_analytics_table.vue: the per-cell tooltip N / M uses the same success + failed denominator so the displayed fraction matches the displayed rate.

The GraphQL CiJobAnalyticsStatistics.rate description is updated to document the new semantics.
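
The relationship between a displayed rate and its N / M tooltip can be sketched as follows. The real logic lives in Vue components; this Ruby version is only an illustration, and the `rate_cell` helper, its return shape, and the choice to omit the tooltip when there is no data are assumptions.

```ruby
# Hypothetical sketch: derive a rate cell and its "N / M" tooltip from the
# same success + failed denominator so the fraction matches the percentage.
def rate_cell(count, success:, failed:)
  denominator = success + failed

  # No conclusive outcomes: the rate is undefined and shown as "-".
  return { text: '-', tooltip: nil } if denominator.zero?

  percent = (count * 100.0 / denominator).round(2)
  { text: "#{percent}%", tooltip: "#{count} / #{denominator}" }
end

p rate_cell(2, success: 8, failed: 2) # failure cell for "rate-80-success"
p rate_cell(0, success: 0, failed: 0) # "only-canceled-job"
```

Deriving both strings from one denominator is what keeps the tooltip from contradicting the percentage next to it.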

References

Screenshots or screen recordings

Before                                  After
Screenshot_2026-05-19_at_10.12.36_PM    Screenshot_2026-05-19_at_7.32.27_PM
Screenshot_2026-05-19_at_10.14.58_PM    image

How to set up and validate locally

Paste the script below into a Rails console. Edit project_path to point at any non-empty project in your GDK (and optionally test_ref to pick a branch name to filter by on the page). The script creates pipelines and jobs covering every interesting rate case, syncs them to ClickHouse, and prints the URL to open.

Validation script
# ------------------------------------------------------------------------------
# Edit these before running
# ------------------------------------------------------------------------------
project_path = 'group/project'                          # any non-empty project
test_ref     = "rate-fix-validation-#{Time.now.to_i}"   # branch ref for filtering

# ------------------------------------------------------------------------------
# Helpers
# ------------------------------------------------------------------------------
section = ->(title) { puts; puts "=" * 80; puts title; puts "=" * 80 }
info    = ->(message) { puts "  #{message}" }

create_builds = lambda do |count:, status:, pipeline:, stage:, name:, base_time:|
  Array.new(count) do |i|
    FactoryBot.create(
      :ci_build, status,
      project: pipeline.project, pipeline: pipeline, ci_stage: stage, name: name,
      started_at: base_time + i.seconds, finished_at: base_time + (i + 1).seconds
    )
  end
end

create_build_sync_events = lambda do |builds|
  builds.each do |build|
    next if build.finished_at.nil? # skipped builds never sync

    Ci::FinishedBuildChSyncEvent.upsert(
      { build_id: build.id, project_id: build.project_id, build_finished_at: build.finished_at },
      unique_by: [:build_id, :partition]
    )
  end
end

create_pipeline_sync_event = lambda do |pipeline|
  next if pipeline.finished_at.nil?

  Ci::FinishedPipelineChSyncEvent.upsert(
    {
      pipeline_id: pipeline.id,
      pipeline_finished_at: pipeline.finished_at,
      project_namespace_id: pipeline.project.project_namespace_id
    },
    unique_by: [:pipeline_id, :partition]
  )
end

# ------------------------------------------------------------------------------
# 1. Setup
# ------------------------------------------------------------------------------
section.call "Setup"

project = Project.find_by_full_path(project_path) || raise("Project not found: #{project_path}")
info.call "Using project: #{project.full_path} (id=#{project.id})"

raise "ClickHouse is not configured." unless Gitlab::ClickHouse.configured?

settings = ::Gitlab::CurrentSettings.current_application_settings
unless settings.use_clickhouse_for_analytics?
  info.call "Enabling use_clickhouse_for_analytics application setting"
  settings.update!(use_clickhouse_for_analytics: true)
end

# Flush pending namespace sync events into Ci::NamespaceMirror.
Namespace.all.flat_map(&:sync_events).each { |e| ::Ci::NamespaceMirror.sync!(e) }

# ------------------------------------------------------------------------------
# 2. Build out a pipeline with jobs covering each interesting rate case
# ------------------------------------------------------------------------------
section.call "Creating jobs to cover each rate case"

# Place data well inside the default 7-day analytics window. The page's
# toTime is UTC midnight of today, so anything from today is excluded.
base_time = 1.day.ago.utc

info.call "Using test ref: #{test_ref}"

common_attrs = {
  project: project, ref: test_ref, source: :push,
  committed_at: base_time - 2.minutes, started_at: base_time - 1.minute, duration: 60
}

successful_pipeline = FactoryBot.create(:ci_pipeline, :success,  **common_attrs, finished_at: base_time)
failed_pipeline     = FactoryBot.create(:ci_pipeline, :failed,   **common_attrs, finished_at: base_time)
canceled_pipeline   = FactoryBot.create(:ci_pipeline, :canceled, **common_attrs, finished_at: base_time)

build_stage = FactoryBot.create(:ci_stage, pipeline: successful_pipeline, project: project, name: 'build')
test_stage  = FactoryBot.create(:ci_stage, pipeline: successful_pipeline, project: project, name: 'test')

cases = [
  # name                              status     count  stage
  ['rate-100-success',    :success,  5, build_stage],   # 100% success, 0% fail
  ['rate-80-success',     :success,  8, build_stage],
  ['rate-80-success',     :failed,   2, build_stage],   # paired -> 80/20
  ['rate-balanced',       :success,  5, test_stage],
  ['rate-balanced',       :failed,   5, test_stage],
  ['rate-balanced',       :canceled, 2, test_stage],    # ignored by new denominator
  ['rate-90-failed',      :success,  1, test_stage],
  ['rate-90-failed',      :failed,   9, test_stage],
  ['rate-100-failed',     :failed,   5, build_stage],   # 0% success, 100% fail
  ['only-canceled-job',   :canceled, 5, test_stage],    # both rates -> '-'
  ['canceled-heavy-job',  :success,  1, test_stage],
  ['canceled-heavy-job',  :canceled, 8, test_stage]     # old: ~11% success, new: 100%
]

builds = []
cases.each do |(name, status, count, stage)|
  pipeline = case status
             when :success  then successful_pipeline
             when :failed   then failed_pipeline
             when :canceled then canceled_pipeline
             end
  created = create_builds.call(
    count: count, status: status, pipeline: pipeline, stage: stage, name: name, base_time: base_time
  )
  builds.concat(created)
  info.call "  #{name.ljust(20)} x#{count.to_s.rjust(2)} (#{status})"
end

# Skipped build to demonstrate the producer-side gap (no finished_at, no sync event).
builds << FactoryBot.create(
  :ci_build, :skipped,
  project: project, pipeline: successful_pipeline, ci_stage: test_stage, name: 'skipped-never-syncs'
)
info.call "  skipped-never-syncs   x 1 (skipped) - should NOT appear in Jobs panel"

# ------------------------------------------------------------------------------
# 3. Sync to ClickHouse
# ------------------------------------------------------------------------------
section.call "Syncing to ClickHouse"

info.call "Creating build sync events for #{builds.count(&:finished_at)} finished builds"
create_build_sync_events.call(builds)

info.call "Creating pipeline sync events"
[successful_pipeline, failed_pipeline, canceled_pipeline].each { |p| create_pipeline_sync_event.call(p) }

info.call "Running ClickHouse::DataIngestion::CiFinishedBuildsSyncService"
build_result = ClickHouse::DataIngestion::CiFinishedBuildsSyncService.new.execute
info.call "  -> #{build_result.payload.except(:worker_index, :total_workers, :mode).inspect}"

info.call "Running Ci::ClickHouse::DataIngestion::FinishedPipelinesSyncService"
pipeline_result = Ci::ClickHouse::DataIngestion::FinishedPipelinesSyncService.new.execute
info.call "  -> #{pipeline_result.payload.except(:worker_index, :total_workers, :mode).inspect}"

# ------------------------------------------------------------------------------
# 4. Print expected values and the URL to open
# ------------------------------------------------------------------------------
section.call "Expected values on the page"

puts <<~EXPECTED
  KPI strip (Pipelines):
    Total pipeline runs:  3 (success + failed + canceled)
    Failure rate:         50%   (1 failed / (1 success + 1 failed); canceled excluded)
    Success rate:         50%
    -> info-o popover next to "Failure rate" with the formula tooltip.
    -> The two rates should sum to ~100% (was 33% + 33% = 67% under the old bug).

  Jobs panel:
    rate-100-success       5/5 success   Failure rate 0%      Success rate 100%
    rate-80-success        8/10 success  Failure rate 20%     Success rate 80%
    rate-balanced          5/10 success  Failure rate 50%     Success rate 50%
                           (the 2 canceled are excluded from the denominator;
                            old bug would have shown 41.67% / 41.67%)
    rate-90-failed         1/10 success  Failure rate 90%     Success rate 10%
    rate-100-failed        0/5 success   Failure rate 100%    Success rate 0%
    only-canceled-job      -             Failure rate -       Success rate -
                           (0 success + 0 failed -> NaN -> null -> '-')
    canceled-heavy-job     1/1 success   Failure rate 0%      Success rate 100%
                           (old bug would have shown ~0% / ~11%)
    skipped-never-syncs    NOT PRESENT (skipped builds are not synced)

    -> info-o popover next to "Failure rate (%)" column header with formula.
    -> Hover each rate cell: tooltip shows "<count> / (success+failed)" matching
       the displayed rate, not "<count> / total".
EXPECTED

section.call "Open this URL to verify"
puts "#{Gitlab.config.gitlab.url}/#{project.full_path}/-/pipelines/charts?branch=#{test_ref}"
puts "  (URL includes ?branch=#{test_ref} so the page filters to just this test data.)"

# ------------------------------------------------------------------------------
# 5. Cleanup
# ------------------------------------------------------------------------------
section.call "Cleaning up"
[successful_pipeline, failed_pipeline, canceled_pipeline].each(&:destroy!)
info.call "Pipelines destroyed. Note: ClickHouse rows are not removed by this script."

puts
puts "Done."
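
As a cross-check that needs neither Rails nor ClickHouse, the expected Jobs panel values can be recomputed from the counts in the `cases` table. This is a standalone sketch: `JOB_COUNTS` and `percent` are hypothetical names, and the two-decimal rounding is assumed from the 41.67% figure quoted in the script.

```ruby
# Job counts copied from the `cases` table in the validation script.
JOB_COUNTS = {
  'rate-100-success'   => { success: 5, failed: 0, canceled: 0 },
  'rate-80-success'    => { success: 8, failed: 2, canceled: 0 },
  'rate-balanced'      => { success: 5, failed: 5, canceled: 2 },
  'rate-90-failed'     => { success: 1, failed: 9, canceled: 0 },
  'rate-100-failed'    => { success: 0, failed: 5, canceled: 0 },
  'only-canceled-job'  => { success: 0, failed: 0, canceled: 5 },
  'canceled-heavy-job' => { success: 1, failed: 0, canceled: 8 }
}.freeze

# Percentage rounded to two decimals, or nil when the denominator is zero.
def percent(count, denominator)
  denominator.zero? ? nil : (count * 100.0 / denominator).round(2)
end

JOB_COUNTS.each do |name, c|
  conclusive = c[:success] + c[:failed]
  total      = conclusive + c[:canceled]

  # Each pair is [failure rate, success rate].
  old_rates = [percent(c[:failed], total), percent(c[:success], total)]
  new_rates = [percent(c[:failed], conclusive), percent(c[:success], conclusive)]

  puts format('%-20s old: %-16s new: %s', name, old_rates.inspect, new_rates.inspect)
end
```

A nil in the "new" column corresponds to a "-" on the page, and the "old" column reproduces the diluted rates quoted in the expected-values text, such as 41.67 for rate-balanced.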

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Narendran
