Skip to content

Add jobAnalytics GraphQL API to fetch data on our jobs

Description

In #453956 (closed) we describe a adding a new panel with job metrics.

The panel should be fed to with information from our ClickHouse data source via our GraphQL API.

Motivation

Our ClickHouse analytics data source can provide powerful

We can use the dimension and metric concept to allow API users to build different queries that can be presented in different visualizations.

Technical proposal

Our jobs analytics would be based on the data in the ci_finished_builds ClickHouse table, our API should be able to query this table's dimensions and metrics like this:

SELECT
    -- dimensions
    name as jobName, 
    stage,

    -- metrics
    count(*) as count,
    avg(duration) as avgDuration,
    quantile(0.95)(duration) as p95Duration,
    countIf(status = 'failed') as failedCount,
    countIf(status = 'success') as successCount
FROM ci_finished_builds
WHERE name like 'job%'
GROUP BY name, stage
ORDER BY avgDuration desc
LIMIT 10;
This data as would be presented in the UI

image

We can create an schema that gets used to generate a query like this to fetch the job metrics:

const input = {
  dimensions: [
    { column: "name", as: "jobName" },
    { column: "stage" },
  ],
  metrics: [
    { function: "count" },
    { function: "avg", column: "duration",  as: "medianDuration" },
    { function: "quantile(0.95)", column: "duration",  as: "p95Duration" },
    {
      function: "countIf",
      condition: [
        { column: "status", operator: "=", value: "failed" },
      ],
      as: "failedCount"
    },
    {
      function: "countIf",
      condition: [
        { column: "status", operator: "=", value: "success" },
      ],
      as: "successCount"
    }
  ],
  filter: [
    { column: "jobName", operator: "LIKE", value: "job%" }
  ],
  orderBy: [
    [{ 
      column: 'medianDuration',
      desc: true
    }, {
      column: 'jobName',
      desc: true
    }]
  ],
  limit: 10,
  offset: 0
};

GraphQL schema (draft)

Our schema would require users to define at least one metric, some dimensions and an optional number of filters. The schema for each looks like this:

Analytics Query

input AnalyticsQueryInput {
  dimensions: [AnalyticsDimensionInput]
  metrics: [AnalyticsMetricInput!] # at least one metric is needed, like `count`
  filter: [AnalyticsFilterInput]
  orderBy: [AnalyticsOrderByInput]
  limit: Int
  offset: Int
}
query getJobAnalytics(
  $fullPath: ID!
  $fromTime: Time!
  $toTime: Time!
  $source: CiPipelineSources
  $branch: String
  $query: AnalyticsQuery # here is where the magic happens!
) {
  project(fullPath: $fullPath) {
    id
    pipelineAnalytics(fromTime: $fromTime, toTime: $toTime, source: $source, ref: $branch, $query: query) {
      values # ordered array representation of the rows
      errors
    }
  }
}

Dimension input

input AnalyticsDimensionInput {
  column: String! # e.g. the job's "name", "stage"
  as: String # optional
}

Example of a dimension:

{ column: "name" } // maps to SELECT "name" ... GROUP BY "name"
{ column: "name", as: "jobName" } // maps to SELECT "name" as "jobName" ... GROUP BY "jobName"

Metric input

input AnalyticsMetricInput {
  function: String! # an aggregate function e.g. `avg`, `quantile(0.5)`, `countIf`
  column: String # e.g. duration, none represents `*`
  filter: FiltersInput # to pass to combinators e.g. { column: `status`, operator: `=`, value: 'failed' }
  as: String # avgDuration
}

Example of a few metrics:

{ function: `count` as: `count` }
// maps to: SELECT ... count(*) as "count" FROM ...

{ function: `avg`, column: `duration`, as: `avg` }
// maps to: SELECT ... avg(duration) as "avg" FROM ...

{ function: `countIf`, filter: { column: `status`, operator: `=`, value: 'failed' }, as: `failedCount` }
// maps to: SELECT ... countIf(status = 'failed') as "failedCount" FROM ...

Filters input

input AnalyticsFiltersInput {
  column: String, # e.g. name
  operator: String, # e.g. =, LIKE
  value: [String, Number], # comparator
}

Example of a filter:

{ column: "jobName", operator: "=", value: "job" }
// maps to: SELECT ... WHERE jobName = "job"

{ column: "jobName", operator: "LIKE", value: "job%" }
// maps to: SELECT ... WHERE jobName = "job%"

Order By

input AnalyticsOrderByInput {
  column: String, # e.g. name (the job name)
  asc: Boolean, # true by default, send false for desc 
}

Example of an orderBy (we can pass multiple columns).

{ column: "avgDuration", desc: true }

Out of scope

  • Time series: #453956 (closed) does not yet consider which act as a special kind of grouping over a time dimension.
    • We expect this API can be extended to include time series with different time periods.
  • Other operators/filters We are only considering operators needed to implement the query
Edited by 🤖 GitLab Bot 🤖