Commit d3a6fdb2 authored by Marc Saleiko's avatar Marc Saleiko

Adds design documents from first half of 2023

---
title: CI Builds and Runner Fleet metrics database architecture
status: proposed
creation-date: "2023-01-25"
authors: [ "@pedropombeiro", "@vshushlin"]
coach: "@grzesiek"
approvers: []
stage: Verify
group: Runner
participating-stages: []
toc_hide: true
---

{{< design-document-header >}}

The CI section envisions new value-added features in GitLab for CI Builds and Runner Fleet focused on observability and automation. However, implementing these features and delivering on the product vision of observability, automation, and AI optimization using the current database architecture in PostgreSQL is very hard because:

- CI-related transactional tables are huge, so any modification to them can increase the load on the database and subsequently cause incidents.
- PostgreSQL is not optimized for running aggregation queries.
- We also want to add more information from the build environment, making CI tables even larger.
- We also need a data model to aggregate data sets for the GitLab CI efficiency machine learning models, which are the basis of the Runner Fleet AI solution.

We want to create a new flexible database architecture which:

- will support known reporting requirements for CI builds and Runner Fleet.
- can be used to ingest data from the CI build environment.

We may also use this database architecture to facilitate development of AI features in the future.

Our recent usability research on navigation and other areas suggests that the GitLab UI is overloaded with information and navigational elements.
This results from trying to add as much information as possible and attempting to place features in the most discoverable places.
Therefore, while developing these new observability features, we will rely on Jobs to be Done research and solution validation to ensure that the features deliver the most value.

## Runner Fleet

### Metrics - MVC

#### What is the estimated wait time in queue for an instance runner?

The following customer problems should be solved when addressing this question. Most of them are quotes from our usability research:

**UI**

- "There is no visibility for expected Runner queue wait times."
- "I got here looking for a view that makes it more obvious if I have a bottleneck on my specific runner."

**Types of metrics**

- "Is it possible to get metrics out of GitLab to check for the runners availability & pipeline wait times?
  Goal - we need the data to evaluate the data to determine if to scale up the Runner fleet so that there is no waiting times for developer’s pipelines."
- "What is the estimated time in the Runner queue before a job can start?"

**Interpreting metrics**

- "What metrics for Runner queue performance should I look at and how do I interpret the metrics and take action?"
- "I want to be able to analyze data on Runner queue performance over time so that I can determine if the reports from developers are really just rare cases regarding availability."

#### What is the estimated wait time in queue on a group runner?

#### What is the mean estimated wait time in queue for all instance runners?

#### What is the mean estimated wait time in queue for all group runners?

#### Which runners have failures in the past hour?

## CI Insights

CI Insights is a page that would mostly expose data on pipelines and jobs duration, with a multitude of different filters, search and dynamic graphs. To read more on this, see [this related sub-section](ci_insights.md).

## Implementation

The current implementation plan is based on a
[Proof of Concept](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/126863).
For an up to date status, see [epic 10682](https://gitlab.com/groups/gitlab-org/-/epics/10682).

### Database selection

In FY23, ClickHouse [was selected as GitLab standard datastore](../../../../company/working-groups/clickhouse-datastore/#context)
for features with big data and insert-heavy requirements.
So we have chosen it for our CI analytics as well.

### Scope of data

We're starting with a denormalized version of the `ci_builds` table from the main database,
which will include fields from some other tables, such as `ci_runners` and `ci_runner_machines`.

[Immutability is a key constraint in ClickHouse](https://docs.gitlab.com/ee/development/database/clickhouse/index.html#how-it-differs-from-postgresql),
so we only use `finished` builds.

### Developing behind feature flags

It's hard to fully test data ingestion and query performance in development/staging environments.
That's why we plan to deliver those features to production behind feature flags and test the performance on real data.
Feature flags for data ingestion and APIs will be separate.

### Data ingestion

Every time a job finishes, a record will be created in a new `p_ci_finished_build_ch_sync_events` table, which includes
the `build_id` and a `processed` value.
A background worker loops through unprocessed `p_ci_finished_build_ch_sync_events` records and pushes the denormalized
`ci_builds` information from Postgres to ClickHouse.

At some point we most likely will need to
[parallelize this worker because of the number of processed builds](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/126863#note_1494922639).
This will be achieved by having the cron worker accept an argument determining the number of workers. The cron worker
will use that argument to queue the respective number of workers that will actually perform the syncing to ClickHouse.

We will start with the most recent builds and will not upload all historical data.
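The loop described above can be sketched roughly as follows. This is an illustrative in-memory model, not the actual worker: `SyncEvent`, the batch size, and the insert target are stand-ins for the real sync-event records and the ClickHouse insert.

```ruby
# Illustrative model of the sync loop: each finished build gets a sync event,
# and the worker pushes the unprocessed events to ClickHouse in batches.
SyncEvent = Struct.new(:build_id, :processed)

# `clickhouse_rows` stands in for the ClickHouse "raw data" table.
def process_sync_events(events, clickhouse_rows, batch_size: 100)
  events.reject(&:processed).each_slice(batch_size) do |batch|
    batch.each do |event|
      # In the real worker this would be a denormalized ci_builds row.
      clickhouse_rows << { build_id: event.build_id }
      event.processed = true
    end
  end
  clickhouse_rows
end
```

Parallelizing then amounts to handing each queued worker its own slice of unprocessed events.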

### "Raw data", materialized views and queries

Ingested data will go to the "raw data" table in ClickHouse.
This table will use the `ReplacingMergeTree` engine to deduplicate rows in case the data ingestion mechanism accidentally submits the same batch twice.

Raw data can be used directly to execute queries, but most of the time we will create specialized materialized views
using the `AggregatingMergeTree` engine.
This will allow us to read significantly less data when performing queries.
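To illustrate why the pre-aggregated view reads less data, here is a toy Ruby model (not ClickHouse itself, and the class and field names are illustrative): the "view" keeps per-status running sums that are maintained at insert time, so an average-duration query touches one aggregate row per status instead of scanning every raw row.

```ruby
# Toy model of "raw data" plus a materialized aggregate, updated on insert.
class FinishedBuildsStore
  attr_reader :raw, :by_status

  def initialize
    @raw = []
    @by_status = Hash.new { |h, k| h[k] = { count: 0, total_duration: 0 } }
  end

  def insert(build)
    @raw << build
    # The "materialized view": running sums keyed by status.
    agg = @by_status[build[:status]]
    agg[:count] += 1
    agg[:total_duration] += build[:duration]
  end

  # Answered from the aggregate alone, without touching @raw.
  def avg_duration(status)
    agg = @by_status[status]
    agg[:count].zero? ? nil : agg[:total_duration].to_f / agg[:count]
  end
end
```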

### Limitations and open questions

The topics below require further investigation.

#### Efficient way of querying data for namespaces

We start with the PoC available only to administrators,
but very soon we will need to implement features at the group level.

We can't just put a denormalized "path" in the source table because it can change when groups or projects are moved.

The simplest way of solving this is to always filter builds by `project_id`,
but this may be inefficient and require reading a significant portion of all data, because ClickHouse stores data in big batches.

#### Keeping the database schema up to date

Right now we don't have any mechanism equivalent to the migrations we use for PostgreSQL.
While developing our first features, we will maintain the database schema by hand and
continue developing mechanisms for migrations.

#### Re-uploading data after changing the schema

If we need to modify the database schema, old data may be incomplete.
In that case we can simply truncate the ClickHouse tables and re-upload (part of) the data.
---
title: 'CI Insights'
---

## Summary

As part of the Fleet Metrics, we would like to have a section dedicated to CI insights to help users monitor pipelines and summarize findings about pipeline speed, common job failures, and more. It would eventually offer actionables to help users optimize and fix issues with their CI/CD.

## Motivation

We have a [page for CI/CD Analytics](https://gitlab.com/gitlab-org/gitlab/-/pipelines/charts?chart=pipelines) that contains some very basic analytics on pipelines. Most of this information relates to the **total** number of pipelines over time, which does not give any real value to customers: projects will always see their pipeline count increase over time, so the total number of pipelines is of little consequence.

![Current page](../img/current_page.png)

Because this page lacks real insights, understanding pipeline slowdowns or failures becomes a very manual task. We want to empower users to optimize their workflow in a centralized place, avoiding the manual labor of either querying the API for data and parsing it by hand, or navigating the UI through dozens of pages until the insight or required action can be found.

As we are going to process large quantities of data relating to a project's pipelines, there is potential to eventually summarize findings with an AI tool to give insights into job failures, pipeline slowdowns, and flaky specs. As AI has become a crucial part of our product roadmap and Verify lacks any promising lead in that area, this page could be the center of this new addition.

### Goals

- Deliver a new Pipelines Analysis Dashboard page
- Have excellent data visualization to help digest information quickly
- Flexible querying to let users get the information they want
- Clear actionables based on information presented in the page
- Show some default information on landing, like pipeline durations over time and slowest jobs
- Make the CI/CD Analytics page more accessible, liked, and remembered (AKA, more page views)

### Non-Goals

We do not aim to improve the GitLab project's pipeline speed. This feature could help us achieve this, but it is not a direct objective of this blueprint.

We are also not aiming to have AI in the first iteration and should instead focus on making as much information available and digestible as possible.

## Proposal

Revamp the [page for CI/CD Analytics](https://gitlab.com/gitlab-org/gitlab/-/pipelines/charts?chart=pipelines) to include more meaningful data so that users can troubleshoot their pipelines with ease. Here is a list of the main improvements:

### Overall statistics

The current "overall statistics" will become a one-line header in a smaller font to keep this information available without taking as much visual space. For the pipelines chart, we will replace it with a stacked bar plot where each stack of a bar represents a status and each bar is a unit of time (a day in the daily view, a month in the monthly view, a year in the yearly view), so users can keep track of how many pipelines ran in that specific unit of time and what percentage of those pipelines ended up failing or succeeding.

### Pipeline duration graph

A new pipeline duration graph that can be customized by type (MR pipelines, pipelines on a specific branch, etc.), number of runs, and status (success, failed, etc.) will replace the current `Pipeline durations for the last 30 commits` chart. The existing chart checks the latest 30 commits made on the repository with no filtering, so the results presented are not very valuable.

We will also surface jobs that failed multiple times and jobs that are the slowest in the last x pipelines on master. All of this supports the effort of allowing users to query their pipeline data to figure out what they need to improve or what kind of problems they are facing with their CI/CD configuration.

### Visibility

Add a link in the `pipelines` page to increase the visibility of this feature. We can add a new option alongside the `Run pipeline` primary button.

### Master Broken

Add an "Is master broken?" quick option that scans the last x pipelines on the main branch and checks for failed jobs. All jobs that failed multiple times will be listed in a table, with the option to create an incident from that list.
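The scan behind this option can be sketched as follows; the data shape, method name, and failure threshold are illustrative assumptions, not the actual implementation.

```ruby
# Hedged sketch: given the last N pipelines on the default branch as plain
# hashes, list the jobs that failed at least `threshold` times.
def repeatedly_failing_jobs(pipelines, threshold: 2)
  failures = Hash.new(0)
  pipelines.each do |pipeline|
    pipeline[:jobs].each do |job|
      failures[job[:name]] += 1 if job[:status] == 'failed'
    end
  end
  failures.select { |_, count| count >= threshold }.keys
end
```

The resulting job names would feed the table from which an incident can be created.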

### Color scheme

Rethink our current color schemes for data visualization when it comes to pipeline statuses. We currently use the default visualization colors, but they don't match the colors users have grown accustomed to for pipeline and job statuses. There is an opportunity here to help users better understand their data through more relevant color schemes and better visualization.

### Routing

Change the routing from `pipelines/charts` to `pipelines/analytics`, since `charts` is a really restrictive term when talking about data visualization. It also doesn't convey what this page is: a way to get information, not just nice charts. Then we can also get rid of the query parameter for the tabs and instead support first-class routing.

## Design and implementation details

### New API for aggregated data

This feature depends on having a new set of data available to us that aggregates job and pipeline insights and makes them available to the client.

We'll start by aggregating data from ClickHouse, and probably only for `gitlab.com`, as the MVC. We will aggregate the data on the backend on the fly. So far ClickHouse has been very capable of such things.

We won't store the aggregated data anywhere (we'll probably have materialized views in ClickHouse, but nothing more complex). Then, if the features get traction, we can explore ways to bring them to environments without ClickHouse.

This way we can move fast, test our ideas with real users, and get feedback.

### Feature flag

To develop this new analytics page, we will gate it behind a feature flag, `ci_insights`, and conditionally render the old or new analytics page. Potentially, we could even add the flag in the controller to decide which route to render: the new `/analytics` when the flag is on, and the old `/charts` when it isn't.
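The controller-level gate can be sketched like this; only the `ci_insights` flag name comes from this document, while the class, method names, and flag lookup are illustrative stand-ins for the actual GitLab feature-flag API.

```ruby
# Minimal sketch of the flag-based routing decision.
class PipelinesAnalyticsRouter
  def initialize(flags = {})
    @flags = flags
  end

  # Stand-in for a Feature.enabled?-style predicate.
  def enabled?(name)
    @flags.fetch(name, false)
  end

  # Render the new analytics page when the flag is on,
  # the legacy charts page otherwise.
  def page
    enabled?(:ci_insights) ? 'pipelines/analytics' : 'pipelines/charts'
  end
end
```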

### Add analytics on page view

Make sure that we can get information on how often this page is viewed. If we do not have it, then let's implement some analytics to know how visible this page is. The changes in this blueprint should make the view count go up, and we want to track this as a measure of success.

### Routing

We are planning to have new routes for the page and some redirects to set up. To read more about the routing proposal, see the [related issue](https://gitlab.com/gitlab-org/gitlab/-/issues/437556).

### Pipelines duration graph

We want a way for users to query data about pipelines with many different criteria. Most notably, querying only for pipelines with the scope `finished`, or by status `success` or `failed`. There is also the possibility of scoping this to a ref, so users could test either the main branch or a branch that has introduced a CI/CD change. We want branch comparison for pipeline speed.
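A minimal sketch of these filter semantics over an illustrative in-memory data shape (the real implementation would push these predicates into the ClickHouse query rather than filter in Ruby):

```ruby
# Hedged sketch: select pipelines by status and/or ref; a nil filter
# means "don't filter on that dimension".
def filter_pipelines(pipelines, status: nil, ref: nil)
  pipelines.select do |p|
    (status.nil? || p[:status] == status) &&
      (ref.nil? || p[:ref] == ref)
  end
end
```

Branch comparison for pipeline speed then becomes two calls with different `ref:` values over the same window.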

To get more accurate data, we want to increase the count of pipelines requested. In GraphQL, we have a limit of 100 items, and we will probably hit performance degradation quite quickly. We need to define how we could get a larger data set for more accurate data visualization.

### Jobs insights

Currently, there is no way to query a single job across multiple pipelines, which prevents us from writing a query that would look like this:

```graphql
query getJob($projectPath: ID!, $jobName: String!){
  project(fullPath:$projectPath){
    job(name: $jobName, last: 100){
      nodes{
        id
        duration
      }
    }
  }
}
```

There are plans to create a new unified table to log job analytics, but it is not yet defined what this API will look like. Without committing to an API definition yet, we want some unified way to query information for analytics that may look roughly like so:

```ruby
get_jobs(project_id:, job_name: nil, stage: nil, stage_index: nil, **options)
# =>
# [{ id: 1, duration: 134, status: 'failed' }, ...]

get_jobs_statistics(project_id:, job_name:, **options)
# =>
# [{ time_bucket: '2024-01-01 00:00:00', avg_duration: 234, count: 123,
#    statuses_count: { success: 123, failed: 45, cancelled: 45 } }]
```

### Revamping our charts

Explore a new color scheme and a nicer look for our charts. Collaborate with UX to determine whether this is something they already had in mind, and support any initiative to have nicer, more modern-looking charts, as our current charts are quite forgettable.

## Alternative Solutions

### New page

We could create a brand new page and leave this section as it is. The pro is that we could perhaps have a more prominent placement in the navigation under `Build`; the con is that we'd have a clear overlap with the existing section.

### Pipeline analysis per pipeline

There was an [experiment](https://gitlab.com/gitlab-org/gitlab/-/issues/365902) in the past to add performance insights **per pipeline**. The experiment was removed and deemed not viable. Some of the findings were that:

- Users did not interact with the page as much as expected and would not click on the button to view insights.
- Users who did click on the button did not try to get more insights into a job.
- Users did not leave feedback in the issue.

This experiment mostly reveals that users who go to the pipeline graph page `pipelines/:id` are **not** trying to improve the performance of pipelines. Instead, it is most likely that this page is used to debug pipeline failures, which means those users are from the IC/developer persona, not DevOps engineers trying to improve the workflow. By having this section in a more "broad" area, we expect much better adoption and more useful actionables.

### Do nothing

We could leave this section untouched and not add any new form of analytics. The pro here would be the saved resources and time. The con is that we currently have no way to help customers improve their CI/CD configuration speed other than reading our documentation. This revamped section would also be a great gateway for AI features and would help users iterate on their setup.