[VSA][Discovery] Backend performance optimization before query change
Value Stream Analytics backend proposal
TLDR
Due to a proposed timestamp filter change, the DB queries of Value Stream Analytics will get more complicated and require significant database-related work to keep them viable (same performance as today). The work would take several milestones. I think we can make the feature more performant and "future-proof" by collecting the VSA-related data in separate table(s) and keeping it up to date asynchronously.
Feature overview
Value stream analytics measures the duration between two timestamp columns and exposes the matching records. This functionality enables the user to measure the cycle time during the development process.
Example:
- Calculate the average duration between the merge request's merged at time and its first deployed to production time.
- Calculate the average duration between merge request creation and when the "workflow::production" label was added.
- Show the relevant records sorted by duration (longest duration first).
SQL (merged at - first deployed to production):
SELECT AVG(merge_request_metrics.first_deployed_to_production_at - merge_request_metrics.merged_at)
FROM merge_requests
INNER JOIN (merge_request_metrics...)
WHERE
merge_requests.target_project_id IN (recursive namespace query)
AND merge_requests.created_at BETWEEN 'X' and 'Y'
Notice the created_at filter. This limits the number of rows the DB needs to scan.
The timestamp column definitions (start event, end event) are stored in a model called Stage (table: analytics_cycle_analytics_group_stages). A Stage is associated with a Group or a Project. We have a fixed list of supported timestamp columns. Currently, these columns are scattered across the database:
- merge_requests
- issues
- merge_request_metrics
- issue_metrics
- resource_label_events
There are strict rules which timestamp columns are allowed together, a slightly outdated graph can be found here: https://docs.gitlab.com/ee/development/value_stream_analytics.html#validating-start-and-end-events
Performance
The group-level performance was reviewed in 2019 Q4 and approved; however, the feature has become significantly slower since then. Note: we use the standard IssuableFinders to build the queries and we look at the whole hierarchy (recursive namespace lookup). My assumption is that the number of subgroups, projects, issues and merge requests grew significantly in the past year, which affects the query performance.
- Median aggregation query from DB-lab: https://explain.depesz.com/s/fj9B, https://explain.depesz.com/s/lvYG
- Records query: https://explain.depesz.com/s/5AKY
VSA queries Issue or MergeRequest records within a given time range based on the created_at column. We have an index covering the project_id and created_at columns in both database tables so the number of records we scan is somewhat limited. Note: this is going to change.
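For reference, the indexes we rely on today look roughly like this (the index names are illustrative; the actual indexes on GitLab.com may be named differently):
-- Assumed shape of the existing indexes used by the created_at filter:
CREATE INDEX index_issues_on_project_id_and_created_at
  ON issues (project_id, created_at);
CREATE INDEX index_merge_requests_on_target_project_id_and_created_at
  ON merge_requests (target_project_id, created_at);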
The feature change
The way VSA works (created_at filter) was defined a long time ago and we don't exactly know the reasoning behind it. There is a proposal to change the created_at filter to use the end event timestamp instead, to address customer confusion.
We have about 15 possible end events in different database tables which would make the database queries more difficult to optimize (add index or de-normalize and add index).
Supported end events:
issues.created_at
issue_metrics.first_mentioned_in_commit_at
issue_metrics.first_associated_with_milestone_at
issue_metrics.first_added_to_board_at
merge_request_metrics.first_deployed_to_production_at
merge_request_metrics.latest_build_started_at
merge_request_metrics.latest_build_finished_at
merge_request_metrics.latest_closed_at
merge_request_metrics.first_commit_at
merge_request_metrics.merged_at
merge_requests.created_at
issues.closed_at
issues.last_edited_at
resource_label_events.created_at
Example 1: Index needed
In case the end event is merge_request_metrics.latest_build_finished_at, an extra index is needed on the target_project_id, latest_build_finished_at columns. Note that target_project_id is already available in the merge_request_metrics table; this de-normalization was done for another feature.
We might also need to tweak the query to avoid joining all the MergeRequest records in the group hierarchy and only look at the ones matching the merge_request_metrics.latest_build_finished_at filter.
Example query:
SELECT AVG(merge_request_metrics.latest_build_finished_at - merge_requests.created_at)
FROM merge_requests
INNER JOIN (merge_request_metrics...)
WHERE
merge_requests.target_project_id IN (recursive namespace query)
AND merge_request_metrics.latest_build_finished_at BETWEEN 'X' and 'Y'
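A minimal sketch of the extra index, assuming we also make it partial to keep it small (the index name and the partial condition are illustrative, not a final migration):
-- Supports filtering by project and the latest_build_finished_at end event:
CREATE INDEX CONCURRENTLY index_mr_metrics_on_project_and_latest_build_finished_at
  ON merge_request_metrics (target_project_id, latest_build_finished_at)
  WHERE latest_build_finished_at IS NOT NULL;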
Example 2: De-normalization
The issue_metrics database table does not have a project_id column, so in case the end event is issue_metrics.first_added_to_board_at, the DB loads all issues within the group hierarchy and joins the issue_metrics table. The query needs to scan all issue records to filter them down by the joined issue_metrics table.
To make it performant, we can add the project_id column to the issue_metrics table.
Note: the same thing needs to happen in the resource_label_events table.
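A rough sketch of the de-normalization for issue_metrics, assuming we backfill project_id from the issues table (the batching strategy and index name are illustrative):
-- 1. Add the de-normalized column:
ALTER TABLE issue_metrics ADD COLUMN project_id bigint;

-- 2. Backfill it in batches from the issues table (one batch shown):
UPDATE issue_metrics
SET project_id = issues.project_id
FROM issues
WHERE issues.id = issue_metrics.issue_id
  AND issue_metrics.id BETWEEN 1 AND 10000;

-- 3. Index supporting the first_added_to_board_at end event:
CREATE INDEX CONCURRENTLY index_issue_metrics_on_project_id_and_first_added_to_board_at
  ON issue_metrics (project_id, first_added_to_board_at);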
Database concerns
By doing these query optimizations for each database column (about 15 columns) we can make the feature as performant as the current version; however, in my opinion, the costs are quite high:
- De-normalization work for at least 2 tables (minimum 2 milestones).
- Index creation on (project_id, timestamp_column).
- Rewriting, optimizing and reviewing the issuable queries for each column.
- One index is about 3 GB; in some cases we would have fewer rows in the index.
I estimate around 40-50 GB for the indexes alone. Note that this number will grow at the same rate as GitLab.com grows.
Alternative solution
Value Stream Analytics is a premium feature, so only a small percentage of the MergeRequest (<10%) and Issue (<3%) records would be queried on GitLab.com. Looking at actual VSA usage, this number decreases a bit further (MergeRequest <5%, Issue <3%).
My idea is to collect and track the start and end event timestamps in a separate table so the queries will mostly be INDEX ONLY SCANs. To deal with the large data volume, we would run the AVERAGE and MEDIAN queries in batches (keyset paginated) with some sort of limit to avoid reading too many rows per web request. For example: read a maximum of 10x10_000 rows (10 queries). If the limit is reached, show a message: truncated results, too much data.
The database structure for collecting stage data (for merge requests). Table name: analytics_merge_request_stage_events
| column | type | description |
|---|---|---|
| merge_request_id | bigint | composite primary key with stage_id |
| stage_id | bigint | points to the stage record where the start and end event is defined |
| start_event_timestamp | bigint | stores the start event timestamp for the stage |
| end_event_timestamp | bigint | stores the end event timestamp for the stage |
| group_id | bigint | Project's group |
| project_id | bigint | Project where the merge request was created |
This structure allows very fast queries where we filter the end event with a time range. Indexes:
# in the root group
(stage_id, end_event_timestamp, start_event_timestamp, merge_request_id)
# in a subgroup
(stage_id, group_id, end_event_timestamp, start_event_timestamp, merge_request_id)
# in a project
(stage_id, project_id, end_event_timestamp, start_event_timestamp, merge_request_id)
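Expressed as PostgreSQL DDL, the table and the indexes above could look like this sketch (index names are illustrative and the exact column types may still change):
CREATE TABLE analytics_merge_request_stage_events (
  merge_request_id bigint NOT NULL,
  stage_id bigint NOT NULL,
  group_id bigint NOT NULL,
  project_id bigint NOT NULL,
  start_event_timestamp bigint, -- per the table above; e.g. epoch seconds
  end_event_timestamp bigint,   -- per the table above; e.g. epoch seconds
  PRIMARY KEY (stage_id, merge_request_id)
);

-- In the root group:
CREATE INDEX index_mr_stage_events_for_group_hierarchy_queries
  ON analytics_merge_request_stage_events (stage_id, end_event_timestamp, start_event_timestamp, merge_request_id);

-- In a subgroup:
CREATE INDEX index_mr_stage_events_for_subgroup_queries
  ON analytics_merge_request_stage_events (stage_id, group_id, end_event_timestamp, start_event_timestamp, merge_request_id);

-- In a project:
CREATE INDEX index_mr_stage_events_for_project_queries
  ON analytics_merge_request_stage_events (stage_id, project_id, end_event_timestamp, start_event_timestamp, merge_request_id);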
Query example:
SELECT AVG(end_event_timestamp - start_event_timestamp)
FROM analytics_merge_request_stage_events
WHERE end_event_timestamp BETWEEN '2020-01-01' AND '2020-06-06'
AND stage_id = 5
The merge_request_id would provide the tie-breaker behaviour when listing records (pagination) and could also be used when aggregating in batches (avoiding statement timeouts). For example: count and sum the durations for 10,000 records at a time. The query would run N times and we could calculate the average in Ruby:
# N = 2
count_batch_1 = 10000
sum_batch_1 = 123123123
count_batch_2 = 15
sum_batch_2 = 123
average = (sum_batch_1 + sum_batch_2) / (count_batch_1 + count_batch_2)
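A sketch of one aggregation batch in SQL, using (end_event_timestamp, merge_request_id) as the keyset cursor; the :from/:to and :last_* placeholders and the 10_000 batch size are illustrative:
-- One batch of the AVG aggregation. The application tracks the (end_event_timestamp,
-- merge_request_id) pair of the last row in each batch and passes it in as the cursor
-- for the next batch.
SELECT COUNT(*) AS batch_count,
       SUM(end_event_timestamp - start_event_timestamp) AS batch_sum
FROM (
  SELECT merge_request_id, start_event_timestamp, end_event_timestamp
  FROM analytics_merge_request_stage_events
  WHERE stage_id = 5
    AND end_event_timestamp BETWEEN :from AND :to
    AND (end_event_timestamp, merge_request_id) > (:last_end_event_timestamp, :last_merge_request_id)
  ORDER BY end_event_timestamp, merge_request_id
  LIMIT 10000
) batch;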
Finders and queries
The current IssuableFinder adds extra JOINs and EXISTS queries to perform the access control (confidential issues, project authorization). Since VSA operates on the Group level, we could check group permissions on page load so we can avoid complicated queries.
Data separation
VSA is an analytical feature that ideally should live in its own database (Timescale?). Today it's too early to make this move; however, with the sharding group's work, things might be easier in the future. For now, what we can do is prepare for easy separation from the core database tables:
- Eventually consistent data.
- Asynchronous data collection.
- No foreign keys on the database columns.
We'll need a well-defined interface where the data boundaries are crossed (for example, when the merge_requests table is joined). In case our data ends up in a different database, we could "easily" modify the queries to work across database nodes (disable_joins).
Data collection
Stage records are associated with a Group or with a Project. By collecting all stages in the group hierarchy, we can build batched queries to insert new data in the analytics_merge_request_stage_events database table.
The following table keeps track of the latest processed updated_at column value: analytics_merge_request_stage_updates
| column | type | description |
|---|---|---|
| stage_id | bigint | primary key |
| last_updated_at | timestamp | stores the last processed merge_requests.updated_at |
First run:
- Find the top-level Group for the stage.
- Build a batched (keyset paginated) query over the merge requests in the Group hierarchy. (PoC: !62379 (closed))
- Collect the timestamps that are required for the stages.
- Insert the rows into the analytics_merge_request_stage_events table.
- Insert a row with stage_id and last_updated_at values.
Second run:
- Find the top-level Group for the stage.
- Build a batched (keyset paginated) query over the merge requests in the Group hierarchy.
- Using the last_updated_at column, iterate over the recently updated records.
- Insert the rows into the analytics_merge_request_stage_events table.
- Update the last_updated_at column.
Note: we might need to do multiple LEFT JOINs to gather the relevant timestamp columns from different tables; however, this would not make the query significantly slower. The batch query would always read N MergeRequest rows.
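As a sketch, one incremental batch for a stage defined as merged_at -> first_deployed_to_production_at could look like the query below; the stage_id literal, the epoch-seconds representation, the batch size and the :last_updated_at placeholder are all illustrative assumptions:
INSERT INTO analytics_merge_request_stage_events
  (merge_request_id, stage_id, group_id, project_id, start_event_timestamp, end_event_timestamp)
SELECT
  merge_requests.id,
  5, -- the stage being processed
  projects.namespace_id,
  merge_requests.target_project_id,
  EXTRACT(EPOCH FROM merge_request_metrics.merged_at)::bigint, -- assumes epoch seconds for the bigint columns
  EXTRACT(EPOCH FROM merge_request_metrics.first_deployed_to_production_at)::bigint
FROM merge_requests
INNER JOIN projects ON projects.id = merge_requests.target_project_id
LEFT JOIN merge_request_metrics ON merge_request_metrics.merge_request_id = merge_requests.id
WHERE merge_requests.target_project_id IN (recursive namespace query)
  AND merge_requests.updated_at >= :last_updated_at -- re-processing boundary rows is safe, the upsert is idempotent
ORDER BY merge_requests.updated_at, merge_requests.id
LIMIT 1000
ON CONFLICT (stage_id, merge_request_id) DO UPDATE
  SET start_event_timestamp = EXCLUDED.start_event_timestamp,
      end_event_timestamp = EXCLUDED.end_event_timestamp;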
I'm not sure this is the best idea. Currently, I don't have a "cheaper" way of detecting Issue or MergeRequest changes in a group hierarchy. I don't think we have hosted-plan-specific pub-sub functionality.
Consistency
When a MergeRequest or an Issue record is updated, the record update process described above would "catch" the changes and the data would be updated eventually.
We also need to ensure consistency when a project is removed or some records are deleted. We can do that with a periodic job that scans the analytics_merge_request_stage_events database table and makes an EXISTS query against the merge_requests table.
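A sketch of that periodic cleanup, deleting stage event rows whose merge request no longer exists (the batch size is illustrative):
DELETE FROM analytics_merge_request_stage_events
WHERE (stage_id, merge_request_id) IN (
  SELECT stage_id, merge_request_id
  FROM analytics_merge_request_stage_events
  WHERE NOT EXISTS (
    SELECT 1
    FROM merge_requests
    WHERE merge_requests.id = analytics_merge_request_stage_events.merge_request_id
  )
  LIMIT 1000 -- process in batches from the periodic job
);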
Table size
Since we might have multiple stages per group, we would store quite a lot of data in the analytics_merge_request_stage_events database table (worst case: 5 rows per merge request record). To address table size concerns, we could do hash partitioning or time range partitioning (using a time-decay pattern if Product is OK with keeping 1 or 2 years of data).
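For example, hash partitioning by merge_request_id in PostgreSQL could look like this sketch (32 partitions is an arbitrary illustrative number):
-- The parent table from the DDL sketch above would be declared with:
--   ... PARTITION BY HASH (merge_request_id)
-- and split into fixed hash partitions:
CREATE TABLE analytics_merge_request_stage_events_00
  PARTITION OF analytics_merge_request_stage_events
  FOR VALUES WITH (MODULUS 32, REMAINDER 0);
-- analytics_merge_request_stage_events_01 .. _31 follow the same pattern.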
Cross joins
VSA allows filtering on labels, author, assignee, milestone. With the new structure, we would be able to query this data using JOINs. When we have complete separation (different data store), this data would need to be duplicated.
No FOSS
There is a free (FOSS) version of VSA with fixed stages, no customization allowed, and project-level only. Collecting data for FOSS users would cost too much storage and processing power, so to keep the data volume low we would not support the FOSS version with the new backend. We can keep the old behavior there (no timestamp filter change).