[VSA][Discovery] Backend performance optimization before query change
Value Stream Analytics backend proposal
TLDR
Due to a proposed timestamp filter change, the DB queries of Value Stream Analytics will get more complicated and require significant database-related work to keep them viable (same performance as today). The work would take several milestones. I think we can make the feature more performant and "future-proof" by collecting the VSA-related data in separate table(s) and keeping it up to date asynchronously.
Feature overview
Value stream analytics measures the duration between two timestamp columns and exposes the matching records. This functionality enables the user to measure the cycle time during the development process.
Example:
- Calculate the average duration between the merge request's merged at time and its first deployed to production time.
- Calculate the average duration between merge request creation and when the "workflow::production" label was added.
- Show the relevant records sorted by duration (longest duration first).
SQL (merged at - first deployed to production):
SELECT AVG(merge_request_metrics.first_deployed_to_production_at - merge_request_metrics.merged_at)
FROM merge_requests
INNER JOIN (merge_request_metrics...)
WHERE
merge_requests.target_project_id IN (recursive namespace query)
AND merge_requests.created_at BETWEEN 'X' and 'Y'
Notice the created_at filter. This limits the number of rows the DB needs to scan.
The timestamp column definitions (start event, end event) are stored in a model called Stage (table: analytics_cycle_analytics_group_stages). A Stage is associated with a Group or a Project. We have a fixed list of supported timestamp columns. Currently, these columns are scattered across the database:
- merge_requests
- issues
- merge_request_metrics
- issue_metrics
- resource_label_events
There are strict rules which timestamp columns are allowed together, a slightly outdated graph can be found here: https://docs.gitlab.com/ee/development/value_stream_analytics.html#validating-start-and-end-events
Performance
The group-level performance was reviewed in 2019 Q4 and approved; however, the feature has become significantly slower since then. Note: we use the standard IssuableFinders to build the queries and we look at the whole hierarchy (recursive namespace lookup). My assumption is that the number of subgroups, projects, issues and merge requests grew significantly in the past year, which affects the query performance.
- Median aggregation query from DB-lab: https://explain.depesz.com/s/fj9B, https://explain.depesz.com/s/lvYG
- Records query: https://explain.depesz.com/s/5AKY
VSA queries Issue or MergeRequest records within a given time range based on the created_at column. We have an index covering the project_id and created_at columns in both database tables so the number of records we scan is somewhat limited. Note: this is going to change.
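For reference, the indexes we rely on today look roughly like this (the index names are illustrative; the actual indexes on GitLab.com may be named differently):
-- Assumed shape of the existing indexes used by the created_at filter:
CREATE INDEX index_issues_on_project_id_and_created_at
  ON issues (project_id, created_at);
CREATE INDEX index_merge_requests_on_target_project_id_and_created_at
  ON merge_requests (target_project_id, created_at);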
The feature change
The way VSA works (created_at filter) was defined a long time ago and we don't exactly know the reasoning behind it. There is a proposal to change the created_at filter to use the end event timestamp instead, to address customer confusion.
We have about 15 possible end events in different database tables which would make the database queries more difficult to optimize (add index or de-normalize and add index).
Supported end events:
issues.created_at
issue_metrics.first_mentioned_in_commit_at
issue_metrics.first_associated_with_milestone_at
issue_metrics.first_added_to_board_at
merge_request_metrics.first_deployed_to_production_at
merge_request_metrics.latest_build_started_at
merge_request_metrics.latest_build_finished_at
merge_request_metrics.latest_closed_at
merge_request_metrics.first_commit_at
merge_request_metrics.merged_at
merge_requests.created_at
issues.closed_at
issues.last_edited_at
resource_label_events.created_at
Example 1: Index needed
In case the end event is merge_request_metrics.latest_build_finished_at, an extra index is needed on the target_project_id, latest_build_finished_at columns. Note that target_project_id is already available in the merge_request_metrics table; this de-normalization was done for another feature.
We might also need to tweak the query to avoid joining all the MergeRequest records in the group hierarchy and only look at the ones matching the merge_request_metrics.latest_build_finished_at filter.
Example query:
SELECT AVG(merge_request_metrics.latest_build_finished_at - merge_requests.created_at)
FROM merge_requests
INNER JOIN (merge_request_metrics...)
WHERE
merge_requests.target_project_id IN (recursive namespace query)
AND merge_request_metrics.latest_build_finished_at BETWEEN 'X' and 'Y'
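A minimal sketch of the extra index, assuming we also make it partial to keep it small (the index name and the partial condition are illustrative, not a final migration):
-- Supports filtering by project and the latest_build_finished_at end event:
CREATE INDEX CONCURRENTLY index_mr_metrics_on_project_and_latest_build_finished_at
  ON merge_request_metrics (target_project_id, latest_build_finished_at)
  WHERE latest_build_finished_at IS NOT NULL;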
Example 2: De-normalization
The issue_metrics database table does not have a project_id column, so in case the end event is issue_metrics.first_added_to_board_at, the DB loads all issues within the group hierarchy and joins the issue_metrics table. The query needs to scan all issue records to filter them down by the joined issue_metrics table.
To make it performant, we can add the project_id column to the issue_metrics table.
Note: the same thing needs to happen in the resource_label_events table.
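A rough sketch of the de-normalization for issue_metrics, assuming we backfill project_id from the issues table (the batching strategy and index name are illustrative):
-- 1. Add the de-normalized column:
ALTER TABLE issue_metrics ADD COLUMN project_id bigint;

-- 2. Backfill it in batches from the issues table (one batch shown):
UPDATE issue_metrics
SET project_id = issues.project_id
FROM issues
WHERE issues.id = issue_metrics.issue_id
  AND issue_metrics.id BETWEEN 1 AND 10000;

-- 3. Index supporting the first_added_to_board_at end event:
CREATE INDEX CONCURRENTLY index_issue_metrics_on_project_id_and_first_added_to_board_at
  ON issue_metrics (project_id, first_added_to_board_at);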
Database concerns
By doing these query optimizations for each database column (about 15 columns) we can make the feature as performant as the current version; however, in my opinion, the costs are quite high:
- De-normalization work for at least 2 tables (minimum 2 milestones).
- Index creation on (project_id, timestamp_column).
- Rewriting, optimizing and reviewing the issuable queries for each column.
- One index is about 3 GB; in some cases we would have fewer rows in the index.
I estimate around 40-50 GB for the indexes alone. Note that this number will grow at the same rate as GitLab.com grows.
Alternative solution
Value Stream Analytics is a premium feature, so only a small percentage of the MergeRequest (<10%) and Issue (<3%) records would be queried on GitLab.com. Looking at actual VSA usage, this number decreases a bit further (MergeRequest <5%, Issue <3%).
My idea is to collect and track the start and end event timestamps in a separate table so the queries will mostly be INDEX ONLY SCANs. To deal with the large data volume, we would run the AVERAGE and MEDIAN queries in batches (keyset paginated) with some sort of limit to avoid reading too many rows per web request. For example: read a maximum of 10x10_000 rows (10 queries). If the limit is reached, show a message: truncated results, too much data.
The database structure for collecting stage data (for merge requests). Table name: analytics_merge_request_stage_events
| column | type | description |
|---|---|---|
| merge_request_id | bigint | composite primary key with stage_id |
| stage_id | bigint | points to the stage record where the start and end event is defined |
| start_event_timestamp | bigint | stores the start event timestamp for the stage |
| end_event_timestamp | bigint | stores the end event timestamp for the stage |
| group_id | bigint | Project's group |
| project_id | bigint | Project where the merge request was created |
This structure allows very fast queries where we filter the end event with a time range. Indexes:
# in the root group
(stage_id, end_event_timestamp, start_event_timestamp, merge_request_id)
# in a subgroup
(stage_id, group_id, end_event_timestamp, start_event_timestamp, merge_request_id)
# in a project
(stage_id, project_id, end_event_timestamp, start_event_timestamp, merge_request_id)
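Expressed as PostgreSQL DDL, the table and the indexes above could look like this sketch (index names are illustrative and the exact column types may still change):
CREATE TABLE analytics_merge_request_stage_events (
  merge_request_id bigint NOT NULL,
  stage_id bigint NOT NULL,
  group_id bigint NOT NULL,
  project_id bigint NOT NULL,
  start_event_timestamp bigint, -- per the table above; e.g. epoch seconds
  end_event_timestamp bigint,   -- per the table above; e.g. epoch seconds
  PRIMARY KEY (stage_id, merge_request_id)
);

-- In the root group:
CREATE INDEX index_mr_stage_events_for_group_hierarchy_queries
  ON analytics_merge_request_stage_events (stage_id, end_event_timestamp, start_event_timestamp, merge_request_id);

-- In a subgroup:
CREATE INDEX index_mr_stage_events_for_subgroup_queries
  ON analytics_merge_request_stage_events (stage_id, group_id, end_event_timestamp, start_event_timestamp, merge_request_id);

-- In a project:
CREATE INDEX index_mr_stage_events_for_project_queries
  ON analytics_merge_request_stage_events (stage_id, project_id, end_event_timestamp, start_event_timestamp, merge_request_id);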
Query example:
SELECT AVG(end_event_timestamp - start_event_timestamp)
FROM analytics_merge_request_stage_events
WHERE end_event_timestamp BETWEEN '2020-01-01' AND '2020-06-06'
AND stage_id = 5
The merge_request_id would provide the tie-breaker behaviour when listing records (pagination) and could also be used when aggregating in batches (avoiding statement timeouts). For example: count and sum the durations for 10,000 records at a time. The query would run N times and we could calculate the average in Ruby:
# N = 2
count_batch_1 = 10000
sum_batch_1 = 123123123
count_batch_2 = 15
sum_batch_2 = 123
average = (sum_batch_1 + sum_batch_2) / (count_batch_1 + count_batch_2)
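A sketch of one aggregation batch in SQL, using (end_event_timestamp, merge_request_id) as the keyset cursor; the :from/:to and :last_* placeholders and the 10_000 batch size are illustrative:
-- One batch of the AVG aggregation. The application tracks the (end_event_timestamp,
-- merge_request_id) pair of the last row in each batch and passes it in as the cursor
-- for the next batch.
SELECT COUNT(*) AS batch_count,
       SUM(end_event_timestamp - start_event_timestamp) AS batch_sum
FROM (
  SELECT merge_request_id, start_event_timestamp, end_event_timestamp
  FROM analytics_merge_request_stage_events
  WHERE stage_id = 5
    AND end_event_timestamp BETWEEN :from AND :to
    AND (end_event_timestamp, merge_request_id) > (:last_end_event_timestamp, :last_merge_request_id)
  ORDER BY end_event_timestamp, merge_request_id
  LIMIT 10000
) batch;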
Finders and queries
The current IssuableFinder adds extra JOINs and EXISTS queries to perform the access control (confidential issues, project authorization). Since VSA operates on the Group level, we could check group permissions on page load so we can avoid complicated queries.
Data separation
VSA is an analytical feature that ideally should live in its own database (Timescale?). Today it's too early to make this move; however, with the sharding group's work, things might be easier in the future. For now, what we can do is prepare for easy separation from the core database tables:
- Eventually consistent data.
- Asynchronous data collection.
- No foreign keys on the database columns.
We'll need a well-defined interface where the data boundaries are crossed (for example, when the merge_requests table is joined). In case our data ends up in a different database, we could "easily" modify the queries to work across database nodes (disable_joins).
Data collection
Stage records are associated with a Group or with a Project. By collecting all stages in the group hierarchy, we can build batched queries to insert new data in the analytics_merge_request_stage_events database table.
The following table keeps track of the latest processed updated_at column value: analytics_merge_request_stage_updates
| column | type | description |
|---|---|---|
| stage_id | bigint | primary key |
| last_updated_at | timestamp | stores the last processed merge_requests.updated_at |
First run:
- Find the top-level Group for the stage.
- Build a batched (keyset paginated) query over the merge requests in the Group hierarchy. (PoC: !62379 (closed))
- Collect the timestamps that are required for the stages.
- Insert the rows into the analytics_merge_request_stage_events table.
- Insert a row with stage_id and last_updated_at values.
Second run:
- Find the top-level Group for the stage.
- Build a batched (keyset paginated) query over the merge requests in the Group hierarchy.
- Using the last_updated_at column, iterate over the recently updated records.
- Insert the rows into the analytics_merge_request_stage_events table.
- Update the last_updated_at column.
Note: we might need to do multiple LEFT JOINs to gather the relevant timestamp columns from different tables; however, this would not make the query significantly slower. The batch query would always read N MergeRequest rows.
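As a sketch, one incremental batch for a stage defined as merged_at -> first_deployed_to_production_at could look like the query below; the stage_id literal, the epoch-seconds representation, the batch size and the :last_updated_at placeholder are all illustrative assumptions:
INSERT INTO analytics_merge_request_stage_events
  (merge_request_id, stage_id, group_id, project_id, start_event_timestamp, end_event_timestamp)
SELECT
  merge_requests.id,
  5, -- the stage being processed
  projects.namespace_id,
  merge_requests.target_project_id,
  EXTRACT(EPOCH FROM merge_request_metrics.merged_at)::bigint, -- assumes epoch seconds for the bigint columns
  EXTRACT(EPOCH FROM merge_request_metrics.first_deployed_to_production_at)::bigint
FROM merge_requests
INNER JOIN projects ON projects.id = merge_requests.target_project_id
LEFT JOIN merge_request_metrics ON merge_request_metrics.merge_request_id = merge_requests.id
WHERE merge_requests.target_project_id IN (recursive namespace query)
  AND merge_requests.updated_at >= :last_updated_at -- re-processing boundary rows is safe, the upsert is idempotent
ORDER BY merge_requests.updated_at, merge_requests.id
LIMIT 1000
ON CONFLICT (stage_id, merge_request_id) DO UPDATE
  SET start_event_timestamp = EXCLUDED.start_event_timestamp,
      end_event_timestamp = EXCLUDED.end_event_timestamp;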
I'm not sure this is the best idea. Currently, I don't have a "cheaper" way of detecting Issue or MergeRequest changes in a group hierarchy. I don't think we have hosted-plan-specific pub-sub functionality.
Consistency
When a MergeRequest or an Issue record is updated, the record update process described above would "catch" the changes and the data would be updated eventually.
We also need to ensure consistency when a project is removed or some records are deleted. We can do that with a periodic job that scans the analytics_merge_request_stage_events database table and makes an EXISTS query against the merge_requests table.
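A sketch of that periodic cleanup, deleting stage event rows whose merge request no longer exists (the batch size is illustrative):
DELETE FROM analytics_merge_request_stage_events
WHERE (stage_id, merge_request_id) IN (
  SELECT stage_id, merge_request_id
  FROM analytics_merge_request_stage_events
  WHERE NOT EXISTS (
    SELECT 1
    FROM merge_requests
    WHERE merge_requests.id = analytics_merge_request_stage_events.merge_request_id
  )
  LIMIT 1000 -- process in batches from the periodic job
);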
Table size
Since we might have multiple stages per group, we would store quite a lot of data in the analytics_merge_request_stage_events database table (worst case: 5 rows per merge request record). To address table size concerns, we could do hash partitioning or time range partitioning (using a time-decay pattern if Product is OK with keeping 1 or 2 years of data).
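For example, hash partitioning by merge_request_id in PostgreSQL could look like this sketch (32 partitions is an arbitrary illustrative number):
-- The parent table from the DDL sketch above would be declared with:
--   ... PARTITION BY HASH (merge_request_id)
-- and split into fixed hash partitions:
CREATE TABLE analytics_merge_request_stage_events_00
  PARTITION OF analytics_merge_request_stage_events
  FOR VALUES WITH (MODULUS 32, REMAINDER 0);
-- analytics_merge_request_stage_events_01 .. _31 follow the same pattern.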
Cross joins
VSA allows filtering on labels, author, assignee, milestone. With the new structure, we would be able to query this data using JOINs. When we have complete separation (different data store), this data would need to be duplicated.
No FOSS
There is a free (FOSS) version of VSA with fixed stages, no customization allowed, and project-level only. Collecting data for FOSS users would cost too much storage and processing power, so to keep the data volume low we would not support the FOSS version with the new backend. We can keep the old behavior there (no timestamp filter change).