Skip to content
Snippets Groups Projects
Commit fb6af4f6 authored by Alex Ives's avatar Alex Ives :speech_balloon:
Browse files

Merge branch 'database-query-recording-blueprint' into 'master'

Add automated query analysis blueprint

See merge request !111490



Merged-by: Alex Ives's avatarAlex Ives <aives@gitlab.com>
Approved-by: default avatarDylan Griffith <dyl.griffith@gmail.com>
Approved-by: Alex Ives's avatarAlex Ives <aives@gitlab.com>
Reviewed-by: default avatarMatt Kasa <mkasa@gitlab.com>
Reviewed-by: default avatarMat Appelman <mappelman@gitlab.com>
Reviewed-by: default avatarRoger Woo <rwoo@gitlab.com>
Co-authored-by: default avatarMatt Kasa <mkasa@gitlab.com>
parents dd0bd481 5c77cdcb
No related branches found
No related tags found
1 merge request!111490Add automated query analysis blueprint
Pipeline #891392602 passed
---
status: proposed
creation-date: "2023-02-08"
authors: [ "@mattkasa", "@jon_jenkins" ]
coach: "@DylanGriffith"
approvers: [ "@rogerwoo", "@alexives" ]
owning-stage: "~devops::data_stores"
participating-stages: []
---
# Automated Query Analysis
## Problem Summary
Our overarching goal is to improve the reliability and throughput of GitLab’s
database review process. The current process requires merge request authors to
manually provide query plans and raw SQL when introducing new queries or
updating existing queries. This is both time consuming and error prone.
We believe we can improve operational efficiency by automatically identifying
and analyzing newly introduced SQL queries. This will reduce the risk of human
error, leading to improved system stability and an overall reduction in
performance regressions.
Our key success metric is a reduction in the number of manual actions required
by both code contributors and database reviewers, while maintaining a consistent
standard for database related code contributions.
## Goals
1. Replace the current process of the author manually obtaining SQL and query
plans with an automated process.
1. Decrease the incidence of performance regressions when poorly performing
queries are missed by a manual process.
1. Increase contributor and reviewer efficiency by automating the query testing
portion of database review.
## Challenges
- Capturing the number of SQL queries generated by an application the size of
`gitlab-org/gitlab` without causing an increase in CI time and/or resources
may present a challenge.
- Storing the number of SQL queries generated by an application the size of
`gitlab-org/gitlab` may consume large amounts of database storage.
## Opportunity
- Automated test suites already generate a large number of SQL queries, for
instance `rspec` test suites, that can be captured and used to perform
automated analysis.
- We already utilize `postgres.ai` to analyze query performance, and it has an
API that will allow us to automate the creation of database clones with
realistic production-like data in order to perform automated analysis.
- For customers who do not use something like `postgres.ai`, but who are
connecting to a test database in CI, we would use this connection to generate
query plans. The accuracy of these query plans will be affected by how
realistic the test data is, and can be improved by seeding the test database
with production-like data.
- By storing queries and their query plans, we can tokenize the query plan into
plan components, assign a cost and weight, then match those against a machine
learning model. We can build this model by generating query plans for queries
in our slow query logs, and assign actual cost and weight to their plan
components. This will allow us to leverage our corpus of queries and slow
query logs to predict the performance of arbitrary query text for other
applications and our customers.
## Proposal
We plan to automate the process of identifying new and changed database queries,
so that contributors and reviewers can more accurately and efficiently assess
the database performance impact of a code change.
We will capture queries generated as a side effect of running tests in CI,
normalize them, deduplicate them, analyze them using one or more analyzers, and
store them with their analyses and other metadata for future retrieval and
comparison.
We will post a comment to the originating merge request, containing a summary of
the new and changed queries, with links to their analyses, and highlighting any
queries that exceed established timing or other performance guidelines.
## Design and implementation details
### Iteration 1
In the first iteration we will focus on how we capture queries, including
normalization, deduplication, and storage. We must consider the performance and
resource impacts on CI pipelines during capture, and include things like
partitioning and time decay for the information we are storing.
#### Capturing queries
We will strive to limit the time and resource impacts on our CI pipelines as
much as possible. These are some of the options we will consider for capturing
queries:
- **Instrumenting `ActiveRecord` in `ruby`**
- Challenges:
- Only applies to `ruby` projects so it would not be applicable to projects
like `container-registry`.
- Has a non-zero impact on time and resources in CI pipelines (these impacts
can be observed in
[!111638](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/111638))
- Opportunities:
- Simple and straightforward to implement.
- Allows access to more information (eg. stacktrace and calling locations).
- **Connection proxy with logging**
- Challenges:
- Adds complexity and possible performance overhead.
- Requires maintaining the code for the proxy.
- Opportunities:
- Allows us to customize the capture.
- Allows us to perform normalization/deduplication at capture time.
- **Built-in logging in `postgresql`**
- Challenges:
- Require adding a configuration to enable logging.
- May be difficult to obtain the resulting logs.
- Opportunities:
- Doesn't require maintaining any code.
- Light weight in terms of performance impact.
- **Capture from `pg_stat_statements`**
- Challenges:
- Requires creating the extension in the test database.
- Requires adding a configuration to set `pg_stat_statements.max` to a value
high enough to capture all queries.
- Consumes shared memory proportional to `pg_stat_statements.max`.
- Opportunities:
- Requires minimal code.
- Simple to obtain the data.
- Data is already normalized.
We have already built a proof of concept for instrumenting `ActiveRecord` in
`ruby` in
[!111638](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/111638), so as a
first step we will benchmark the other capture methods against it and select the
best option.
#### Storing queries
For the next step of the first iteration we will use the proof of concept in
[!111638](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/111638) as well
as any data gathered from testing other capture methods to estimate the number
of rows per project, and use the pipeline execution statistics for
`gitlab-org/gitlab` to estimate throughput. These estimates will allow us to
evaluate storage mechanisms that are suitable for our purpose.
Some of the storage mechanisms we plan to evaluate are:
- **In the `ci` database in the GitLab database instance**
- Challenges:
- Places additional strain on this resource for `GitLab.com`.
- Opportunities:
- Allows us to utilize existing authentication and access control in the form of `CI_JOB_TOKEN`.
- Allows us to leverage associations with `ci_builds` and `ci_pipelines`.
- Simplifies deployment for self-managed.
- **In a new decomposed database in the GitLab database instance**
- Challenges:
- Adds to required development and testing effort.
- Adds to deployment effort for `GitLab.com`.
- Opportunities:
- Isolates database performance impacts from the existing `main` and `ci` database instances.
- **In a new external service**
- Challenges:
- Adds to required development and testing effort.
- Adds to deployment effort for `GitLab.com` and for self-managed.
- Opportunities:
- Isolates performance impacts from `gitlab-org/gitlab`.
- Allows us to iterate faster without impacting the main application.
- **In ClickHouse**
- Challenges:
- Not yet available for self-managed.
- Opportunities:
- Isolates database performance impacts from the existing `main` and `ci` database instances.
An example database schema for storing queries:
```sql
CREATE TABLE queries (
created_at timestamp with time zone NOT NULL,
updated_at timestamp with time zone NOT NULL,
id bigint NOT NULL,
project_id bigint NOT NULL,
analysis_id bigint,
hash text,
sql text
);
CREATE TABLE pipeline_queries (
id bigint NOT NULL,
project_id bigint NOT NULL,
pipeline_id bigint NOT NULL,
query_id bigint NOT NULL
);
CREATE TABLE analyses (
created_at timestamp with time zone NOT NULL,
updated_at timestamp with time zone NOT NULL,
id bigint NOT NULL,
project_id bigint NOT NULL,
query_id bigint NOT NULL,
buffers int,
walltime int,
explain text,
analysis_url text
);
```
One possible method of partitioning a schema like the above example would be to
utilize
[sub-partitioning](https://github.com/pgpartman/pg_partman/blob/master/doc/pg_partman.md#sub-partitioning).
If we partition by `project_id` then by some interval of `updated_at`, and touch
the row when we see a query, we can store only queries that the codebase is
still executing, and prune partitions that only contain queries the code is no
longer generating.
### Iteration 2
In the second iteration we plan to identify new and changed queries, and post MR
comments containing a summary. We will begin soliciting feedback on the accuracy
and utility of the information, and improve or filter it to maximize it's
usefulness.
### Iteration 3+
In the third and following iterations we plan to automate query analysis using
one or more analyzers, store these analyses, and add them to the MR comments. We
also intend to re-evaluate our use of the database to store query information,
and the API to retrieve it, and potentially move this to an external service.
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment