Merge branch 'database-query-recording-blueprint' into 'master'

Add automated query analysis blueprint See merge request !111490 Merged-by: Alex Ives <aives@gitlab.com> Approved-by: Dylan Griffith <dyl.griffith@gmail.com> Approved-by: Alex Ives <aives@gitlab.com> Reviewed-by: Matt Kasa <mkasa@gitlab.com> Reviewed-by: Mat Appelman <mappelman@gitlab.com> Reviewed-by: Roger Woo <rwoo@gitlab.com> Co-authored-by: Matt Kasa <mkasa@gitlab.com>

Merge branch 'database-query-recording-blueprint' into 'master'
fb6af4f6 · Alex Ives · dd0bd481 · 5c77cdcb · fb6af4f6
Commit fb6af4f6 authored 1 year ago by Alex Ives
--- a/doc/architecture/blueprints/database/automated_query_analysis/index.md
+++ b/doc/architecture/blueprints/database/automated_query_analysis/index.md
+---
+status: proposed
+creation-date: "2023-02-08"
+authors: [ "@mattkasa", "@jon_jenkins" ]
+coach: "@DylanGriffith"
+approvers: [ "@rogerwoo", "@alexives" ]
+owning-stage: "~devops::data_stores"
+participating-stages: []
+---
+
+# Automated Query Analysis
+
+## Problem Summary
+
+Our overarching goal is to improve the reliability and throughput of GitLab’s
+database review process. The current process requires merge request authors to
+manually provide query plans and raw SQL when introducing new queries or
+updating existing queries. This is both time consuming and error prone.
+
+We believe we can improve operational efficiency by automatically identifying
+and analyzing newly introduced SQL queries. This will reduce the risk of human
+error, leading to improved system stability and an overall reduction in
+performance regressions.
+
+Our key success metric is a reduction in the number of manual actions required
+by both code contributors and database reviewers, while maintaining a consistent
+standard for database related code contributions.
+
+## Goals
+
+1. Replace the current process of the author manually obtaining SQL and query
+   plans with an automated process.
+1. Decrease the incidence of performance regressions when poorly performing
+   queries are missed by a manual process.
+1. Increase contributor and reviewer efficiency by automating the query testing
+   portion of database review.
+
+## Challenges
+
+- Capturing the number of SQL queries generated by an application the size of
+  `gitlab-org/gitlab` without causing an increase in CI time and/or resources
+  may present a challenge.
+- Storing the number of SQL queries generated by an application the size of
+  `gitlab-org/gitlab` may consume large amounts of database storage.
+
+## Opportunity
+
+- Automated test suites already generate a large number of SQL queries, for
+  instance  `rspec` test suites, that can be captured and used to perform
+  automated analysis.
+- We already utilize `postgres.ai` to analyze query performance, and it has an
+  API that will allow us to automate the creation of database clones with
+  realistic production-like data in order to perform automated analysis.
+- For customers who do not use something like `postgres.ai`, but who are
+  connecting to a test database in CI, we would use this connection to generate
+  query plans. The accuracy of these query plans will be affected by how
+  realistic the test data is, and can be improved by seeding the test database
+  with production-like data.
+- By storing queries and their query plans, we can tokenize the query plan into
+  plan components, assign a cost and weight, then match those against a machine
+  learning model. We can build this model by generating query plans for queries
+  in our slow query logs, and assign actual cost and weight to their plan
+  components. This will allow us to leverage our corpus of queries and slow
+  query logs to predict the performance of arbitrary query text for other
+  applications and our customers.
+
+## Proposal
+
+We plan to automate the process of identifying new and changed database queries,
+so that contributors and reviewers can more accurately and efficiently assess
+the database performance impact of a code change.
+
+We will capture queries generated as a side effect of running tests in CI,
+normalize them, deduplicate them, analyze them using one or more analyzers, and
+store them with their analyses and other metadata for future retrieval and
+comparison.
+
+We will post a comment to the originating merge request, containing a summary of
+the new and changed queries, with links to their analyses, and highlighting any
+queries that exceed established timing or other performance guidelines.
+
+## Design and implementation details
+
+### Iteration 1
+
+In the first iteration we will focus on how we capture queries, including
+normalization, deduplication, and storage. We must consider the performance and
+resource impacts on CI pipelines during capture, and include things like
+partitioning and time decay for the information we are storing.
+
+#### Capturing queries
+
+We will strive to limit the time and resource impacts on our CI pipelines as
+much as possible. These are some of the options we will consider for capturing
+queries:
+
+- **Instrumenting `ActiveRecord` in `ruby`**
+  - Challenges:
+    - Only applies to `ruby` projects so it would not be applicable to projects
+      like `container-registry`.
+    - Has a non-zero impact on time and resources in CI pipelines (these impacts
+      can be observed in
+      [!111638](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/111638))
+  - Opportunities:
+    - Simple and straightforward to implement.
+    - Allows access to more information (eg. stacktrace and calling locations).
+- **Connection proxy with logging**
+  - Challenges:
+    - Adds complexity and possible performance overhead.
+    - Requires maintaining the code for the proxy.
+  - Opportunities:
+    - Allows us to customize the capture.
+    - Allows us to perform normalization/deduplication at capture time.
+- **Built-in logging in `postgresql`**
+  - Challenges:
+    - Require adding a configuration to enable logging.
+    - May be difficult to obtain the resulting logs.
+  - Opportunities:
+    - Doesn't require maintaining any code.
+    - Light weight in terms of performance impact.
+- **Capture from `pg_stat_statements`**
+  - Challenges:
+    - Requires creating the extension in the test database.
+    - Requires adding a configuration to set `pg_stat_statements.max` to a value
+      high enough to capture all queries.
+    - Consumes shared memory proportional to `pg_stat_statements.max`.
+  - Opportunities:
+    - Requires minimal code.
+    - Simple to obtain the data.
+    - Data is already normalized.
+
+We have already built a proof of concept for instrumenting `ActiveRecord` in
+`ruby` in
+[!111638](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/111638), so as a
+first step we will benchmark the other capture methods against it and select the
+best option.
+
+#### Storing queries
+
+For the next step of the first iteration we will use the proof of concept in
+[!111638](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/111638) as well
+as any data gathered from testing other capture methods to estimate the number
+of rows per project, and use the pipeline execution statistics for
+`gitlab-org/gitlab` to estimate throughput. These estimates will allow us to
+evaluate storage mechanisms that are suitable for our purpose.
+
+Some of the storage mechanisms we plan to evaluate are:
+
+- **In the `ci` database in the GitLab database instance**
+  - Challenges:
+    - Places additional strain on this resource for `GitLab.com`.
+  - Opportunities:
+    - Allows us to utilize existing authentication and access control in the form of `CI_JOB_TOKEN`.
+    - Allows us to leverage associations with `ci_builds` and `ci_pipelines`.
+    - Simplifies deployment for self-managed.
+- **In a new decomposed database in the GitLab database instance**
+  - Challenges:
+    - Adds to required development and testing effort.
+    - Adds to deployment effort for `GitLab.com`.
+  - Opportunities:
+    - Isolates database performance impacts from the existing `main` and `ci` database instances.
+- **In a new external service**
+  - Challenges:
+    - Adds to required development and testing effort.
+    - Adds to deployment effort for `GitLab.com` and for self-managed.
+  - Opportunities:
+    - Isolates performance impacts from `gitlab-org/gitlab`.
+    - Allows us to iterate faster without impacting the main application.
+- **In ClickHouse**
+  - Challenges:
+    - Not yet available for self-managed.
+  - Opportunities:
+    - Isolates database performance impacts from the existing `main` and `ci` database instances.
+
+An example database schema for storing queries:
+
+```sql
+CREATE TABLE queries (
+    created_at timestamp with time zone NOT NULL,
+    updated_at timestamp with time zone NOT NULL,
+    id bigint NOT NULL,
+    project_id bigint NOT NULL,
+    analysis_id bigint,
+    hash text,
+    sql text
+);
+CREATE TABLE pipeline_queries (
+    id bigint NOT NULL,
+    project_id bigint NOT NULL,
+    pipeline_id bigint NOT NULL,
+    query_id bigint NOT NULL
+);
+CREATE TABLE analyses (
+    created_at timestamp with time zone NOT NULL,
+    updated_at timestamp with time zone NOT NULL,
+    id bigint NOT NULL,
+    project_id bigint NOT NULL,
+    query_id bigint NOT NULL,
+    buffers int,
+    walltime int,
+    explain text,
+    analysis_url text
+);
+```
+
+One possible method of partitioning a schema like the above example would be to
+utilize
+[sub-partitioning](https://github.com/pgpartman/pg_partman/blob/master/doc/pg_partman.md#sub-partitioning).
+If we partition by `project_id` then by some interval of `updated_at`, and touch
+the row when we see a query, we can store only queries that the codebase is
+still executing, and prune partitions that only contain queries the code is no
+longer generating.
+
+### Iteration 2
+
+In the second iteration we plan to identify new and changed queries, and post MR
+comments containing a summary. We will begin soliciting feedback on the accuracy
+and utility of the information, and improve or filter it to maximize it's
+usefulness.
+
+### Iteration 3+
+
+In the third and following iterations we plan to automate query analysis using
+one or more analyzers, store these analyses, and add them to the MR comments. We
+also intend to re-evaluate our use of the database to store query information,
+and the API to retrieve it, and potentially move this to an external service.