---
title: "Code Review database size reduction for GitLab.com"
status: proposed
creation-date: "2026-04-15"
authors: [ "@zhaochen.li" ]
coaches: [ ]
dris: [ "@francoisrose", "@phikai", "@patrickbajao", "@dskim_gitlab" ]
owning-stage: "~devops::create"
participating-stages: []
toc_hide: true
---

{{< engineering/design-document-header >}}

## Summary

The Code Review group owns several of the largest tables on GitLab.com, including
`notes` (3,054 GB), `events` (2,371 GB), `merge_requests` (787 GB), and
`merge_request_diffs` (389 GB). Together these tables consume over 6,600 GB of
primary database storage and continue to grow, particularly as AI tooling
accelerates record creation.

This document proposes approaches to reduce the on-disk size of these
tables by approximately 50% through a combination of strategies: clearing cached
HTML fields for stale records, converting column types to more compact
representations, removing redundant columns and indexes, decomposing position
data into structured tables, enforcing retention policies, and reclaiming table
bloat. The strategies are prioritized by estimated savings, implementation effort,
and risk.

Related: [Blueprint for Code Review database size reduction (&20233)](https://gitlab.com/groups/gitlab-org/-/epics/20233),
[Code Review database size reduction (#17571)](https://gitlab.com/groups/gitlab-org/-/work_items/17571),
[Initial spike for database size reduction blueprint (#586185)](https://gitlab.com/gitlab-org/gitlab/-/issues/586185).

## Motivation

Large tables on GitLab.com are a major problem for both operations and
development. As tables grow beyond hundreds of gigabytes, several problems
compound:

1. **Query performance suffers.** Larger tables increase index sizes, slow down
   sequential scans, and reduce buffer cache hit rates.
1. **Table maintenance becomes expensive.** `VACUUM`, `ANALYZE`, and index
   rebuilds take longer and hold locks that affect application availability.
1. **Infrastructure costs increase.** Storage, I/O, replication lag, and backup
   times all scale with on-disk size.
1. **Data migrations become complex.** Schema changes on large tables require
   significantly more effort to implement and are more likely to cause stability
   problems on GitLab.com. For example, the integer-to-bigint column swap on
   `merge_requests` had to be split into three stages and took several months
   to complete
   ([#507695](https://gitlab.com/gitlab-org/gitlab/-/work_items/507695)).
1. **Operational risk grows.** Failovers and disaster recovery become slower and
   more fragile as data volume increases.

The [Database Scalability blueprint](/handbook/engineering/architecture/design-documents/database_size_limits/)
(June 2021) established a target of keeping individual physical tables under
100 GB on GitLab.com. Nearly five years later, multiple Code Review tables on GitLab.com still
exceed this threshold by 7x to 30x and have grown significantly since the
original analysis. Without intervention, these tables will continue to grow as
GitLab.com usage increases.

The tables over 100 GB on GitLab.com owned by or closely related to Code Review,
as of January 2026, are:

| Table | Size |
|---|---|
| `merge_request_diff_commits` | 7,875 GB |
| `merge_request_diff_files` | 3,290 GB |
| `notes` | 3,156 GB |
| `events` | 2,371 GB |
| `merge_requests` | 787 GB |
| `merge_request_diffs` | 451 GB |
| `note_diff_files` | 170 GB |
| `approval_merge_request_rules_users` | 160 GB |
| `merge_request_metrics` | 140 GB |

This document focuses on the remaining large tables after excluding the items
listed in [Non-Goals](#non-goals) below: `notes`, `events`, `merge_requests`,
and `merge_request_diffs`.

### Goals

1. Reduce the combined on-disk size of `notes`, `merge_requests`,
   `merge_request_diffs`, and `events` tables by approximately 50%.
1. Achieve the largest savings with the lowest-risk changes first (quick wins),
   then progress to larger structural changes.
1. Maintain backward compatibility with existing application behavior, with no
   user-facing feature regressions.
1. Establish repeatable patterns (for example, HTML cache clearing, retention policies) that
   other groups can adopt for their own large tables such as `issues` and
   `work_items`.
1. Update application and model code when needed to support larger structural
   changes (for example, table decomposition, column type conversions).
1. Deliver changes incrementally across multiple milestones, with each change
   independently valuable.

### Non-Goals

This document does not cover the two largest Code Review tables,
`merge_request_diff_commits` and `merge_request_diff_files`. Those tables are
already being addressed by separate epics:

- [Reduce the growth and size of the merge_request_diff_commits (&16385)](https://gitlab.com/groups/gitlab-org/-/epics/16385)
- [Partition and reduce size of merge_request_diff_files (&11272)](https://gitlab.com/groups/gitlab-org/-/epics/11272)

See [Out-of-scope opportunities](#out-of-scope-opportunities) below for other
opportunities identified during the investigation that are not initially in
scope.

## Proposal

The table below summarizes all in-scope opportunities, ordered by estimated
savings. We will iterate through them incrementally, prioritizing by
effort-to-impact ratio so that small-effort "quick wins" can be delivered in
parallel with larger structural changes; a concrete priority order will be
proposed in follow-up MRs. Detailed analysis for each opportunity follows in
the per-table sections.

| Opportunity | Table | Effort | Savings |
|---|---|---|---|
| Clear `note_html` for stale MRs | `notes` | Large | 1,000 GB |
| Decompose system notes | `notes` | Large | 800 GB |
| Convert position columns to structured table | `notes` | Large | 200 GB |
| Clear `description_html` and `title_html` for stale MRs | `merge_requests` | Large | 150 GB |
| Retention policy on `merge_request_diffs` | `merge_request_diffs`, `merge_request_diff_commits`, `merge_request_diff_files` | Large | TBD (expected large) |
| Reclaim bloat (`pg_repack`) | `merge_requests` | Small | 123 GB |
| Convert SHA columns to `bytea` | `merge_request_diffs` | Small | 78 GB |
| Drop redundant noteable index | `notes` | Small | 63 GB |
| Drop `external_diff` column and index | `merge_request_diffs` | Small | 52 GB |
| Drop `updated_at` column | `events` | Small | 34 GB |
| Drop/convert `index_notes_on_line_code` | `notes` | Small | 34 GB |
| Remove `merge_params` for merged MRs | `merge_requests` | Small | 25 GB |
| Convert `index_notes_on_organization_id` to partial | `notes` | Small | 19 GB |
| Convert SHA columns to `bytea` | `merge_requests` | Small | 15 GB |
| Convert integer columns to smaller types | `merge_request_diffs` | Small | 10 GB |
| Convert `merge_status` to `smallint` | `merge_requests` | Small | 3.5 GB |
| Drop `assignee_id` column and index | `merge_requests` | Small | ~2.7 GB |
| **Total** | | | **~2,604 GB** |

**Retention policy on `merge_request_diffs`.** Discussed in
  [issue #594843 (comment)](https://gitlab.com/gitlab-org/gitlab/-/issues/594843#note_3194219248).
  We expect savings to be large because a retention policy on
  `merge_request_diffs` would also reduce `merge_request_diff_commits` and
  `merge_request_diff_files`, but this overlaps with the separate epics already
  addressing those tables
  ([epic &16385](https://gitlab.com/groups/gitlab-org/-/epics/16385) and
  [epic &11272](https://gitlab.com/groups/gitlab-org/-/epics/11272)) and needs
  to be coordinated there. A savings estimate should be produced as part of
  that coordination.

### Out-of-scope opportunities

The following opportunities were identified during the investigation but are
not in scope for this design document. Each is documented here for visibility
and future follow-up:

| Opportunity | Table | Effort | Savings |
|---|---|---|---|
| 90-day retention policy | `events` | Large | 1,800 GB |
| Partition `events` table | `events` | Medium | 0 GB (enabler) |
| Merge namespace columns | `events` | Large | 50 GB |
| Drop `st_diff` column | `notes` | Medium | 20 GB |
| Drop `confidential` column | `notes` | Small | ~0.1 GB |

These are out of scope for the following reasons:

- **`events` table changes (90-day retention, partitioning, and merging
  namespace columns).** These approaches are not yet mature enough to commit
  to in this document. Code Review co-owns the `events` table because it
  contains MR-related data, but many of the features that rely on this data
  are owned by other groups, so changes here require cross-team collaboration
  and further POC work to validate feasibility, impact, and retention
  semantics. The
  [retention policy proposal (#571288)](https://gitlab.com/gitlab-org/gitlab/-/issues/571288)
  is the current starting point for that discussion.
- **Drop `st_diff` column.** Requires removing `LegacyDiffNote` handling
  across the application, which has a broader scope than a column-level change
  and is better tracked as a separate deprecation.
- **Drop `confidential` column.** Savings are negligible (~0.1 GB) and do not
  justify prioritizing this over larger opportunities. Can be finalized
  opportunistically alongside other `notes` changes.

The following sections provide detailed analysis for each opportunity, organized
by table. Each includes context from the
[initial spike investigation](https://gitlab.com/gitlab-org/gitlab/-/issues/586185).

### `notes` table (3,054 GB total: 2,246 GB columns + 808 GB indexes)

The `notes` table is the third largest table on GitLab.com. The top six columns
by size are `note_html`, `note`, `original_position`,
`position`, `discussion_id`, and `change_position`.

#### Clear `note_html` for stale merge requests

- **Estimated savings:** 1,000 GB (44.5% of `notes` column data)
- **Effort:** Large

`note_html` is a cached rendered version of the `note` field, generated by
`CacheMarkdownField`. It consumes 1,169 GB (52% of the table's column data).
For notes on merge requests that have not been accessed recently, this cached
value can be cleared and regenerated on demand.

The approach:

1. Define "stale" criteria: for example, merged MRs with `updated_at` older
   than 3 months and open MRs with `updated_at` older than 6 months.
1. Run an async worker to set `note_html` to `NULL` for notes belonging to stale
   MRs.
1. On read, if `note_html` is `NULL`, regenerate from `note` and persist back to
   the database. The existing `CacheMarkdownField` module already supports this
   pattern through `cached_markdown_version`.
1. Benchmark the performance impact of on-the-fly regeneration for API endpoints
   that return many notes (for example, merge request discussions API).

There have been past markdown cache version bumps (for example, `492f0853`, `e7a98807`) which
essentially trigger the same regeneration, suggesting the performance impact
is manageable.
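The read-through pattern in steps 2 and 3 can be sketched in plain Ruby. This is a sketch only: the class and method names are illustrative stand-ins rather than the real `CacheMarkdownField` API, and `render_markdown` is a placeholder for the actual Markdown pipeline.

```ruby
# Sketch of lazy HTML cache regeneration. None of these names are the
# actual GitLab API; `render_markdown` stands in for the Banzai pipeline.
class CachedNote
  CACHE_VERSION = 2

  attr_reader :note, :note_html, :cached_markdown_version

  def initialize(note:, note_html: nil, cached_markdown_version: nil)
    @note = note
    @note_html = note_html
    @cached_markdown_version = cached_markdown_version
  end

  # Clearing the cache for stale notes just nulls the rendered copy.
  def clear_html_cache!
    @note_html = nil
    @cached_markdown_version = nil
  end

  # On read, regenerate and persist when the cache is missing or outdated.
  def rendered_html
    if @note_html.nil? || @cached_markdown_version != CACHE_VERSION
      @note_html = render_markdown(@note)
      @cached_markdown_version = CACHE_VERSION
    end
    @note_html
  end

  private

  def render_markdown(text)
    "<p>#{text}</p>" # placeholder for the Markdown rendering pipeline
  end
end
```

The key property is that clearing the cache is always safe: the next read repopulates it, which is the same behavior a cache-version bump relies on.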

This pattern can later be applied to `description_html` and `title_html` on
`merge_requests`, `issues`, and `work_items` tables. Nicolas Dular from the Plan
group has expressed support for this approach and noted it could also benefit the
`issues` table.

#### Decompose system notes

- **Estimated savings:** 800 GB (35.6% of `notes` column data)
- **Effort:** Large

Approximately 79% of notes rows are system-generated notes. Unlike user-authored
notes, system notes are mostly structured data rendered into full text (for
example, "added 3 commits", "changed the description", "mentioned in !1234"). Instead of
storing the full rendered text, we can store only the structured parameters
needed to reconstruct the message on the fly (for example, action type, count, reference).

There are two possible approaches:

- **Decompose into a new table.** Move system notes into a dedicated
  `system_notes` table with structured columns for each action type. This
  reduces the effective size of both the original and new tables, improving
  query performance for each access pattern.
- **Add structured columns to the existing table.** Add columns for the
  structured parameters and clear the `note` and `note_html` text fields for
  system notes, avoiding the complexity of a table split while still reclaiming
  the storage.
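Either way, the core idea is to persist only structured parameters and rebuild the text on read. A minimal sketch, assuming hypothetical action names and templates (the real system-note set is much larger):

```ruby
# Sketch: store structured parameters instead of rendered system-note text.
# Action names and templates here are illustrative, not the real set.
class SystemNote
  TEMPLATES = {
    commits_added:      ->(p) { "added #{p[:count]} commits" },
    description_change: ->(_) { "changed the description" },
    mentioned_in:       ->(p) { "mentioned in !#{p[:mr_iid]}" }
  }.freeze

  def initialize(action, params = {})
    @action = action
    @params = params
  end

  # The full text is reconstructed on the fly from a few small columns,
  # so neither `note` nor `note_html` needs to be stored for system notes.
  def text
    TEMPLATES.fetch(@action).call(@params)
  end
end
```

A row then only needs an action discriminator plus a handful of small parameter columns, instead of two text blobs.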

#### Convert `position`, `original_position`, and `change_position` to a structured table

- **Estimated savings:** 200 GB (3.9% of `notes` table)
- **Effort:** Large

These three columns store YAML-serialized position data for `DiffNote` records.
Currently each field is a YAML string consuming approximately 520 bytes per row.
Converting to a structured table (similar to the existing `DiffNotePosition`
model and its `diff_note_positions` table) reduces storage to approximately
350 bytes per row.

Additionally, `position` holds the exact same data as `original_position` for
approximately 2.28% of rows. This redundancy can be eliminated.

We investigated whether converting from YAML strings to `jsonb` would help, but
YAML strings actually use less space than `jsonb` due to TOAST compression.
The structured table approach provides the best savings.
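A rough plain-Ruby illustration of why the structured layout wins. The payload below mirrors the shape of diff-position data but is illustrative, and the per-row accounting ignores tuple headers:

```ruby
require 'yaml'

# Illustrative diff-position payload; field names mirror the shape of
# DiffNote position data but are not an exact copy of production rows.
position = {
  'base_sha'  => 'a' * 40,
  'start_sha' => 'b' * 40,
  'head_sha'  => 'c' * 40,
  'old_path'  => 'app/models/note.rb',
  'new_path'  => 'app/models/note.rb',
  'position_type' => 'text',
  'old_line' => nil,
  'new_line' => 42
}

# Serialized form: key names, hex SHAs, and YAML syntax all cost bytes.
yaml_bytes = YAML.dump(position).bytesize

# Structured storage: three 20-byte binary SHAs, two path strings, a
# smallint type discriminator, and two integer line numbers.
structured_bytes = 3 * 20 + position['old_path'].bytesize +
                   position['new_path'].bytesize + 2 + 2 * 4
```

The serialized form pays for key names, hex-encoded SHAs, and YAML syntax on every row; the structured table pays for each value once, in its natural width.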

#### Drop redundant `noteable_id`/`noteable_type`/`system` index

- **Estimated savings:** 63 GB (2.1% of `notes` table)
- **Effort:** Small

The composite index `index_notes_on_noteable_id_and_noteable_type_and_system` is
63 GB and has minimal usage. We need to evaluate whether queries can be served by other existing
indexes before removal.

#### Drop or convert `index_notes_on_line_code` to partial

- **Estimated savings:** 34 GB (1.1% of `notes` table)
- **Effort:** Small

This index is 36 GB. Grafana metrics show it is seldom used (fewer than 0.8
scans per second, with occasional spikes). The only usage found is for
`LegacyDiffNote`, a legacy implementation: new diff notes use the `DiffNote`
type, and the legacy type is now created mainly by import. The index can be
converted to a partial index or dropped after confirming no active query paths
depend on it.

#### Drop `st_diff` column and remove `LegacyDiffNote` type

- **Estimated savings:** 20 GB (0.7% of `notes` table)
- **Effort:** Medium

The `st_diff` field is only used by `LegacyDiffNote`. We can migrate away from
the legacy note type and drop this column, saving approximately 20 GB.

#### Convert `index_notes_on_organization_id` to partial

- **Estimated savings:** 19 GB (0.6% of `notes` table)
- **Effort:** Small

This index is 19 GB but has near-zero usage over the past 90 days. The column
`organization_id` is almost entirely `NULL` (only 451 KB of actual data). There
is no actual usage in application code; the index exists solely for the
Cells/Organization sharding initiative. Converting to a partial index on
`WHERE organization_id IS NOT NULL` shrinks the index from 19 GB to near zero.
This requires confirmation from the Tenant Scale team.
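As a sketch, the replacement index could be built concurrently and then swapped in. The temporary index name is illustrative, and the real change would go through the standard migration helpers:

```sql
-- Build the tiny partial replacement first, without blocking writes.
CREATE INDEX CONCURRENTLY tmp_index_notes_on_organization_id_partial
  ON notes (organization_id)
  WHERE organization_id IS NOT NULL;

-- Then drop the full index and take over its name.
DROP INDEX CONCURRENTLY index_notes_on_organization_id;
ALTER INDEX tmp_index_notes_on_organization_id_partial
  RENAME TO index_notes_on_organization_id;
```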

#### Drop `confidential` column (migrated to `internal`)

- **Estimated savings:** ~0.1 GB
- **Effort:** Small

The `confidential` column is a duplicate of `internal` after the migration in
[Rename confidential column in notes tables (#367923)](https://gitlab.com/gitlab-org/gitlab/-/issues/367923).
Finalize the migration by dropping the column and the associated
`index_notes_on_id_where_confidential` index (22 MB).

#### Total estimated savings for `notes`: ~1,700 GB (56%)

### `merge_requests` table (787 GB total: 551 GB columns + 254 GB indexes)

The `merge_requests` table total size is 804 GB (including ~123 GB of
reclaimable bloat). The top columns by size are `description` and
`description_html` (251 GB, 45.6%), `title` and `title_html` (41 GB, 7.45%),
and `merge_params` (26 GB, 4.73%).

#### Clear `description_html` and `title_html` for stale merge requests

- **Estimated savings:** 150 GB (19.9% of `merge_requests` table)
- **Effort:** Large

`description_html` consumes 160 GB and `title_html` consumes 27 GB. Both are
`CacheMarkdownField` caches; the sources of truth are `description` and
`title`. The same approach described for `note_html` above applies here.

#### Reclaim table bloat (`pg_repack`)

- **Estimated savings:** 123 GB (18.8% of `merge_requests` physical size)
- **Effort:** Small (requires DB team coordination)

Analysis shows a 151 GB difference between the physical table size (551 GB) and
the actual column data (400 GB). This is attributed to: table bloat (~123 GB of
reclaimable dead tuples), row metadata (~11 GB), and alignment padding and page
headers (~17 GB). The bloat is likely caused by the bigint migration and
description updates. Running `pg_repack` or `VACUUM FULL` can reclaim this space,
coordinated with the Database team for production execution.

#### Remove `merge_params` for merged merge requests

- **Estimated savings:** 25 GB (3.8% of `merge_requests` table)
- **Effort:** Small

`merge_params` contains highly repetitive data. For example,
`force_remove_source_branch: '0'` is the default behavior for any MR, yet it is
persisted for every row. After an MR is merged, `merge_params` is no longer
needed by the application. Most of this data is also available in Gitaly if
needed later.

Discussion with the Code Review backend team on Slack confirmed there is no known
usage of `merge_params` after the MR is merged. We can run an async worker
daily or weekly to clear `merge_params` for MRs merged more than 7 days ago,
reducing the column from 26 GB to under 1 GB. The `merge_params` field could
also be converted to `jsonb` if we choose not to decompose it into a separate
table.
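The cleanup can be expressed roughly as follows. This is a hedged sketch: the `state_id = 3` merged predicate and the `merge_request_metrics.merged_at` join are assumptions about the current schema, and production would run this in small id batches from the worker rather than as one statement:

```sql
-- Illustrative one-shot form; the real worker would iterate in id batches.
UPDATE merge_requests mr
   SET merge_params = NULL
 WHERE mr.state_id = 3  -- merged (assumed enum value)
   AND mr.merge_params IS NOT NULL
   AND EXISTS (
         SELECT 1
           FROM merge_request_metrics m
          WHERE m.merge_request_id = mr.id
            AND m.merged_at < now() - interval '7 days'
       );
```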

#### Convert SHA columns to `bytea`

- **Estimated savings:** 15 GB (1.9% of `merge_requests` table)
- **Effort:** Small

Three SHA fields are stored as `character varying`, which uses a hex-encoded
text representation:

- `squash_commit_sha`: varchar field takes 42 bytes, `bytea` takes 20 bytes.
  65.7M rows x 22 bytes saved = 1.45 GB.
- `merge_commit_sha`: varchar field takes 42 bytes, `bytea` takes 20 bytes.
  272.3M rows x 22 bytes saved = 5.99 GB.
- `merged_commit_sha`: varchar field takes 82 bytes (double-encoded), `bytea`
  takes 20 bytes. 121.4M rows x 62 bytes saved = 7.53 GB.

This standardizes SHA storage. The existing `in_progress_merge_commit_sha`
column already uses `bytea`, so there is precedent in this table.
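The halving is easy to see in plain Ruby: a hex-encoded SHA spends two bytes of text per byte of digest, while the per-value varlena header applies to varchar and `bytea` alike:

```ruby
# A 40-character hex SHA stored as text vs. the same digest as raw bytes.
hex_sha = 'deadbeef' * 5        # 40 hex characters, as varchar would store
raw_sha = [hex_sha].pack('H*')  # bytea-style binary representation

hex_bytes = hex_sha.bytesize    # 40 bytes of payload
raw_bytes = raw_sha.bytesize    # 20 bytes of payload
```

The conversion is lossless: `raw_sha.unpack1('H*')` recovers the original hex string.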

#### Convert `merge_status` from `varchar` to `smallint`

- **Estimated savings:** 3.5 GB (0.5% of `merge_requests` table)
- **Effort:** Small

`merge_status` is defined as `character varying(510)` but only stores 7 possible
enum values (`unchecked`, `preparing`, `checking`, `can_be_merged`,
`cannot_be_merged`, `cannot_be_merged_recheck`, `cannot_be_merged_rechecking`).
Each row consumes approximately 11 bytes. Converting to `smallint` (2 bytes)
with a Rails enum mapping saves approximately 10 bytes per row, with no change
above the model layer.
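A sketch of the mapping layer; the integer assignments below are illustrative, not the values a real migration would have to lock in:

```ruby
# Sketch of a varchar-to-smallint enum mapping for merge_status.
# The integer values are illustrative placeholders.
MERGE_STATUSES = {
  unchecked: 0, preparing: 1, checking: 2, can_be_merged: 3,
  cannot_be_merged: 4, cannot_be_merged_recheck: 5,
  cannot_be_merged_rechecking: 6
}.freeze

# What a Rails enum does under the hood: strings at the model layer,
# smallints in the column.
def merge_status_to_smallint(status)
  MERGE_STATUSES.fetch(status.to_sym)
end

def smallint_to_merge_status(value)
  MERGE_STATUSES.key(value).to_s
end
```

Once the chosen integer values are frozen, callers above the model layer continue to see the string statuses unchanged.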

#### Drop legacy `assignee_id` column

- **Estimated savings:** ~2.7 GB (column + index)
- **Effort:** Small

The `assignee_id` column has been replaced by the `merge_request_assignees`
association. The column itself is only 43 MB, but the associated index
`index_merge_requests_on_assignee_id` is 2.62 GB. Both can be dropped after
confirming the deprecation is complete.

#### Total estimated savings for `merge_requests`: ~311.5 GB (47.6%)

### `merge_request_diffs` table (389 GB total: 203 GB columns + 186 GB indexes)

#### Drop `external_diff` column and index

- **Estimated savings:** 52 GB (13.4% of `merge_request_diffs` table)
- **Effort:** Small

The `external_diff` column is no longer populated; its value was handled on
the CarrierWave side. The column and its associated index
`index_merge_request_diffs_on_external_diff` (14 GB) can be removed, saving
approximately 52 GB total.

#### Convert 3 SHA columns to `bytea`

- **Estimated savings:** 78 GB (20.1% of `merge_request_diffs` table)
- **Effort:** Small

`base_commit_sha`, `start_commit_sha`, and `head_commit_sha` can be converted
from `character varying` to `bytea`, following the same approach as the
`merge_requests` SHA columns. Each index on these columns also shrinks by
approximately one-third.

#### Convert `real_size`, `state`, `external_diff_store`, and `commits_count` to smaller integer types

- **Estimated savings:** 10 GB (2.6% of `merge_request_diffs` table)
- **Effort:** Small

These columns currently use larger integer types than necessary. Converting to
1-byte or 2-byte integers where the value range permits saves approximately
10 GB.

#### Total estimated savings for `merge_request_diffs`: ~140 GB (36%)

### `events` table (2,371 GB)

The `events` table has a clean schema with limited optimization potential at the
column or index level. The table definition is well-designed, and the content
looks well-structured in terms of what events we store. The primary savings
opportunities come from data lifecycle management.

#### Drop `updated_at` column

- **Estimated savings:** 34 GB (1.4% of `events` table)
- **Effort:** Small

Events are append-only and immutable. Analysis shows only 0.02% of rows have
different `created_at` and `updated_at` values, and most of those differ by only
nanoseconds or milliseconds. Deeper investigation by Abdul Wadood confirmed that
rows where `updated_at` and `created_at` differ by more than 10 seconds have not
occurred in the last year (the last such rows are from 2024).

There is no index on `updated_at`, which implies it is not actively used for
queries. However, as Shane Maglangit noted, absence of an index does not
definitively prove the column is unused (for example, `namespaces.updated_at` is heavily
used but has no index). We should double-check application code before acting.
If needed for backward compatibility, we can alias `updated_at`
to `created_at` in Rails.

#### Merge `project_id`, `group_id`, and `personal_namespace_id` into `namespace_id`

- **Estimated savings:** 50 GB (2.1% of `events` table)
- **Effort:** Large

The events table stores three separate columns for the owning entity. These
could potentially be merged into a single `namespace_id` column. However,
merging these columns would require joining with the `namespaces` table to find
events (since `projects` is a different table from `namespaces`), which would
slow down already slow event queries. This needs careful benchmarking before
proceeding.

#### 90-day retention policy

- **Estimated savings:** 1,800 GB (75.9% of `events` table)
- **Effort:** Large

This is the single largest opportunity across all tables. Both GitHub and Azure
DevOps offer 90-day event retention. A similar policy would dramatically reduce
the `events` table size. Christina Lohr (`@lohrc`) has
[recommended a 90-day retention period](https://gitlab.com/gitlab-org/gitlab/-/issues/571288),
noting that `events` largely duplicates data that can be found in or
reconstructed from other tables.

This strategy requires:

1. Product input on acceptable retention periods.
1. Evaluation of whether historical events can be reconstructed from other data
   sources. Notably, `push_event_payloads` is a highly entangled table that may
   not be reconstructable, and we need to consider whether we can delete events
   for open issues and MRs.
1. Time-based partitioning of the `events` table to enable efficient partition
   dropping rather than row-by-row deletion (the
   time-decay data pattern describes this
   approach in detail).
1. Archiving events older than 90 days to object storage or a data warehouse for
   compliance and analytics use cases.

One additional idea: identify bot and automation actions and treat them
differently (storing fewer fields, or not persisting them to the DB at all).
This could significantly reduce the table's growth rate, especially as AI
tooling increases record creation. This needs Product input first.

#### Table partitioning

- **Estimated savings:** 0 GB direct (enables retention and improves maintenance)
- **Effort:** Medium

Even without a retention policy, partitioning the `events` table by `created_at`
improves query performance, vacuum efficiency, and maintenance operations. It is
a prerequisite for implementing an efficient retention policy.

Kerri Miller noted that a blend of approaches will serve best in the long term,
and that partitioning might be appropriate even with a short retention period, as
we can use time-based partitions and drop the oldest one every 90 days.
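As a sketch of that combination (table, partition, and column names are illustrative; GitLab's own partitioning tooling would drive the real change):

```sql
-- Range-partition events by created_at so retention becomes a partition drop.
CREATE TABLE events_partitioned (
  id bigint NOT NULL,
  author_id bigint NOT NULL,
  action smallint NOT NULL,
  created_at timestamptz NOT NULL
  -- ... remaining columns ...
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2026_01 PARTITION OF events_partitioned
  FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');

-- Enforcing ~90-day retention is then cheap metadata work, not row deletes:
DROP TABLE events_2025_10;
```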

#### Total estimated savings for `events`: ~1,884 GB (if retention is adopted)

## Design and implementation details

TBD

## Alternative Solutions

TBD