---
title: "Code Review database size reduction for GitLab.com"
status: proposed
creation-date: "2026-04-15"
authors: [ "@zhaochen.li" ]
coaches: [ ]
dris: [ "@francoisrose", "@phikai", "@patrickbajao", "@dskim_gitlab" ]
owning-stage: "~devops::create"
participating-stages: []
toc_hide: true
---

{{< engineering/design-document-header >}}

## Summary

The Code Review group owns several of the largest tables on GitLab.com, including `notes` (3,054 GB), `events` (2,371 GB), `merge_requests` (787 GB), and `merge_request_diffs` (389 GB). Together these tables consume over 6,600 GB of primary database storage and continue to grow, particularly as AI tooling accelerates record creation.

This document proposes approaches to reduce the on-disk size of these tables by approximately 50% through a combination of strategies: clearing cached HTML fields for stale records, converting column types to more compact representations, removing redundant columns and indexes, decomposing position data into structured tables, enforcing retention policies, and reclaiming table bloat. The strategies are prioritized by estimated savings, implementation effort, and risk.

Related: [Blueprint for Code Review database size reduction (&20233)](https://gitlab.com/groups/gitlab-org/-/epics/20233), [Code Review database size reduction (#17571)](https://gitlab.com/groups/gitlab-org/-/work_items/17571), [Initial spike for database size reduction blueprint (#586185)](https://gitlab.com/gitlab-org/gitlab/-/issues/586185).

## Motivation

Large tables on GitLab.com are a major problem for both operations and development. As tables grow beyond hundreds of gigabytes, several problems compound:

1. **Query performance suffers.** Larger tables increase index sizes, slow down sequential scans, and reduce buffer cache hit rates.
1. **Table maintenance becomes expensive.** `VACUUM`, `ANALYZE`, and index rebuilds take longer and hold locks that affect application availability.
1. **Infrastructure costs increase.** Storage, I/O, replication lag, and backup times all scale with on-disk size.
1. **Data migrations become complex.** Schema changes on large tables require significantly more effort to implement and are more likely to cause stability problems on GitLab.com. For example, swapping bigint columns with integer columns on `merge_requests` had to be split into 3 stages and took several months to complete ([#507695](https://gitlab.com/gitlab-org/gitlab/-/work_items/507695)).
1. **Operational risk grows.** Failovers and disaster recovery become slower and more fragile as data volume increases.

The [Database Scalability blueprint](/handbook/engineering/architecture/design-documents/database_size_limits/) (June 2021) established a target of keeping individual physical tables under 100 GB on GitLab.com. Nearly five years later, multiple Code Review tables on GitLab.com still exceed this threshold by 7x to 30x and have grown significantly since the original analysis. Without intervention, these tables will continue to grow as GitLab.com usage increases.

The tables over 100 GB on GitLab.com owned by or closely related to Code Review, as of January 2026, are:

| Table | Size |
|---|---|
| `merge_request_diff_commits` | 7,875 GB |
| `merge_request_diff_files` | 3,290 GB |
| `notes` | 3,156 GB |
| `events` | 2,371 GB |
| `merge_requests` | 787 GB |
| `merge_request_diffs` | 451 GB |
| `note_diff_files` | 170 GB |
| `approval_merge_request_rules_users` | 160 GB |
| `merge_request_metrics` | 140 GB |

This document focuses on the remaining large tables after excluding the items listed in [Non-Goals](#non-goals) below: `notes`, `events`, `merge_requests`, and `merge_request_diffs`.

### Goals

1. Reduce the combined on-disk size of the `notes`, `merge_requests`, `merge_request_diffs`, and `events` tables by approximately 50%.
1. Achieve the largest savings with the lowest-risk changes first (quick wins), then progress to larger structural changes.
1. Maintain backward compatibility with existing application behavior, with no user-facing feature regressions.
1. Establish repeatable patterns (for example, HTML cache clearing, retention policies) that other groups can adopt for their own large tables, such as `issues` and `work_items`.
1. Update application and model code when needed to support larger structural changes (for example, table decomposition, column type conversions).
1. Deliver changes incrementally across multiple milestones, with each change independently valuable.

### Non-Goals

This document does not cover the two largest Code Review tables, `merge_request_diff_commits` and `merge_request_diff_files`. Those tables are already being addressed by separate epics:

- [Reduce the growth and size of merge_request_diff_commits (&16385)](https://gitlab.com/groups/gitlab-org/-/epics/16385)
- [Partition and reduce size of merge_request_diff_files (&11272)](https://gitlab.com/groups/gitlab-org/-/epics/11272)

See [Out-of-scope opportunities](#out-of-scope-opportunities) below for other opportunities identified during the investigation that are not initially in scope.

## Proposal

The table below summarizes all in-scope opportunities in the order we intend to pursue them, starting with the largest savings. We will iterate through these incrementally, prioritizing by effort-to-impact ratio so that small-effort "quick wins" can be delivered in parallel with larger structural changes. A concrete priority order will be proposed in follow-up MRs. Detailed analysis for each opportunity follows in the per-table sections.

| Opportunity | Table | Effort | Savings |
|---|---|---|---|
| Clear `note_html` for stale MRs | `notes` | Large | 1,000 GB |
| Decompose system notes | `notes` | Large | 800 GB |
| Convert position columns to structured table | `notes` | Large | 200 GB |
| Clear `description_html` and `title_html` for stale MRs | `merge_requests` | Large | 150 GB |
| Retention policy on `merge_request_diffs` | `merge_request_diffs`, `merge_request_diff_commits`, `merge_request_diff_files` | Large | TBD (expected large) |
| Reclaim bloat (`pg_repack`) | `merge_requests` | Small | 123 GB |
| Convert SHA columns to `bytea` | `merge_request_diffs` | Small | 78 GB |
| Drop redundant noteable index | `notes` | Small | 63 GB |
| Drop `external_diff` column and index | `merge_request_diffs` | Small | 52 GB |
| Drop `updated_at` column | `events` | Small | 34 GB |
| Drop/convert `index_notes_on_line_code` | `notes` | Small | 34 GB |
| Remove `merge_params` for merged MRs | `merge_requests` | Small | 25 GB |
| Convert `index_notes_on_organization_id` to partial | `notes` | Small | 19 GB |
| Convert SHA columns to `bytea` | `merge_requests` | Small | 15 GB |
| Convert integer columns to smaller types | `merge_request_diffs` | Small | 10 GB |
| Convert `merge_status` to `smallint` | `merge_requests` | Small | 3.5 GB |
| Drop `assignee_id` column and index | `merge_requests` | Small | ~2.7 GB |
| **Total** | | | **~2,604 GB** |

**Retention policy on `merge_request_diffs`.** Discussed in [issue #594843 (comment)](https://gitlab.com/gitlab-org/gitlab/-/issues/594843#note_3194219248). We expect savings to be large because a retention policy on `merge_request_diffs` would also reduce `merge_request_diff_commits` and `merge_request_diff_files`, but this overlaps with the separate epics already addressing those tables ([epic &16385](https://gitlab.com/groups/gitlab-org/-/epics/16385) and [epic &11272](https://gitlab.com/groups/gitlab-org/-/epics/11272)) and needs to be coordinated there. A savings estimate should be produced as part of that coordination.

### Out-of-scope opportunities

The following opportunities were identified during the investigation but are not in scope for this design document. Each is documented here for visibility and future follow-up:

| Opportunity | Table | Effort | Savings |
|---|---|---|---|
| 90-day retention policy | `events` | Large | 1,800 GB |
| Partition `events` table | `events` | Medium | 0 GB (enabler) |
| Merge namespace columns | `events` | Large | 50 GB |
| Drop `st_diff` column | `notes` | Medium | 20 GB |
| Drop `confidential` column | `notes` | Small | ~0.1 GB |

These are out of scope for the following reasons:

- **`events` table changes (90-day retention, partitioning, and merging namespace columns).** These approaches are not yet mature enough to commit to in this document. Code Review co-owns the `events` table because it contains MR-related data, but many of the features that rely on this data are owned by other groups, so changes here require cross-team collaboration and further POC work to validate feasibility, impact, and retention semantics. The [retention policy proposal (#571288)](https://gitlab.com/gitlab-org/gitlab/-/issues/571288) is the current starting point for that discussion.
- **Drop `st_diff` column.** Requires removing `LegacyDiffNote` handling across the application, which has a broader scope than a column-level change and is better tracked as a separate deprecation.
- **Drop `confidential` column.** Savings are negligible (~0.1 GB) and do not justify prioritizing this over larger opportunities. It can be finalized opportunistically alongside other `notes` changes.

The following sections provide detailed analysis for each opportunity, organized by table. Each includes context from the [initial spike investigation](https://gitlab.com/gitlab-org/gitlab/-/issues/586185).
### `notes` table (3,054 GB total: 2,246 GB columns + 808 GB indexes)

The `notes` table is the third largest table on GitLab.com. The top six columns by size are `note_html`, `note`, `original_position`, `position`, `discussion_id`, and `change_position`.

#### Clear `note_html` for stale merge requests

- **Estimated savings:** 1,000 GB (44.5% of `notes` table)
- **Effort:** Large

`note_html` is a cached rendered version of the `note` field, generated by `CacheMarkdownField`. It consumes 1,169 GB (52% of the table). For notes on merge requests that have not been accessed recently, this cached value can be cleared and regenerated on demand. The approach:

1. Define "stale" criteria: for merged MRs, for example, `updated_at` older than 3 months; for open MRs, `updated_at` older than 6 months.
1. Run an async worker to set `note_html` to `NULL` for notes belonging to stale MRs.
1. On read, if `note_html` is `NULL`, regenerate it from `note` and persist it back to the database. The existing `CacheMarkdownField` module already supports this pattern through `cached_markdown_version`.
1. Benchmark the performance impact of on-the-fly regeneration for API endpoints that return many notes (for example, the merge request discussions API). Past markdown cache version bumps (for example, `492f0853`, `e7a98807`) essentially triggered the same regeneration, suggesting the performance impact is manageable.

This pattern can later be applied to `description_html` and `title_html` on the `merge_requests`, `issues`, and `work_items` tables. Nicolas Dular from the Plan group has expressed support for this approach and noted it could also benefit the `issues` table.

#### Decompose system notes

- **Estimated savings:** 800 GB (35.6% of `notes` table)
- **Effort:** Large

Approximately 79% of `notes` rows are system-generated notes. Unlike user-authored notes, system notes are mostly structured data rendered into full text (for example, "added 3 commits", "changed the description", "mentioned in !1234"). Instead of storing the full rendered text, we can store only the structured parameters needed to reconstruct the message on the fly (for example, action type, count, reference).

There are two possible approaches:

- **Decompose into a new table.** Move system notes into a dedicated `system_notes` table with structured columns for each action type. This reduces the effective size of both the original and new tables, improving query performance for each access pattern.
- **Add structured columns to the existing table.** Add columns for the structured parameters and clear the `note` and `note_html` text fields for system notes, avoiding the complexity of a table split while still reclaiming the storage.

#### Convert `position`, `original_position`, and `change_position` to a structured table

- **Estimated savings:** 200 GB (3.9% of `notes` table)
- **Effort:** Large

These three columns store YAML-serialized position data for `DiffNote` records. Currently, each field is a YAML string consuming approximately 520 bytes per row. Converting to a structured table (similar to the existing `DiffNotePosition` model and its `diff_note_positions` table) reduces storage to approximately 350 bytes per row.

Additionally, `position` holds the exact same data as `original_position` for approximately 2.28% of rows. This redundancy can be eliminated.

We investigated whether converting from YAML strings to `jsonb` would help, but YAML strings actually use less space than `jsonb` due to TOAST compression. The structured table approach provides the best savings.

#### Drop redundant `noteable_id`/`noteable_type`/`system` index

- **Estimated savings:** 63 GB (2.1% of `notes` table)
- **Effort:** Small

The composite index `index_notes_on_noteable_id_and_noteable_type_and_system` is 63 GB and has minimal usage. We need to evaluate whether queries can be served by other existing indexes before removal.

#### Drop or convert `index_notes_on_line_code` to partial

- **Estimated savings:** 34 GB (1.1% of `notes` table)
- **Effort:** Small

This index is 36 GB. Grafana metrics show it is seldom used (fewer than 0.8 scans per second, with occasional spikes). The only usage found is for `LegacyDiffNote`, which is a legacy implementation (new diff notes are of type `DiffNote`; mainly `import` still creates the legacy type). This index can be converted to a partial index or dropped after confirming that no active query paths depend on it.

#### Drop `st_diff` column and remove `LegacyDiffNote` type

- **Estimated savings:** 20 GB (0.7% of `notes` table)
- **Effort:** Medium

The `st_diff` field is only used by `LegacyDiffNote`. We can standardize the legacy notes type and remove this column, saving approximately 20 GB.

#### Convert `index_notes_on_organization_id` to partial

- **Estimated savings:** 19 GB (0.6% of `notes` table)
- **Effort:** Small

This index is 19 GB but has near-zero usage over the past 90 days. The column `organization_id` is almost entirely `NULL` (only 451 KB of actual data). There is no actual usage in application code; the index exists solely for the Cells/Organization sharding initiative. Converting to a partial index with `WHERE organization_id IS NOT NULL` shrinks the index from 19 GB to near zero. This requires confirmation from the Tenant Scale team.

#### Drop `confidential` column (migrated to `internal`)

- **Estimated savings:** ~0.1 GB
- **Effort:** Small

The `confidential` column is a duplicate of `internal` after the migration in [Rename confidential column in notes tables (#367923)](https://gitlab.com/gitlab-org/gitlab/-/issues/367923). Finalize the migration by dropping the column and the associated `index_notes_on_id_where_confidential` index (22 MB).
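The clear-and-regenerate pattern described above for `note_html` can be sketched with a minimal plain-Ruby stand-in. The `Note` class and `render_markdown` method below are hypothetical simplifications; the production implementation would go through the existing `CacheMarkdownField` module and the Banzai rendering pipeline:

```ruby
# Minimal sketch of the clear-and-regenerate cache pattern.
# Hypothetical stand-in: not the real Note model or renderer.
class Note
  attr_reader :note, :note_html

  def initialize(note:, note_html: nil)
    @note = note
    @note_html = note_html
  end

  # Worker side: a background job would NULL the cached column
  # for notes belonging to stale merge requests.
  def clear_html_cache!
    @note_html = nil
  end

  # Read side: regenerate from the source field on demand and
  # persist the result back, so the cost is paid at most once.
  def cached_html
    @note_html ||= render_markdown(note)
  end

  private

  # Placeholder for the real markdown pipeline.
  def render_markdown(text)
    "<p>#{text}</p>"
  end
end

note = Note.new(note: "LGTM", note_html: "<p>LGTM</p>")
note.clear_html_cache!  # worker clears the stale cache
puts note.cached_html   # reader regenerates: <p>LGTM</p>
```

The real staleness check would filter on the parent merge request's state and `updated_at`, and `cached_markdown_version` already gives `CacheMarkdownField` a hook for treating a cleared cache as needing regeneration.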
#### Total estimated savings for `notes`: ~1,700 GB (56%)

### `merge_requests` table (787 GB total: 551 GB columns + 254 GB indexes)

The `merge_requests` table total size is 804 GB (including ~123 GB of reclaimable bloat). The top columns by size are `description` and `description_html` (251 GB, 45.6%), `title` and `title_html` (41 GB, 7.45%), and `merge_params` (26 GB, 4.73%).

#### Clear `description_html` and `title_html` for stale merge requests

- **Estimated savings:** 150 GB (19.9% of `merge_requests` table)
- **Effort:** Large

`description_html` consumes 160 GB and `title_html` consumes 27 GB. Both are `CacheMarkdownField` caches, not sources of truth; only `title` and `description` are the source of truth. The same approach described for `note_html` above applies here.

#### Reclaim table bloat (`pg_repack`)

- **Estimated savings:** 123 GB (18.8% of `merge_requests` physical size)
- **Effort:** Small (requires DB team coordination)

Analysis shows a 151 GB difference between the physical table size (551 GB) and the actual column data (400 GB). This is attributed to table bloat (~123 GB of reclaimable dead tuples), row metadata (~11 GB), and alignment padding and page headers (~17 GB). The bloat is likely caused by the bigint migration and description updates. Running `pg_repack` or `VACUUM FULL` can reclaim this space, coordinated with the Database team for production execution.

#### Remove `merge_params` for merged merge requests

- **Estimated savings:** 25 GB (3.8% of `merge_requests` table)
- **Effort:** Small

`merge_params` contains highly repetitive data. For example, `force_remove_source_branch: '0'` is the default behavior for any MR, yet it is persisted for every row. After an MR is merged, `merge_params` is no longer needed by the application, and most of this data is also available in Gitaly if needed later. Discussion with the Code Review backend team on Slack confirmed there is no known usage of `merge_params` after the MR is merged.

We can run an async worker daily or weekly to clear `merge_params` for MRs merged more than 7 days ago, reducing the column from 26 GB to under 1 GB. The `merge_params` field could also be converted to `jsonb` if we choose not to decompose it into a separate table.

#### Convert SHA columns to `bytea`

- **Estimated savings:** 15 GB (1.9% of `merge_requests` table)
- **Effort:** Small

Three SHA fields are stored as `character varying`, which uses a hex-encoded text representation:

- `squash_commit_sha`: the varchar field takes 42 bytes, `bytea` takes 20 bytes. 65.7M rows x 22 bytes saved = 1.31 GB.
- `merge_commit_sha`: the varchar field takes 42 bytes, `bytea` takes 20 bytes. 272.3M rows x 22 bytes saved = 5.99 GB.
- `merged_commit_sha`: the varchar field takes 82 bytes (double-encoded), `bytea` takes 20 bytes. 121.4M rows x 62 bytes saved = 7.53 GB.

This standardizes SHA storage. The existing `in_progress_merge_commit_sha` column already uses `bytea`, so there is precedent in this table.

#### Convert `merge_status` from `varchar` to `smallint`

- **Estimated savings:** 3.5 GB (0.5% of `merge_requests` table)
- **Effort:** Small

`merge_status` is defined as `character varying(510)` but only stores 7 possible enum values (`unchecked`, `preparing`, `checking`, `can_be_merged`, `cannot_be_merged`, `cannot_be_merged_recheck`, `cannot_be_merged_rechecking`). Each row consumes approximately 11 bytes. Converting to `smallint` (2 bytes) with a Rails enum mapping saves approximately 10 bytes per row, with no change above the model layer.

#### Drop legacy `assignee_id` column

- **Estimated savings:** ~2.7 GB (column + index)
- **Effort:** Small

The `assignee_id` column has been replaced by the `merge_request_assignees` association. The column itself is only 43 MB, but the associated index `index_merge_requests_on_assignee_id` is 2.62 GB. Both can be dropped after confirming the deprecation is complete.
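The per-row arithmetic behind the SHA conversions above can be checked in plain Ruby: a 40-character hex SHA-1 packs losslessly into 20 raw bytes. This sketch counts only the payload and ignores PostgreSQL's small per-value header, which applies to both representations:

```ruby
# A hex-encoded SHA-1 as stored today in a `character varying` column.
sha_hex = "2fd4e1c67a2d28fced849ee1bb76e7391b93eb12"

# As text, PostgreSQL stores all 40 hex characters.
text_bytes = sha_hex.bytesize

# As bytea, the same digest is raw binary: two hex characters
# per byte, so half the size.
sha_binary = [sha_hex].pack("H*")
binary_bytes = sha_binary.bytesize

puts text_bytes    # => 40
puts binary_bytes  # => 20

# The conversion is lossless: unpacking restores the hex form.
puts sha_binary.unpack1("H*") == sha_hex # => true
```

With the document's on-disk accounting (42 bytes for the varchar form versus 20 for `bytea`), each converted row saves roughly 22 bytes; `merged_commit_sha` saves more because its current double-encoded form occupies 82 bytes.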
#### Total estimated savings for `merge_requests`: ~311.5 GB (47.6%)

### `merge_request_diffs` table (389 GB total: 203 GB columns + 186 GB indexes)

#### Drop `external_diff` column and index

- **Estimated savings:** 52 GB (13.4% of `merge_request_diffs` table)
- **Effort:** Small

The `external_diff` column is no longer populated; it was handled on the CarrierWave side. The column and its associated index `index_merge_request_diffs_on_external_diff` (14 GB) can be removed, saving approximately 52 GB in total.

#### Convert 3 SHA columns to `bytea`

- **Estimated savings:** 78 GB (20.1% of `merge_request_diffs` table)
- **Effort:** Small

`base_commit_sha`, `start_commit_sha`, and `head_commit_sha` can be converted from `character varying` to `bytea`, following the same approach as the `merge_requests` SHA columns. Each index on these columns also shrinks by approximately one-third.

#### Convert `real_size`, `state`, `external_diff_store`, and `commits_count` to smaller integer types

- **Estimated savings:** 10 GB (2.6% of `merge_request_diffs` table)
- **Effort:** Small

These columns currently use larger integer types than necessary. Converting to 1-byte or 2-byte integers where the value range permits saves approximately 10 GB.

#### Total estimated savings for `merge_request_diffs`: ~140 GB (36%)

### `events` table (2,371 GB)

The `events` table has a clean schema with limited optimization potential at the column or index level. The table definition is well designed, and the content is well structured in terms of what events we store. The primary savings opportunities come from data lifecycle management.

#### Drop `updated_at` column

- **Estimated savings:** 34 GB (1.4% of `events` table)
- **Effort:** Small

Events are append-only and immutable. Analysis shows only 0.02% of rows have different `created_at` and `updated_at` values, and most of those differ by only nanoseconds or milliseconds. Deeper investigation by Abdul Wadood confirmed that rows where `updated_at` and `created_at` differ by more than 10 seconds have not occurred in the last year (the last such rows are from 2024).

There is no index on `updated_at`, which suggests it is not actively used for queries. However, as Shane Maglangit noted, the absence of an index does not definitively prove the column is unused (for example, `namespaces.updated_at` is heavily used but has no index). We should double-check application code before acting. If needed for backward compatibility, we can alias `updated_at` to `created_at` in Rails.

#### Merge `project_id`, `group_id`, and `personal_namespace_id` into `namespace_id`

- **Estimated savings:** 50 GB (2.1% of `events` table)
- **Effort:** Large

The `events` table stores three separate columns for the owning entity. These could potentially be merged into a single `namespace_id` column. However, merging these columns would require joining with the `namespaces` table to find events (since `projects` is a different table from `namespaces`), which would slow down already slow event queries. This needs careful benchmarking before proceeding.

#### 90-day retention policy

- **Estimated savings:** 1,800 GB (75.9% of `events` table)
- **Effort:** Large

This is the single largest opportunity across all tables. Both GitHub and Azure DevOps offer 90-day event retention, and a similar policy would dramatically reduce the `events` table size. Christina Lohr (`@lohrc`) has [recommended a 90-day retention period](https://gitlab.com/gitlab-org/gitlab/-/issues/571288), based on the fact that `events` is essentially a duplicate of data that can be found in, or reconstructed from, other tables.

This strategy requires:

1. Product input on acceptable retention periods.
1. Evaluation of whether historical events can be reconstructed from other data sources. Notably, `push_event_payloads` is a highly entangled table that may not be reconstructable, and we need to consider whether we can delete events for open issues and MRs.
1. Time-based partitioning of the `events` table to enable efficient partition dropping rather than row-by-row deletion (the time-decay data pattern describes this approach in detail).
1. Archiving events older than 90 days to object storage or a data warehouse for compliance and analytics use cases.

One additional idea: identify bot and automation actions and treat them differently (store fewer fields, or skip saving them to the database entirely). This could significantly reduce the table's growth rate, especially as AI tooling increases record creation. This needs Product input first.

#### Table partitioning

- **Estimated savings:** 0 GB direct (enables retention and improves maintenance)
- **Effort:** Medium

Even without a retention policy, partitioning the `events` table by `created_at` improves query performance, vacuum efficiency, and maintenance operations. It is also a prerequisite for implementing an efficient retention policy. Kerri Miller noted that a blend of approaches will serve best in the long term, and that partitioning might be appropriate even with a short retention period, as we can use time-based partitions and drop the oldest one every 90 days.

#### Total estimated savings for `events`: ~1,884 GB (if retention is adopted)

## Design and implementation details

TBD

## Alternative Solutions

TBD
content/handbook/engineering/architecture/design-documents/code_review_database_size_reduction/_index.md 0 → 100644 +529 −0 Original line number Diff line number Diff line --- title: "Code Review database size reduction for GitLab.com" status: proposed creation-date: "2026-04-15" authors: [ "@zhaochen.li" ] coaches: [ ] dris: [ "@francoisrose", "@phikai", "@patrickbajao", "@dskim_gitlab" ] owning-stage: "~devops::create" participating-stages: [] toc_hide: true --- {{< engineering/design-document-header >}} ## Summary The Code Review group owns several of the largest tables on GitLab.com, including `notes` (3,054 GB), `events` (2,371 GB), `merge_requests` (787 GB), and `merge_request_diffs` (389 GB). Together these tables consume over 6,600 GB of primary database storage and continue to grow, particularly as AI tooling accelerates record creation. This document proposes approaches to reduce the on-disk size of these tables by approximately 50% through a combination of strategies: clearing cached HTML fields for stale records, converting column types to more compact representations, removing redundant columns and indexes, decomposing position data into structured tables, enforcing retention policies, and reclaiming table bloat. The strategies are prioritized by estimated savings, implementation effort, and risk. Related: [Blueprint for Code Review database size reduction (&20233)](https://gitlab.com/groups/gitlab-org/-/epics/20233), [Code Review database size reduction (#17571)](https://gitlab.com/groups/gitlab-org/-/work_items/17571), [Initial spike for database size reduction blueprint (#586185)](https://gitlab.com/gitlab-org/gitlab/-/issues/586185). ## Motivation Large tables on GitLab.com are a major problem for both operations and development. As tables grow beyond hundreds of gigabytes, several problems compound: 1. **Query performance suffers.** Larger tables increase index sizes, slow down sequential scans, and reduce buffer cache hit rates. 1. 
**Table maintenance becomes expensive.** `VACUUM`, `ANALYZE`, and index rebuilds take longer and hold locks that affect application availability. 1. **Infrastructure costs increase.** Storage, I/O, replication lag, and backup times all scale with on-disk size. 1. **Data migrations become complex.** Schema changes on large tables require significantly more effort to implement and are more likely to cause stability problems on GitLab.com. For example, swapping bigint columns with integer columns on `merge_requests` had to be split into 3 stages and took several months to complete ([#507695](https://gitlab.com/gitlab-org/gitlab/-/work_items/507695)). 1. **Operational risk grows.** Failovers and disaster recovery become slower and more fragile as data volume increases. The [Database Scalability blueprint](/handbook/engineering/architecture/design-documents/database_size_limits/) (June 2021) established a target of keeping individual physical tables under 100 GB on GitLab.com. Nearly five years later, multiple Code Review tables on GitLab.com still exceed this threshold by 7x to 30x and have grown significantly since the original analysis. Without intervention, these tables will continue to grow as GitLab.com usage increases. The tables over 100 GB on GitLab.com owned by or closely related to Code Review, as of January 2026, are: | Table | Size | |---|---| | `merge_request_diff_commits` | 7,875 GB | | `merge_request_diff_files` | 3,290 GB | | `notes` | 3,156 GB | | `events` | 2,371 GB | | `merge_requests` | 787 GB | | `merge_request_diffs` | 451 GB | | `note_diff_files` | 170 GB | | `approval_merge_request_rules_users` | 160 GB | | `merge_request_metrics` | 140 GB | This document focuses on the remaining large tables after excluding the items listed in [Non-Goals](#non-goals) below: `notes`, `events`, `merge_requests`, and `merge_request_diffs`. ### Goals 1. 
Reduce the combined on-disk size of `notes`, `merge_requests`, `merge_request_diffs`, and `events` tables by approximately 50%. 1. Achieve the largest savings with the lowest-risk changes first (quick wins), then progress to larger structural changes. 1. Maintain backward compatibility with existing application behavior with no user-facing feature regressions. 1. Establish repeatable patterns (for example, HTML cache clearing, retention policies) that other groups can adopt for their own large tables such as `issues` and `work_items`. 1. Update application and model code when needed to support larger structural changes (for example, table decomposition, column type conversions). 1. Deliver changes incrementally across multiple milestones, with each change independently valuable. ### Non-Goals This document does not cover the two largest Code Review tables, `merge_request_diff_commits` and `merge_request_diff_files`. Those tables are already being addressed by separate epics: - [Reduce the growth and size of the merge_request_diff_commits (&16385)](https://gitlab.com/groups/gitlab-org/-/epics/16385) - [Partition and reduce size of merge_request_diff_files (&11272)](https://gitlab.com/groups/gitlab-org/-/epics/11272) See [Out-of-scope opportunities](#out-of-scope-opportunities) below for other opportunities identified during the investigation that are not initially in scope. ## Proposal The table below summarizes all in-scope opportunities in the order we intend to pursue them, starting with the largest savings. We will iterate through these incrementally, prioritizing by effort-to-impact ratio so small-effort "quick wins" can be delivered in parallel with larger structural changes. A concrete priority order will be proposed in follow-up MRs. Detailed analysis for each opportunity follows in the per-table sections. 
| Opportunity | Table | Effort | Savings | |---|---|---|---| | Clear `note_html` for stale MRs | `notes` | Large | 1,000 GB | | Decompose system notes | `notes` | Large | 800 GB | | Convert position columns to structured table | `notes` | Large | 200 GB | | Clear `description_html` and `title_html` for stale MRs | `merge_requests` | Large | 150 GB | | Retention policy on `merge_request_diffs` | `merge_request_diffs`, `merge_request_diff_commits`, `merge_request_diff_files` | Large | TBD (expected large) | | Reclaim bloat (`pg_repack`) | `merge_requests` | Small | 123 GB | | Convert SHA columns to `bytea` | `merge_request_diffs` | Small | 78 GB | | Drop redundant noteable index | `notes` | Small | 63 GB | | Drop `external_diff` column and index | `merge_request_diffs` | Small | 52 GB | | Drop `updated_at` column | `events` | Small | 34 GB | | Drop/convert `index_notes_on_line_code` | `notes` | Small | 34 GB | | Remove `merge_params` for merged MRs | `merge_requests` | Small | 25 GB | | Convert `index_notes_on_organization_id` to partial | `notes` | Small | 19 GB | | Convert SHA columns to `bytea` | `merge_requests` | Small | 15 GB | | Convert integer columns to smaller types | `merge_request_diffs` | Small | 10 GB | | Convert `merge_status` to `smallint` | `merge_requests` | Small | 3.5 GB | | Drop `assignee_id` column and index | `merge_requests` | Small | ~2.7 GB | | **Total** | | | **~2,604 GB** | **Retention policy on `merge_request_diffs`.** Discussed in [issue #594843 (comment)](https://gitlab.com/gitlab-org/gitlab/-/issues/594843#note_3194219248). We expect savings to be large because a retention policy on `merge_request_diffs` would also reduce `merge_request_diff_commits` and `merge_request_diff_files`, but this overlaps with the separate epics already addressing those tables ([epic &16385](https://gitlab.com/groups/gitlab-org/-/epics/16385) and [epic &11272](https://gitlab.com/groups/gitlab-org/-/epics/11272)) and needs to be coordinated there. 
A savings estimate should be produced as part of that coordination.

### Out-of-scope opportunities

The following opportunities were identified during the investigation but are not in scope for this design document. Each is documented here for visibility and future follow-up:

| Opportunity | Table | Effort | Savings |
|---|---|---|---|
| 90-day retention policy | `events` | Large | 1,800 GB |
| Partition `events` table | `events` | Medium | 0 GB (enabler) |
| Merge namespace columns | `events` | Large | 50 GB |
| Drop `st_diff` column | `notes` | Medium | 20 GB |
| Drop `confidential` column | `notes` | Small | ~0.1 GB |

These are out of scope for the following reasons:

- **`events` table changes (90-day retention, partitioning, and merging namespace columns).** These approaches are not yet mature enough to commit to in this document. Code Review co-owns the `events` table because it contains MR-related data, but many of the features that rely on this data are owned by other groups, so changes here require cross-team collaboration and further POC work to validate feasibility, impact, and retention semantics. The [retention policy proposal (#571288)](https://gitlab.com/gitlab-org/gitlab/-/issues/571288) is the current starting point for that discussion.
- **Drop `st_diff` column.** Requires removing `LegacyDiffNote` handling across the application, which has a broader scope than a column-level change and is better tracked as a separate deprecation.
- **Drop `confidential` column.** Savings are negligible (~0.1 GB) and do not justify prioritizing this over larger opportunities. It can be finalized opportunistically alongside other `notes` changes.

The following sections provide detailed analysis for each opportunity, organized by table. Each includes context from the [initial spike investigation](https://gitlab.com/gitlab-org/gitlab/-/issues/586185).
### `notes` table (3,054 GB total: 2,246 GB columns + 808 GB indexes)

The `notes` table is the third largest table on GitLab.com. The top six columns by size are `note_html`, `note`, `original_position`, `position`, `discussion_id`, and `change_position`.

#### Clear `note_html` for stale merge requests

- **Estimated savings:** 1,000 GB (44.5% of `notes` table)
- **Effort:** Large

`note_html` is a cached rendered version of the `note` field, generated by `CacheMarkdownField`. It consumes 1,169 GB (52% of the table). For notes on merge requests that have not been accessed recently, this cached value can be cleared and regenerated on demand.

The approach:

1. Define "stale" criteria: for merged MRs, for example, `updated_at` older than 3 months; for open MRs, `updated_at` older than 6 months.
1. Run an async worker to set `note_html` to `NULL` for notes belonging to stale MRs.
1. On read, if `note_html` is `NULL`, regenerate it from `note` and persist it back to the database. The existing `CacheMarkdownField` module already supports this pattern through `cached_markdown_version`.
1. Benchmark the performance impact of on-the-fly regeneration for API endpoints that return many notes (for example, the merge request discussions API). Past markdown cache version bumps (for example, `492f0853`, `e7a98807`) essentially triggered the same regeneration, which suggests the performance impact is manageable.

This pattern can later be applied to `description_html` and `title_html` on the `merge_requests`, `issues`, and `work_items` tables. Nicolas Dular from the Plan group has expressed support for this approach and noted it could also benefit the `issues` table.

#### Decompose system notes

- **Estimated savings:** 800 GB (35.6% of `notes` table)
- **Effort:** Large

Approximately 79% of `notes` rows are system-generated notes.
Unlike user-authored notes, system notes are mostly structured data rendered into full text (for example, "added 3 commits", "changed the description", "mentioned in !1234"). Instead of storing the full rendered text, we can store only the structured parameters needed to reconstruct the message on the fly (for example, action type, count, and reference).

There are two possible approaches:

- **Decompose into a new table.** Move system notes into a dedicated `system_notes` table with structured columns for each action type. This reduces the effective size of both the original and new tables, improving query performance for each access pattern.
- **Add structured columns to the existing table.** Add columns for the structured parameters and clear the `note` and `note_html` text fields for system notes, avoiding the complexity of a table split while still reclaiming the storage.

#### Convert `position`, `original_position`, and `change_position` to a structured table

- **Estimated savings:** 200 GB (3.9% of `notes` table)
- **Effort:** Large

These three columns store YAML-serialized position data for `DiffNote` records. Each field is currently a YAML string consuming approximately 520 bytes per row. Converting to a structured table (similar to the existing `DiffNotePosition` model and its `diff_note_positions` table) reduces storage to approximately 350 bytes per row.

Additionally, `position` holds exactly the same data as `original_position` for approximately 2.28% of rows. This redundancy can be eliminated.

We investigated whether converting from YAML strings to `jsonb` would help, but YAML strings actually use less space than `jsonb` due to TOAST compression. The structured table approach provides the best savings.

#### Drop redundant `noteable_id`/`noteable_type`/`system` index

- **Estimated savings:** 63 GB (2.1% of `notes` table)
- **Effort:** Small

The composite index `index_notes_on_noteable_id_and_noteable_type_and_system` is 63 GB and has minimal usage.
We need to evaluate whether these queries can be served by other existing indexes before removal.

#### Drop or convert `index_notes_on_line_code` to partial

- **Estimated savings:** 34 GB (1.1% of `notes` table)
- **Effort:** Small

This index is 36 GB. Grafana metrics show it is seldom used (fewer than 0.8 scans per second, with occasional spikes). The only usage found is for `LegacyDiffNote`, which is a legacy implementation (new diff notes are of type `DiffNote`; mainly `import` still creates the legacy type). This index can be converted to a partial index or dropped after confirming no active query paths depend on it.

#### Drop `st_diff` column and remove `LegacyDiffNote` type

- **Estimated savings:** 20 GB (0.7% of `notes` table)
- **Effort:** Medium

The `st_diff` field is only used by `LegacyDiffNote`. We can standardize on the modern note type and remove this column, saving approximately 20 GB.

#### Convert `index_notes_on_organization_id` to partial

- **Estimated savings:** 19 GB (0.6% of `notes` table)
- **Effort:** Small

This index is 19 GB but has had near-zero usage over the past 90 days. The `organization_id` column is almost entirely `NULL` (only 451 KB of actual data). There is no actual usage in application code; the index exists solely for the Cells/Organization sharding initiative. Converting to a partial index with `WHERE organization_id IS NOT NULL` shrinks the index from 19 GB to near zero. This requires confirmation from the Tenant Scale team.

#### Drop `confidential` column (migrated to `internal`)

- **Estimated savings:** ~0.1 GB
- **Effort:** Small

The `confidential` column is a duplicate of `internal` after the migration in [Rename confidential column in notes tables (#367923)](https://gitlab.com/gitlab-org/gitlab/-/issues/367923). Finalize the migration by dropping the column and the associated `index_notes_on_id_where_confidential` index (22 MB).
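The clear-and-regenerate-on-read pattern proposed earlier for `note_html` can be sketched in plain Ruby. This is a minimal, self-contained illustration: the class and method names (`StaleNote`, `clear_html_cache!`, `render_markdown`) are hypothetical stand-ins, not the real `CacheMarkdownField` implementation, and persistence is simulated in memory.

```ruby
# Minimal sketch of the clear-and-regenerate-on-read pattern proposed for
# note_html. Names (StaleNote, render_markdown) are illustrative only.
class StaleNote
  attr_reader :note        # source of truth (raw Markdown)
  attr_accessor :note_html # cached rendered HTML, may be NULL/nil

  def initialize(note:, note_html: nil)
    @note = note
    @note_html = note_html
  end

  # Async-worker side: clear the cached HTML for a stale record,
  # reclaiming its storage.
  def clear_html_cache!
    @note_html = nil
  end

  # Read side: regenerate from the source of truth when the cache is
  # empty, then keep the result (the real code would persist it back).
  def html
    @note_html ||= render_markdown(note)
  end

  private

  # Stand-in for the real Markdown renderer.
  def render_markdown(text)
    "<p>#{text}</p>"
  end
end

note = StaleNote.new(note: "LGTM", note_html: "<p>LGTM</p>")
note.clear_html_cache!  # worker clears the cache for a stale MR
note.note_html          # => nil; storage is reclaimed
note.html               # => "<p>LGTM</p>"; regenerated on demand
```

The key property is that clearing the cache is always safe: any subsequent read falls back to the source-of-truth `note` column and repopulates the cache, at the cost of one render.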
#### Total estimated savings for `notes`: ~1,700 GB (56%)

### `merge_requests` table (804 GB total: 551 GB columns + 254 GB indexes)

The `merge_requests` table's total size is 804 GB (including ~123 GB of reclaimable bloat). The top columns by size are `description` and `description_html` (251 GB, 45.6%), `title` and `title_html` (41 GB, 7.45%), and `merge_params` (26 GB, 4.73%).

#### Clear `description_html` and `title_html` for stale merge requests

- **Estimated savings:** 150 GB (19.9% of `merge_requests` table)
- **Effort:** Large

`description_html` consumes 160 GB and `title_html` consumes 27 GB. Both are `CacheMarkdownField` caches rather than sources of truth; only `title` and `description` are authoritative. The same approach described for `note_html` above applies here.

#### Reclaim table bloat (`pg_repack`)

- **Estimated savings:** 123 GB (18.8% of `merge_requests` physical size)
- **Effort:** Small (requires DB team coordination)

Analysis shows a 151 GB difference between the physical table size (551 GB) and the actual column data (400 GB). This is attributed to table bloat (~123 GB of reclaimable dead tuples), row metadata (~11 GB), and alignment padding and page headers (~17 GB). The bloat was likely caused by the bigint migration and description updates. Running `pg_repack` or `VACUUM FULL` can reclaim this space; production execution should be coordinated with the Database team.

#### Remove `merge_params` for merged merge requests

- **Estimated savings:** 25 GB (3.8% of `merge_requests` table)
- **Effort:** Small

`merge_params` contains highly repetitive data. For example, `force_remove_source_branch: '0'` is the default behavior for any MR, yet it is persisted for every row. After an MR is merged, `merge_params` is no longer needed by the application, and most of this data is also available in Gitaly if needed later. Discussion with the Code Review backend team on Slack confirmed there is no known usage of `merge_params` after the MR is merged.
We can run an async worker daily or weekly to clear `merge_params` for MRs merged more than 7 days ago, reducing the column from 26 GB to under 1 GB. The `merge_params` field could also be converted to `jsonb` if we choose not to decompose it into a separate table.

#### Convert SHA columns to `bytea`

- **Estimated savings:** 15 GB (1.9% of `merge_requests` table)
- **Effort:** Small

Three SHA fields are stored as `character varying`, which uses a hex-encoded text representation:

- `squash_commit_sha`: the varchar field takes 42 bytes; `bytea` takes 20 bytes. 65.7M rows x 22 bytes saved = 1.45 GB.
- `merge_commit_sha`: the varchar field takes 42 bytes; `bytea` takes 20 bytes. 272.3M rows x 22 bytes saved = 5.99 GB.
- `merged_commit_sha`: the varchar field takes 82 bytes (double-encoded); `bytea` takes 20 bytes. 121.4M rows x 62 bytes saved = 7.53 GB.

This standardizes SHA storage. The existing `in_progress_merge_commit_sha` column already uses `bytea`, so there is precedent in this table.

#### Convert `merge_status` from `varchar` to `smallint`

- **Estimated savings:** 3.5 GB (0.5% of `merge_requests` table)
- **Effort:** Small

`merge_status` is defined as `character varying(510)` but only stores 7 possible enum values (`unchecked`, `preparing`, `checking`, `can_be_merged`, `cannot_be_merged`, `cannot_be_merged_recheck`, `cannot_be_merged_rechecking`). Each row consumes approximately 11 bytes. Converting to `smallint` (2 bytes) with a Rails enum mapping saves approximately 10 bytes per row, with no change needed above the model layer.

#### Drop legacy `assignee_id` column

- **Estimated savings:** ~2.7 GB (column + index)
- **Effort:** Small

The `assignee_id` column has been replaced by the `merge_request_assignees` association. The column itself is only 43 MB, but the associated `index_merge_requests_on_assignee_id` index is 2.62 GB. Both can be dropped after confirming the deprecation is complete.
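The byte savings behind the SHA-to-`bytea` conversions above come from storing the 20 raw bytes of a SHA-1 digest instead of its 40-character hex encoding. A quick Ruby illustration (using only the standard `Array#pack` and `String#unpack1`; the sizes shown exclude PostgreSQL's small per-value header, which is why the document's per-row figures are slightly higher):

```ruby
# A SHA-1 rendered as hex is 40 characters; the same digest stored as raw
# bytes (as in a bytea column) is 20 bytes -- roughly half the payload.
hex_sha = "6104942438c14ec7bd21c6cd5bd995272b3faff6"

binary     = [hex_sha].pack("H*")   # hex text -> 20 raw bytes
round_trip = binary.unpack1("H*")   # raw bytes -> hex text, losslessly

puts hex_sha.bytesize        # => 40
puts binary.bytesize         # => 20
puts round_trip == hex_sha   # => true
```

Because the conversion is lossless in both directions, application code can keep presenting hex strings while the column stores the compact binary form.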
#### Total estimated savings for `merge_requests`: ~311.5 GB (47.6%)

### `merge_request_diffs` table (389 GB total: 203 GB columns + 186 GB indexes)

#### Drop `external_diff` column and index

- **Estimated savings:** 52 GB (13.4% of `merge_request_diffs` table)
- **Effort:** Small

The `external_diff` column is no longer populated; it was previously computed on the CarrierWave side. The column and its associated index `index_merge_request_diffs_on_external_diff` (14 GB) can be removed, saving approximately 52 GB in total.

#### Convert 3 SHA columns to `bytea`

- **Estimated savings:** 78 GB (20.1% of `merge_request_diffs` table)
- **Effort:** Small

`base_commit_sha`, `start_commit_sha`, and `head_commit_sha` can be converted from `character varying` to `bytea`, following the same approach as the `merge_requests` SHA columns. Each index on these columns also shrinks by approximately one third.

#### Convert `real_size`, `state`, `external_diff_store`, and `commits_count` to smaller integer types

- **Estimated savings:** 10 GB (2.6% of `merge_request_diffs` table)
- **Effort:** Small

These columns currently use larger integer types than necessary. Converting to 1-byte or 2-byte integers where the value range permits saves approximately 10 GB.

#### Total estimated savings for `merge_request_diffs`: ~140 GB (36%)

### `events` table (2,371 GB)

The `events` table has a clean schema with limited optimization potential at the column or index level: the table definition is well-designed, and the content is well-structured in terms of which events we store. The primary savings opportunities come from data lifecycle management.

#### Drop `updated_at` column

- **Estimated savings:** 34 GB (1.4% of `events` table)
- **Effort:** Small

Events are append-only and immutable. Analysis shows that only 0.02% of rows have different `created_at` and `updated_at` values, and most of those differ by only nanoseconds or milliseconds.
Deeper investigation by Abdul Wadood confirmed that rows where `updated_at` and `created_at` differ by more than 10 seconds have not occurred in the last year (the last such rows are from 2024). There is no index on `updated_at`, which suggests it is not actively used for queries. However, as Shane Maglangit noted, the absence of an index does not definitively prove the column is unused (for example, `namespaces.updated_at` is heavily used but has no index). We should double-check application code before acting. If needed for backward compatibility, we can alias `updated_at` to `created_at` in Rails.

#### Merge `project_id`, `group_id`, and `personal_namespace_id` into `namespace_id`

- **Estimated savings:** 50 GB (2.1% of `events` table)
- **Effort:** Large

The `events` table stores three separate columns for the owning entity. These could potentially be merged into a single `namespace_id` column. However, merging them would require joining with the `namespaces` table to find events (since `projects` is a different table from `namespaces`), which would slow down already slow event queries. This needs careful benchmarking before proceeding.

#### 90-day retention policy

- **Estimated savings:** 1,800 GB (75.9% of `events` table)
- **Effort:** Large

This is the single largest opportunity across all tables. Both GitHub and Azure DevOps offer 90-day event retention; a similar policy would dramatically reduce the `events` table size. Christina Lohr (`@lohrc`) has [recommended a 90-day retention period](https://gitlab.com/gitlab-org/gitlab/-/issues/571288), based on the fact that `events` is essentially a duplicate of data that can be found in, or reconstructed from, other tables.

This strategy requires:

1. Product input on acceptable retention periods.
1. Evaluation of whether historical events can be reconstructed from other data sources.
Notably, `push_event_payloads` is a highly entangled table that may not be reconstructable, and we need to consider whether we can delete events for open issues and MRs.
1. Time-based partitioning of the `events` table, to enable efficient partition dropping rather than row-by-row deletion (the time-decay data pattern describes this approach in detail).
1. Archiving events older than 90 days to object storage or a data warehouse for compliance and analytics use cases.

One additional idea: identify bot and automation actions and treat them differently (store fewer fields, or skip saving to the database entirely). This could significantly reduce the table's growth rate, especially as AI tooling increases record creation. This needs Product input first.

#### Table partitioning

- **Estimated savings:** 0 GB direct (enables retention and improves maintenance)
- **Effort:** Medium

Even without a retention policy, partitioning the `events` table by `created_at` improves query performance, vacuum efficiency, and maintenance operations. It is also a prerequisite for implementing an efficient retention policy. Kerri Miller noted that a blend of approaches will serve best in the long term, and that partitioning might be appropriate even with a short retention period, as we can use time-based partitions and drop the oldest one every 90 days.

#### Total estimated savings for `events`: ~1,884 GB (if retention is adopted)

## Design and implementation details

TBD

## Alternative Solutions

TBD