
Solution for large table maintenance


Description

The group::authentication team is currently working on table size maintenance for 3 large, growing tables:

The oauth_access_tokens and oauth_access_grants tables have a large number of "revoked" rows that can be safely deleted. The plan is to add a cron job that periodically deletes the unneeded rows (first moving them to a separate archival table, then deleting them from the archival table after a specific time period). Just deleting the rows will not reclaim the space, though; it will only prevent further growth. These tables also don't lend themselves well to partitioning, since they are mostly queried by individual text columns (token and refresh_token).
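For illustration, the move-then-prune step for one sub-batch could look like the following (a sketch only: it assumes the archive table mirrors the source table's columns, as in the plan below; revoked_at is the Doorkeeper column marking revocation, and the interval and batch size are illustrative):

```ruby
# Sketch: atomically move one batch of long-revoked tokens into the archive
# table. A separate job would later prune the archive itself.
ApplicationRecord.connection.execute(<<~SQL)
  WITH batch AS (
    SELECT id
    FROM oauth_access_tokens
    WHERE revoked_at IS NOT NULL
      AND revoked_at < NOW() - INTERVAL '1 month'
    ORDER BY id
    LIMIT 1000
    FOR UPDATE SKIP LOCKED
  ), moved AS (
    DELETE FROM oauth_access_tokens t
    USING batch
    WHERE t.id = batch.id
    RETURNING t.*
  )
  INSERT INTO oauth_access_tokens_archived_records
  SELECT * FROM moved
SQL
```

FOR UPDATE SKIP LOCKED keeps concurrent runs from contending on the same rows, and doing the whole move in a single data-modifying CTE keeps it atomic.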

Following is the plan to handle this cleanup, based on our discussion so far.

Note: parts of this plan (Phases 2 - 4) may change depending on the pg_repack implementation.

Policies

  • authentication_events: 1 year's worth of events
  • OAuth access grants: active grants and 1 month of revoked grants
  • OAuth access tokens: active access tokens, active refresh tokens, and 1 month of revoked tokens

Expected Results

  • 3 temporary archive tables (new), persisted for 2 or 3 milestones for soft-deletion purposes
    • authentication_events_archived_records
    • oauth_access_grants_archived_records
    • oauth_access_tokens_archived_records
  • 3 cleanup workers (new)
  • 3 feature flags (new) for cleanup workers:
    • archive_revoked_authentication_events
    • archive_revoked_access_grants
    • archive_revoked_access_tokens
  • 3 temporary clean tables (new), if pg_repack is not viable:
    • authentication_events_clean
    • oauth_access_grants_clean
    • oauth_access_tokens_clean
  • 3 original tables reclaimed (either via pg_repack or table swap):
    • authentication_events
    • oauth_access_grants
    • oauth_access_tokens
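For illustration, each archive table could be created as a plain mirror of its source table (a hedged sketch: Gitlab::Database::Migration is GitLab's migration base class, but the version number and the INCLUDING ALL choice are illustrative):

```ruby
# Sketch: archive table that mirrors the original, so rows can be moved with
# a plain INSERT ... SELECT. INCLUDING ALL also copies defaults and indexes.
class CreateOauthAccessTokensArchivedRecords < Gitlab::Database::Migration[2.2]
  def up
    execute(<<~SQL)
      CREATE TABLE oauth_access_tokens_archived_records (
        LIKE oauth_access_tokens INCLUDING ALL
      )
    SQL
  end

  def down
    drop_table :oauth_access_tokens_archived_records
  end
end
```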

Phase 1: Archive Infrastructure & Initial Cleanup (18.4 - 18.5)

Scope

Create archive infrastructure and begin actual cleanup on GitLab.com

Actions

  • authentication_events (beginning of 18.4):
    • Create authentication_events_archived_records (unpartitioned), for soft-deletion purposes
    • Implement the archival worker with a batch size of 10,000 and a sub-batch size of 1,000 (see the sketch after this list)
    • Enable feature flag on GitLab.com ONLY: archive_revoked_authentication_events: true (GitLab.com), false (self-managed)
    • Begin moving revoked records to archive
    • Monitor cleanup performance and impact
    • Addressed with #545007
  • OAuth tables (middle of 18.4):
    • Same process for oauth_access_grants and oauth_access_tokens
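A sketch of what the archival workers could look like, assuming GitLab's usual Sidekiq conventions (ApplicationWorker and Feature are existing GitLab facilities; the class name and SQL are illustrative, and cron scheduling is configured separately):

```ruby
# Illustrative archival worker: moves up to BATCH_SIZE expired rows per run,
# in sub-batches, and is a no-op while its feature flag is disabled.
class ArchiveRevokedAuthenticationEventsWorker
  include ApplicationWorker # GitLab's Sidekiq worker concern

  BATCH_SIZE = 10_000     # maximum rows archived per run
  SUB_BATCH_SIZE = 1_000  # rows moved per statement

  def perform
    # The feature flag is the kill switch: disabled means do nothing.
    return unless Feature.enabled?(:archive_revoked_authentication_events)

    archived = 0

    while archived < BATCH_SIZE
      moved = archive_sub_batch
      break if moved.zero?

      archived += moved
    end
  end

  private

  # Moves up to SUB_BATCH_SIZE rows older than the retention policy into the
  # archive table; returns the number of rows moved.
  def archive_sub_batch
    ApplicationRecord.connection.execute(<<~SQL).cmd_tuples
      WITH batch AS (
        SELECT id FROM authentication_events
        WHERE created_at < NOW() - INTERVAL '1 year'
        ORDER BY id
        LIMIT #{SUB_BATCH_SIZE}
        FOR UPDATE SKIP LOCKED
      ), moved AS (
        DELETE FROM authentication_events t
        USING batch
        WHERE t.id = batch.id
        RETURNING t.*
      )
      INSERT INTO authentication_events_archived_records
      SELECT * FROM moved
    SQL
  end
end
```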

When confident, end of 18.4/18.5:

  • Enable archival on self-managed:
    • Feature flags globally enabled
      • archive_revoked_authentication_events
      • archive_revoked_access_grants
      • archive_revoked_access_tokens
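On a single instance, enabling all three flags boils down to the Feature API (illustrative only; the GitLab.com rollout would go through the standard feature-flag process, and the self-managed default is flipped in the flag definition files):

```ruby
# Illustrative only: enable the three cleanup flags on one instance.
%i[
  archive_revoked_authentication_events
  archive_revoked_access_grants
  archive_revoked_access_tokens
].each { |flag| Feature.enable(flag) }
```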
Additionally, if pg_repack is viable:

  • Work with the Database Operations team to schedule pg_repack operations
  • Test pg_repack on smaller tables first
  • Prepare monitoring and rollback plans

Consideration

  • GitLab.com gets immediate benefit: cleanup starts reducing table growth
  • Self-managed protection: feature flags not enabled globally until proven stable
  • Team monitoring: team can monitor cleanup runs
  • Decision point: Evaluate pg_repack progress by end of 18.4 to determine Phase 2 approach

Open questions

How to set up monitoring:

  • Prometheus metrics for records archived and job duration
  • Grafana dashboard for table size trends
  • Error tracking in Sentry
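A possible shape for the metrics side, assuming GitLab's Gitlab::Metrics interface (the metric names and the archive_batches helper are made up for illustration; worker exceptions would reach Sentry through the standard Sidekiq error tracking):

```ruby
# Sketch: counters/histograms the archival worker could emit.
ARCHIVED_RECORDS_TOTAL = Gitlab::Metrics.counter(
  :archive_revoked_records_total,
  'Number of revoked records moved to archive tables'
)

JOB_DURATION_SECONDS = Gitlab::Metrics.histogram(
  :archive_revoked_job_duration_seconds,
  'Duration of one archival worker run'
)

def perform
  started = ::Gitlab::Metrics::System.monotonic_time
  rows_moved = archive_batches # hypothetical helper returning a row count

  ARCHIVED_RECORDS_TOTAL.increment({}, rows_moved)
ensure
  JOB_DURATION_SECONDS.observe({}, ::Gitlab::Metrics::System.monotonic_time - started)
end
```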

Phase 2: Space Reclamation Preparation (18.6 - on hold)

Scope

At this point, the original tables hold only the records allowed by the retention policies, but their on-disk size is unchanged.
Prepare for space reclamation using either pg_repack or the clean-table approach.

Actions

If pg_repack is not viable:

  • authentication_events (beginning of 18.6):
    • Create authentication_events_clean table
    • Implement table sync triggers (see the sketch below)
    • Begin backfill of non-revoked records
    • Monitor sync accuracy
  • OAuth tables (middle of 18.6):
    • Same process for OAuth tables:
      • oauth_access_grants_clean
      • oauth_access_tokens_clean
  • Validation period (mid-late 18.6):
    • Monitor sync accuracy between original and clean tables
    • Verify archival process stability
    • Confirm no performance degradation
    • Complete backfill process
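If the clean-table route is taken, the table and its sync trigger could be created roughly as follows (a sketch; the function and trigger names are illustrative, and the delete-then-insert upsert avoids hard-coding the column list). The backfill would then copy rows allowed by the retention policy in batches:

```ruby
# Sketch: clean table plus a trigger that mirrors every write from the
# original table, keeping the copies in sync while backfill runs.
class CreateAuthenticationEventsClean < Gitlab::Database::Migration[2.2]
  def up
    execute(<<~SQL)
      CREATE TABLE authentication_events_clean
        (LIKE authentication_events INCLUDING ALL);

      CREATE FUNCTION sync_authentication_events_clean()
      RETURNS trigger
      LANGUAGE plpgsql
      AS $$
      BEGIN
        IF TG_OP = 'DELETE' THEN
          DELETE FROM authentication_events_clean WHERE id = OLD.id;
          RETURN OLD;
        END IF;

        -- Upsert as delete-then-insert to stay independent of the column list.
        DELETE FROM authentication_events_clean WHERE id = NEW.id;
        INSERT INTO authentication_events_clean VALUES (NEW.*);
        RETURN NEW;
      END;
      $$;

      CREATE TRIGGER sync_authentication_events_clean
      AFTER INSERT OR UPDATE OR DELETE ON authentication_events
      FOR EACH ROW EXECUTE FUNCTION sync_authentication_events_clean()
    SQL
  end

  def down
    execute(<<~SQL)
      DROP TRIGGER sync_authentication_events_clean ON authentication_events;
      DROP FUNCTION sync_authentication_events_clean();
      DROP TABLE authentication_events_clean
    SQL
  end
end
```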
If using pg_repack:

  • Execute pg_repack operations on GitLab.com during low-traffic periods
  • Monitor database performance during and after operations
  • Discuss the rollout to self-managed (target: next release?)

Consideration

  • The self-managed cleanup worker must have been activated in the previous milestone to prevent potential workload issues.
  • Backfill must complete before any swap attempt
  • Sync validation: continuous monitoring of data consistency
  • Performance impact: measure and adjust batch sizes

Phase 3: Space Reclamation Execution (18.7)

Scope

At this point, the clean and original tables are in sync.
Execute the table swap after proven stability, then validate.

Actions

If using table swap:

  • First table swap execution (early 18.7):
    • Execute swap on GitLab.com using the replace_table helper or a dedicated migration on authentication_events_clean (see the sketch after this list)
    • Monitor for 1 week post-swap
  • Second table swap execution (mid 18.7):
    • Execute swap on GitLab.com on oauth_access_grants_clean and oauth_access_tokens_clean
    • Post-swap monitoring
  • Archived tables cleanup (late 18.7):
    • Drop the three archival tables:
      • authentication_events_archived_records
      • oauth_access_grants_archived_records
      • oauth_access_tokens_archived_records
    • Clean up related code
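If the replace_table helper turns out not to fit these unpartitioned tables (see "Under discussion" below), a dedicated swap migration could be as simple as two renames in one transaction (a sketch; sequence ownership, foreign keys, grants and the down direction are omitted):

```ruby
# Sketch: swap the clean table into place under an exclusive lock.
class SwapAuthenticationEventsClean < Gitlab::Database::Migration[2.2]
  def up
    execute(<<~SQL)
      LOCK TABLE authentication_events IN ACCESS EXCLUSIVE MODE;

      -- Stop mirroring writes before the rename.
      DROP TRIGGER sync_authentication_events_clean ON authentication_events;
      DROP FUNCTION sync_authentication_events_clean();

      ALTER TABLE authentication_events RENAME TO authentication_events_deleted;
      ALTER TABLE authentication_events_clean RENAME TO authentication_events
    SQL
  end
end
```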

If using pg_repack:

  • At this point, the original tables have already been resized.
  • Bring Phase 4 forward: key learnings and documentation (18.8)

Consideration

  • GitLab.com first: validate swap process before self-managed
  • Testing: Use DBLab clones for migration testing before production
  • Proven stability on GitLab.com before self-managed rollout

Phase 4: Key learnings and documentation (18.8)

Scope

Extract key learnings and finalize documentation

Actions


Hiding the timeline for now until we get clarity on pg_repack availability and which other environments to target.

Old timeline

Timeline

| Phase | Timing | GitLab.com | Self-Managed | Validation Checkpoints |
|-------|--------|------------|--------------|------------------------|
| 18.4 | Early | Deploy archive infrastructure. Enable cleanup for authentication_events | - | Monitor initial cleanup performance |
| 18.4 | Mid | Enable cleanup for OAuth tables | - | Verify no performance degradation |
| 18.4 | Late | - | - | Confirm cleanup is working as expected |
| 18.4 | Release | - | Provide archive infrastructure and enable cleanup for all tables (if GitLab.com stable) | |
| 18.5 | Early | Evaluate pg_repack viability | - | Validate pg_repack effectiveness |
| 18.5 | Early | Deploy authentication_events_clean table (if pg_repack not viable). Begin backfill process and sync | - | Monitor backfill and sync performance |
| 18.5 | Mid | Deploy oauth_access_grants_clean and oauth_access_tokens_clean tables. Begin backfill process and sync | - | Monitor backfill and sync performance |
| 18.5 | Late | Validation period for space reclamation. Document approach | Complete preparation | Verify data consistency. Confirm sync accuracy. Verify no performance issues |
| 18.5 | Release | - | Provide clean tables infrastructure and begin backfill process (if pg_repack not viable) | |
| 18.6 | Early | Execute space reclamation (pg_repack or table swap) on authentication_events_clean | - | Monitor post-reclamation stability. Verify successful space reclamation |
| 18.6 | Mid | Execute space reclamation (pg_repack or table swap) on oauth_access_grants_clean and oauth_access_tokens_clean | - | Monitor post-reclamation stability. Verify successful space reclamation |
| 18.6 | Late | Archived tables cleanup and related code | - | Verify successful table drop |
| 18.6 | Release | - | Execute space reclamation (pg_repack or table swap) on all tables. Provide archived tables cleanup | |
| 18.7 | Early | Refine process based on learnings | - | Review GitLab.com metrics |
| 18.7 | Mid | Extract key learnings. Finalize documentation | - | Publish final documentation |

Success Factors

  1. Growth control achieved: Feature flags enabled and cleanup workers successfully running
  2. Zero authentication downtime: No disruption to authentication flows during operations
  3. Space reclaimed: Measurable reduction in table sizes after operations
  4. Self-managed compatibility: Solution works for both GitLab.com and self-managed instances
  5. Maintainable solution: Ongoing cleanup processes prevent future bloat
  6. Standardized solution: the approach is documented to benefit the entire company.

Monitoring & Validation

  1. Table size tracking: Monitor size reduction over time
  2. Worker performance: Track job duration and records processed
  3. Application impact: Monitor authentication error rates and performance
  4. Data consistency: Validate data integrity throughout the process

Under discussion:

  • Space reclamation method: Database team is working on pg_repack. If available and proven viable by end of 18.4, this could be used in Phase 2 (18.5) instead of introducing the clean table to swap, significantly lowering the complexity
  • Table swap approach: If pg_repack is not viable, need to determine if replace_table helper will work for unpartitioned tables or if a custom swap approach is needed
  • Monitoring setup: as mentioned in phase 1

Useful resources


Previous details are available here

We considered a couple of other approaches:

  1. Create a new table and switch to using that. This would require us to:
    • Create a new table
    • Batch migrate non-revoked records from the old table to the new table
    • Build fallbacks in the code to read from both tables. This means overriding a lot of functions defined in the gem.
    • Eventually remove the fallback once all records are migrated
    • Drop the old table.
    • In parallel, add a cron job that periodically deletes newly revoked records.
  2. Use partitioning with a strategy other than date range. We haven't figured out the detailed steps for this approach yet; it may be possible with a variant of this approach.

Both of these approaches would involve significant effort and complexity, and therefore risk.

Additionally, we chose to use a feature-flag-backed Sidekiq worker (configured as a cron job) to delete the existing records, instead of a batched background migration. Reasons:

  • We can address the problem for GitLab.com first, which is where the problem is likely to be most severe.
  • We have the ability to quickly turn it off with the feature flag if things go wrong.
  • We can schedule the cron job to run at a time when the team is available for monitoring.
  • Our self-managed customers can benefit later, when we enable the feature flag by default; they would have the option to disable it if things go wrong.
  • We avoid duplicating code between a worker and a background migration.

However, we were recommended the batched background migration approach for deleting existing records, for the following reasons:

  • The cleanup would be available to self-managed customers at the same time as GitLab.com.
  • A batched background migration would significantly outperform this solution when the database is not busy, because it uses a dynamic batch size, which the worker approach lacks.
  • We wouldn't need the runtime limiter in the worker if we used a BBM for the existing records

We would appreciate advice on the following questions:

  1. Is running VACUUM FULL to reclaim disk space on these tables an option, knowing that this will need a maintenance window?
  2. Given that this issue will be resolved when the cells architecture is live and we eventually deprecate the legacy cell (this may be a few years out), can we live with the current table sizes and only prevent further growth?
  3. The new archival tables that we create will also potentially grow large when we delete the existing backlog of records that has never been cleaned up (on the order of hundreds of millions). Given that these tables won't be used for live user requests, is running VACUUM FULL an option for them?
  4. Which approach is recommended to delete the existing records, given we want to be able to turn off the deletion with a feature flag and we want to run it at a time when the authentication team is available for monitoring?