
Solution for large table maintenance


Description

The group::authentication team is currently working on table size maintenance for 3 large, growing tables:

The oauth_access_tokens and oauth_access_grants tables have a large number of "revoked" rows that can be safely deleted. The plan is to add a cron job that periodically deletes the unneeded rows (first moving them to a separate archival table, then deleting them from the archival table after a specific time period). Just deleting the rows will not reclaim the space, though; it will only prevent further growth. These tables also don't lend themselves well to partitioning, since they are mostly queried by individual text columns (token and refresh_token).
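For illustration, the move-then-prune step for one sub-batch could look like the following (a sketch only: it assumes the archive table mirrors the source table's columns, as in the plan below; revoked_at is the Doorkeeper column marking revocation, and the interval and batch size are illustrative):

```ruby
# Sketch: atomically move one batch of long-revoked tokens into the archive
# table. A separate job would later prune the archive itself.
ApplicationRecord.connection.execute(<<~SQL)
  WITH batch AS (
    SELECT id
    FROM oauth_access_tokens
    WHERE revoked_at IS NOT NULL
      AND revoked_at < NOW() - INTERVAL '1 month'
    ORDER BY id
    LIMIT 1000
    FOR UPDATE SKIP LOCKED
  ), moved AS (
    DELETE FROM oauth_access_tokens t
    USING batch
    WHERE t.id = batch.id
    RETURNING t.*
  )
  INSERT INTO oauth_access_tokens_archived_records
  SELECT * FROM moved
SQL
```

FOR UPDATE SKIP LOCKED keeps concurrent runs from contending on the same rows, and doing the whole move in a single data-modifying CTE keeps it atomic.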

Following is the plan to handle this cleanup, based on our discussion so far.

Note: parts of this plan (Phases 2 - 4) may change depending on the pg_repack implementation.

Policies

  • authentication_events: 1 year's worth of events
  • OAuth access grants: active grants and 1 month of revoked grants
  • OAuth access tokens: active access tokens, active refresh tokens, and 1 month of revoked tokens

Expected Results

  • 3 temporary archive tables (new), persisted for 2 or 3 milestones for soft-deletion purposes
    • authentication_events_archived_records
    • oauth_access_grants_archived_records
    • oauth_access_tokens_archived_records
  • 3 cleanup workers (new)
  • 3 feature flags (new) for cleanup workers:
    • archive_revoked_authentication_events
    • archive_revoked_access_grants
    • archive_revoked_access_tokens
  • 3 temporary clean tables (new), if pg_repack is not viable:
    • authentication_events_clean
    • oauth_access_grants_clean
    • oauth_access_tokens_clean
  • 3 original tables reclaimed (either via pg_repack or table swap):
    • authentication_events
    • oauth_access_grants
    • oauth_access_tokens
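For illustration, each archive table could be created as a plain mirror of its source table (a hedged sketch: Gitlab::Database::Migration is GitLab's migration base class, but the version number and the INCLUDING ALL choice are illustrative):

```ruby
# Sketch: archive table that mirrors the original, so rows can be moved with
# a plain INSERT ... SELECT. INCLUDING ALL also copies defaults and indexes.
class CreateOauthAccessTokensArchivedRecords < Gitlab::Database::Migration[2.2]
  def up
    execute(<<~SQL)
      CREATE TABLE oauth_access_tokens_archived_records (
        LIKE oauth_access_tokens INCLUDING ALL
      )
    SQL
  end

  def down
    drop_table :oauth_access_tokens_archived_records
  end
end
```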

Phase 1: Archive Infrastructure & Initial Cleanup (18.4 - 18.5)

Scope

Create archive infrastructure and begin actual cleanup on GitLab.com

Actions

  • authentication_events (beginning of 18.4):
    • Create authentication_events_archived_records (unpartitioned), for soft-deletion purposes
    • Implement the archival worker with a batch size of 10,000 and a sub-batch size of 1,000 (see the sketch after this list)
    • Enable feature flag on GitLab.com ONLY: archive_revoked_authentication_events: true (GitLab.com), false (self-managed)
    • Begin moving revoked records to archive
    • Monitor cleanup performance and impact
    • Addressed with #545007
  • OAuth tables (middle of 18.4):
    • Same process for oauth_access_grants and oauth_access_tokens
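A sketch of what the archival workers could look like, assuming GitLab's usual Sidekiq conventions (ApplicationWorker and Feature are existing GitLab facilities; the class name and SQL are illustrative, and cron scheduling is configured separately):

```ruby
# Illustrative archival worker: moves up to BATCH_SIZE expired rows per run,
# in sub-batches, and is a no-op while its feature flag is disabled.
class ArchiveRevokedAuthenticationEventsWorker
  include ApplicationWorker # GitLab's Sidekiq worker concern

  BATCH_SIZE = 10_000     # maximum rows archived per run
  SUB_BATCH_SIZE = 1_000  # rows moved per statement

  def perform
    # The feature flag is the kill switch: disabled means do nothing.
    return unless Feature.enabled?(:archive_revoked_authentication_events)

    archived = 0

    while archived < BATCH_SIZE
      moved = archive_sub_batch
      break if moved.zero?

      archived += moved
    end
  end

  private

  # Moves up to SUB_BATCH_SIZE rows older than the retention policy into the
  # archive table; returns the number of rows moved.
  def archive_sub_batch
    ApplicationRecord.connection.execute(<<~SQL).cmd_tuples
      WITH batch AS (
        SELECT id FROM authentication_events
        WHERE created_at < NOW() - INTERVAL '1 year'
        ORDER BY id
        LIMIT #{SUB_BATCH_SIZE}
        FOR UPDATE SKIP LOCKED
      ), moved AS (
        DELETE FROM authentication_events t
        USING batch
        WHERE t.id = batch.id
        RETURNING t.*
      )
      INSERT INTO authentication_events_archived_records
      SELECT * FROM moved
    SQL
  end
end
```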

When confident, end of 18.4/18.5:

  • Enable archival on self-managed:
    • Feature flags globally enabled
      • archive_revoked_authentication_events
      • archive_revoked_access_grants
      • archive_revoked_access_tokens
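On a single instance, enabling all three flags boils down to the Feature API (illustrative only; the GitLab.com rollout would go through the standard feature-flag process, and the self-managed default is flipped in the flag definition files):

```ruby
# Illustrative only: enable the three cleanup flags on one instance.
%i[
  archive_revoked_authentication_events
  archive_revoked_access_grants
  archive_revoked_access_tokens
].each { |flag| Feature.enable(flag) }
```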
Additionally, if pg_repack is viable:

  • Work with the Database Operations team to schedule pg_repack operations
  • Test pg_repack on smaller tables first
  • Prepare monitoring and rollback plans

Consideration

  • GitLab.com gets immediate benefit: cleanup starts reducing table growth
  • Self-managed protection: feature flags not enabled globally until proven stable
  • Team monitoring: team can monitor cleanup runs
  • Decision point: Evaluate pg_repack progress by end of 18.4 to determine Phase 2 approach

Open questions

How to set up monitoring:

  • Prometheus metrics for records archived and job duration
  • Grafana dashboard for table size trends
  • Error tracking in Sentry
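A possible shape for the metrics side, assuming GitLab's Gitlab::Metrics interface (the metric names and the archive_batches helper are made up for illustration; worker exceptions would reach Sentry through the standard Sidekiq error tracking):

```ruby
# Sketch: counters/histograms the archival worker could emit.
ARCHIVED_RECORDS_TOTAL = Gitlab::Metrics.counter(
  :archive_revoked_records_total,
  'Number of revoked records moved to archive tables'
)

JOB_DURATION_SECONDS = Gitlab::Metrics.histogram(
  :archive_revoked_job_duration_seconds,
  'Duration of one archival worker run'
)

def perform
  started = ::Gitlab::Metrics::System.monotonic_time
  rows_moved = archive_batches # hypothetical helper returning a row count

  ARCHIVED_RECORDS_TOTAL.increment({}, rows_moved)
ensure
  JOB_DURATION_SECONDS.observe({}, ::Gitlab::Metrics::System.monotonic_time - started)
end
```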

Phase 2: Space Reclamation Preparation (18.6 - on hold)

Scope

At this point, the original tables hold only the records allowed by the retention policies, but their on-disk size is unchanged.
Prepare for space reclamation using either pg_repack or the clean-table approach.

Actions

If pg_repack is not viable:

  • authentication_events (beginning of 18.6):
    • Create authentication_events_clean table
    • Implement table sync triggers (see the sketch below)
    • Begin backfill of non-revoked records
    • Monitor sync accuracy
  • OAuth tables (middle of 18.6):
    • Same process for OAuth tables:
      • oauth_access_grants_clean
      • oauth_access_tokens_clean
  • Validation period (mid-late 18.6):
    • Monitor sync accuracy between original and clean tables
    • Verify archival process stability
    • Confirm no performance degradation
    • Complete backfill process
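If the clean-table route is taken, the table and its sync trigger could be created roughly as follows (a sketch; the function and trigger names are illustrative, and the delete-then-insert upsert avoids hard-coding the column list). The backfill would then copy rows allowed by the retention policy in batches:

```ruby
# Sketch: clean table plus a trigger that mirrors every write from the
# original table, keeping the copies in sync while backfill runs.
class CreateAuthenticationEventsClean < Gitlab::Database::Migration[2.2]
  def up
    execute(<<~SQL)
      CREATE TABLE authentication_events_clean
        (LIKE authentication_events INCLUDING ALL);

      CREATE FUNCTION sync_authentication_events_clean()
      RETURNS trigger
      LANGUAGE plpgsql
      AS $$
      BEGIN
        IF TG_OP = 'DELETE' THEN
          DELETE FROM authentication_events_clean WHERE id = OLD.id;
          RETURN OLD;
        END IF;

        -- Upsert as delete-then-insert to stay independent of the column list.
        DELETE FROM authentication_events_clean WHERE id = NEW.id;
        INSERT INTO authentication_events_clean VALUES (NEW.*);
        RETURN NEW;
      END;
      $$;

      CREATE TRIGGER sync_authentication_events_clean
      AFTER INSERT OR UPDATE OR DELETE ON authentication_events
      FOR EACH ROW EXECUTE FUNCTION sync_authentication_events_clean()
    SQL
  end

  def down
    execute(<<~SQL)
      DROP TRIGGER sync_authentication_events_clean ON authentication_events;
      DROP FUNCTION sync_authentication_events_clean();
      DROP TABLE authentication_events_clean
    SQL
  end
end
```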
If using pg_repack:

  • Execute pg_repack operations on GitLab.com during low-traffic periods
  • Monitor database performance during and after operations
  • Discuss the rollout to self-managed (target: next release?)

Consideration

  • The self-managed cleanup worker must have been activated in the previous milestone to prevent potential workload issues.
  • Backfill must complete before any swap attempt
  • Sync validation: continuous monitoring of data consistency
  • Performance impact: measure and adjust batch sizes

Phase 3: Space Reclamation Execution (18.7)

Scope

At this point, the clean and original tables are in sync.
Execute the table swap after proven stability, then validate.

Actions

If using table swap:

  • First table swap execution (early 18.7):
    • Execute swap on GitLab.com using the replace_table helper or a dedicated migration on authentication_events_clean (see the sketch after this list)
    • Monitor for 1 week post-swap
  • Second table swap execution (mid 18.7):
    • Execute swap on GitLab.com on oauth_access_grants_clean and oauth_access_tokens_clean
    • Post-swap monitoring
  • Archived tables cleanup (late 18.7):
    • Drop the three archival tables:
      • authentication_events_archived_records
      • oauth_access_grants_archived_records
      • oauth_access_tokens_archived_records
    • Clean up related code
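If the replace_table helper turns out not to fit these unpartitioned tables (see "Under discussion" below), a dedicated swap migration could be as simple as two renames in one transaction (a sketch; sequence ownership, foreign keys, grants and the down direction are omitted):

```ruby
# Sketch: swap the clean table into place under an exclusive lock.
class SwapAuthenticationEventsClean < Gitlab::Database::Migration[2.2]
  def up
    execute(<<~SQL)
      LOCK TABLE authentication_events IN ACCESS EXCLUSIVE MODE;

      -- Stop mirroring writes before the rename.
      DROP TRIGGER sync_authentication_events_clean ON authentication_events;
      DROP FUNCTION sync_authentication_events_clean();

      ALTER TABLE authentication_events RENAME TO authentication_events_deleted;
      ALTER TABLE authentication_events_clean RENAME TO authentication_events
    SQL
  end
end
```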

If using pg_repack:

  • At this point, the original tables have already been resized.
  • Bring Phase 4 forward: key learnings and documentation (18.8)

Consideration

  • GitLab.com first: validate swap process before self-managed
  • Testing: Use DBLab clones for migration testing before production
  • Proven stability on GitLab.com before self-managed rollout

Phase 4: Key learnings and documentation (18.8)

Scope

Extract key learnings and finalize documentation

Actions


Hiding the timeline for now until we get clarity on pg_repack availability and which other environments to target.

Old timeline

Timeline

| Phase | Timing | GitLab.com | Self-Managed | Validation Checkpoints |
|-------|--------|------------|--------------|------------------------|
| 18.4 | Early | Deploy archive infrastructure. Enable cleanup for authentication_events | - | Monitor initial cleanup performance |
| 18.4 | Mid | Enable cleanup for OAuth tables | - | Verify no performance degradation |
| 18.4 | Late | - | - | Confirm cleanup is working as expected |
| 18.4 | Release | - | Provide archive infrastructure and enable cleanup for all tables (if GitLab.com stable) | |
| 18.5 | Early | Evaluate pg_repack viability | - | Validate pg_repack effectiveness |
| 18.5 | Early | Deploy authentication_events_clean table (if pg_repack not viable). Begin backfill process and sync | - | Monitor backfill and sync performance |
| 18.5 | Mid | Deploy oauth_access_grants_clean and oauth_access_tokens_clean tables. Begin backfill process and sync | - | Monitor backfill and sync performance |
| 18.5 | Late | Validation period for space reclamation. Document approach | Complete preparation | Verify data consistency. Confirm sync accuracy. Verify no performance issues |
| 18.5 | Release | - | Provide clean tables infrastructure and begin backfill process (if pg_repack not viable) | |
| 18.6 | Early | Execute space reclamation (pg_repack or table swap) on authentication_events_clean | - | Monitor post-reclamation stability. Verify successful space reclamation |
| 18.6 | Mid | Execute space reclamation (pg_repack or table swap) on oauth_access_grants_clean and oauth_access_tokens_clean | - | Monitor post-reclamation stability. Verify successful space reclamation |
| 18.6 | Late | Archived tables cleanup and related code | - | Verify successful table drop |
| 18.6 | Release | - | Execute space reclamation (pg_repack or table swap) on all tables. Provide archived tables cleanup | |
| 18.7 | Early | Refine process based on learnings | - | Review GitLab.com metrics |
| 18.7 | Mid | Extract key learnings. Finalize documentation | - | Publish final documentation |

Success Factors

  1. Growth control achieved: Feature flags enabled and cleanup workers successfully running
  2. Zero authentication downtime: No disruption to authentication flows during operations
  3. Space reclaimed: Measurable reduction in table sizes after operations
  4. Self-managed compatibility: Solution works for both GitLab.com and self-managed instances
  5. Maintainable solution: Ongoing cleanup processes prevent future bloat
  6. Standardized solution: the approach is documented to benefit the entire company.

Monitoring & Validation

  1. Table size tracking: Monitor size reduction over time
  2. Worker performance: Track job duration and records processed
  3. Application impact: Monitor authentication error rates and performance
  4. Data consistency: Validate data integrity throughout the process

Under discussion:

  • Space reclamation method: Database team is working on pg_repack. If available and proven viable by end of 18.4, this could be used in Phase 2 (18.5) instead of introducing the clean table to swap, significantly lowering the complexity
  • Table swap approach: If pg_repack is not viable, need to determine if replace_table helper will work for unpartitioned tables or if a custom swap approach is needed
  • Monitoring setup: as mentioned in phase 1

Useful resources


Previous details are available here

We considered a couple of other approaches:

  1. Create a new table and switch to using that. This would require us to:
    • Create a new table
    • Batch migrate non-revoked records from the old table to the new table
    • Build fallbacks in the code to read from both tables. This means overriding a lot of functions defined in the gem.
    • Eventually remove the fallback once all records are migrated
    • Drop the old table.
    • In parallel, add a cron job that periodically deletes newly revoked records.
  2. Use partitioning with a strategy other than date range. We haven't figured out the detailed steps for this approach yet; it may be possible with a variant of this approach.

Both of these approaches would involve significant effort and complexity, and therefore risk.

Additionally, we chose to use a feature-flag-backed Sidekiq worker (configured as a cron job) to delete the existing records, instead of a batched background migration. Reasons:

  • We can address the problem for GitLab.com first, which is where the problem is likely to be most severe.
  • We have the ability to quickly turn it off with the feature flag if things go wrong.
  • We can schedule the cron job to run at a time when the team is available for monitoring.
  • Our self-managed customers can benefit later, when we enable the feature flag by default; they would have the option to disable it if things go wrong.
  • We avoid duplicating code between a worker and a background migration.

However, we were recommended the batched background migration approach for deleting existing records, for the following reasons:

  • The cleanup would be available to self-managed customers at the same time as GitLab.com.
  • A batched background migration would significantly outperform this solution when the database is not busy, because it uses a dynamic batch size, which the worker approach lacks.
  • We wouldn't need the runtime limiter in the worker if we used a BBM for the existing records

We would appreciate advice on the following questions:

  1. Is running VACUUM FULL to reclaim disk space on these tables an option, knowing that this will need a maintenance window?
  2. Given that this issue will be resolved when the cells architecture is live and we eventually deprecate the legacy cell (this may be a few years out), can we live with the current table sizes and only prevent further growth?
  3. The new archival tables that we create will also potentially grow large when we delete the existing backlog of records that has never been cleaned up (on the order of hundreds of millions). Given that these tables won't be used for live user requests, is running VACUUM FULL an option for them?
  4. Which approach is recommended to delete the existing records, given we want to be able to turn off the deletion with a feature flag and we want to run it at a time when the authentication team is available for monitoring?