Solution for large table maintenance
Description
The group::authentication team is currently working on table size maintenance for 3 large, growing tables:
- oauth_access_grants (~43 GB)
- oauth_access_tokens (~200 GB)
- authentication_events (~53 GB)
The oauth_access_tokens and oauth_access_grants tables have a large number of "revoked" rows that can be safely deleted. The plan is to add a cron job that periodically deletes the unneeded rows (first moving them to a separate archival table and then deleting from the archival table after a specific time period).
Just deleting the rows will not reclaim the space, though; it will only prevent further growth. These tables also don't lend themselves well to partitioning, since they are mostly queried by individual text columns (token and refresh_token).
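To make the archive-then-delete idea concrete, here is a minimal sketch of one batch, assuming revoked rows carry a revoked_at timestamp and the archive table mirrors the source table's columns; the batch size and retention window are illustrative, not the worker's actual configuration.

```sql
-- Minimal sketch (not the actual worker): archive one batch of revoked tokens,
-- then delete them from the source table in the same statement.
-- Assumes the archive table mirrors the source columns (any extra archived_at
-- column falls back to its default value).
WITH batch AS (
  SELECT id
  FROM oauth_access_tokens
  WHERE revoked_at IS NOT NULL
    AND revoked_at < now() - interval '1 month'
  ORDER BY id
  LIMIT 1000
  FOR UPDATE SKIP LOCKED
), moved AS (
  DELETE FROM oauth_access_tokens
  WHERE id IN (SELECT id FROM batch)
  RETURNING *
)
INSERT INTO oauth_access_tokens_archived_records
SELECT * FROM moved;
```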
Following is the plan to handle this cleanup, based on our discussion so far.
Note: parts of this plan (phases 2-4) may change depending on the pg_repack implementation.
Policies
- Authentication events: 1 year's worth of events
- OAuth access grants: active grants and 1 month of revoked grants
- OAuth access tokens: active access tokens, active refresh tokens, and 1 month of revoked tokens
Expected Results
- 3 temporary archive tables (new), persisted for 2 or 3 milestones for soft-deletion purposes (a possible table shape is sketched after this list):
  - authentication_events_archived_records
  - oauth_access_grants_archived_records
  - oauth_access_tokens_archived_records
- 3 cleanup workers (new)
- 3 feature flags (new) for cleanup workers:
  - archive_revoked_authentication_events
  - archive_revoked_access_grants
  - archive_revoked_access_tokens
- 3 temporary clean tables (new), if pg_repack is not viable:
  - authentication_events_clean
  - oauth_access_grants_clean
  - oauth_access_tokens_clean
- 3 original tables reclaimed (either via pg_repack or table swap):
  - authentication_events
  - oauth_access_grants
  - oauth_access_tokens
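For illustration, one possible shape for these archive tables (the actual migrations may differ) is to mirror the source table's columns and stamp each row with when it was archived:

```sql
-- Hypothetical shape of a temporary archive table: same columns as the source,
-- plus a timestamp so rows can be purged once the soft-deletion window expires.
CREATE TABLE oauth_access_tokens_archived_records (
  LIKE oauth_access_tokens,
  archived_at timestamptz NOT NULL DEFAULT now()
);
```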
Phase 1: Archive Infrastructure & Initial Cleanup (18.4 - 18.5)
Scope
Create archive infrastructure and begin actual cleanup on GitLab.com
Actions
- authentication_events (beginning of 18.4):
  - Create authentication_events_archived_records, unpartitioned, for soft-deletion purposes
  - Implement archival worker with batch size of 10,000 and sub-batch size of 1,000
  - Enable feature flag on GitLab.com ONLY: archive_revoked_authentication_events: true (GitLab.com), false (self-managed)
  - Begin moving revoked records to archive
  - Monitor cleanup performance and impact
  - Addressed with #545007
- OAuth tables (middle of 18.4):
  - Same infrastructure setup
  - Enable on GitLab.com: archive_revoked_access_grants and archive_revoked_access_tokens as true (GitLab.com), false (self-managed)
  - Start actual cleanup process
  - Addressed with:
When confident, end of 18.4/18.5:
- Enable archival on self-managed:
  - Feature flags globally enabled:
    - archive_revoked_authentication_events
    - archive_revoked_access_grants
    - archive_revoked_access_tokens
#### Additionally, if pg_repack is viable:
- Work with Database Operations team to schedule pg_repack operations
- Test pg_repack on smaller tables first
- Prepare monitoring and rollback plans
Consideration
- GitLab.com gets immediate benefit: cleanup starts reducing table growth
- Self-managed protection: feature flags not enabled globally until proven stable
- Team monitoring: team can monitor cleanup runs
- Decision point: Evaluate pg_repack progress by end of 18.4 to determine Phase 2 approach
Open questions
How to set up monitoring:
- Prometheus metrics for records archived and job duration
- Grafana dashboard for table size trends (a size query is sketched below)
- Error tracking in Sentry
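As one concrete input for the table size dashboard, a query along these lines (illustrative, not necessarily the final metric source) tracks total on-disk size per table:

```sql
-- Total on-disk size (table + indexes + TOAST) for the three tables.
SELECT relname,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relname IN ('authentication_events', 'oauth_access_grants', 'oauth_access_tokens');
```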
Phase 2: Space Reclamation Preparation (18.6 - on hold)
Scope
At this point, the original tables hold only the records allowed by the retention policies, but their size on disk is unchanged.
Prepare for space reclamation using either pg_repack or the clean-table approach.
Actions
#### If pg_repack is not viable:
- authentication_events (beginning of 18.6):
  - Create authentication_events_clean table
  - Implement table sync triggers (a sketch follows this list)
  - Begin backfill of non-revoked records
  - Monitor sync accuracy
- OAuth tables (middle of 18.6):
  - Same process for OAuth tables:
    - oauth_access_grants_clean
    - oauth_access_tokens_clean
- Validation period (mid-late 18.6):
  - Monitor sync accuracy between original and clean tables
  - Verify archival process stability
  - Confirm no performance degradation
  - Complete backfill process
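A minimal sketch of the sync-trigger idea referenced above, assuming authentication_events_clean mirrors the original table exactly and has a primary key on id; the real implementation also needs UPDATE/DELETE handling and has to coordinate with the backfill:

```sql
-- Keep the clean table in sync with new inserts on the original table.
CREATE OR REPLACE FUNCTION sync_authentication_events_clean()
RETURNS trigger AS $$
BEGIN
  INSERT INTO authentication_events_clean
  SELECT NEW.*
  ON CONFLICT (id) DO NOTHING;  -- the backfill may already have copied this row
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER authentication_events_sync_insert
AFTER INSERT ON authentication_events
FOR EACH ROW EXECUTE FUNCTION sync_authentication_events_clean();
```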
#### If using pg_repack:
- Execute pg_repack operations on GitLab.com during low-traffic periods
- Monitor database performance during and after operations
- Discuss the rollout to self-managed (target: next release?)
Consideration
- The self-managed cleanup worker must have been activated in the previous milestone to prevent potential workload issues.
- Backfill must complete before any swap attempt
- Sync validation: continuous monitoring of data consistency (an example check is sketched after this list)
- Performance impact: measure and adjust batch sizes
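As an example of what the consistency monitoring could check (row counts only; a real validation would also compare per-row data, and it assumes events carry a created_at timestamp):

```sql
-- Rows the retention policy keeps in the original table vs. rows in the clean table.
-- The created_at filter mirrors the 1-year authentication_events policy above.
SELECT
  (SELECT count(*) FROM authentication_events
    WHERE created_at >= now() - interval '1 year') AS expected_rows,
  (SELECT count(*) FROM authentication_events_clean) AS clean_table_rows;
```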
Phase 3: Space Reclamation Execution (18.7)
Scope
At this point, clean and original tables are in sync.
Execute table swap after proven stability and validate.
Actions
#### If using table swap:
- First table swap execution (early 18.7):
  - Execute swap on GitLab.com using the replace_table helper or a dedicated migration on authentication_events_clean (a generic sketch of the swap follows this list)
  - Monitor for 1 week post-swap
- Second table swap execution (mid 18.7):
  - Execute swap on GitLab.com on oauth_access_grants_clean and oauth_access_tokens_clean
  - Post-swap monitoring
- Archived tables cleanup (late 18.7):
  - Drop the three archival tables:
    - authentication_events_archived_records
    - oauth_access_grants_archived_records
    - oauth_access_tokens_archived_records
  - Clean up related code
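Not the replace_table helper itself, but a generic sketch of the swap it would perform, assuming the sync triggers have kept authentication_events_clean current and that indexes, sequences, and constraints are handled separately:

```sql
-- Generic table swap sketch; the replace_table helper or a dedicated migration
-- would add index/sequence/constraint handling around this.
BEGIN;
LOCK TABLE authentication_events, authentication_events_clean IN ACCESS EXCLUSIVE MODE;
ALTER TABLE authentication_events RENAME TO authentication_events_old;
ALTER TABLE authentication_events_clean RENAME TO authentication_events;
COMMIT;
-- authentication_events_old is kept briefly as a safety net, then dropped.
```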
#### If using pg_repack:
- At this point, the original tables have already been resized.
- Anticipate Phase 4: Key learnings and documentation (18.8)
Consideration
- GitLab.com first: validate swap process before self-managed
- Testing: Use DBLab clones for migration testing before production
- Proven stability on GitLab.com before self-managed rollout
Phase 4: Key learnings and documentation (18.8)
Scope
Extract the key learnings and finalize documentation
Actions
- Provide structured documentation to be discussed and published under https://docs.gitlab.com/development/database/
Hiding the timeline for now until we get clarity on pg_repack availability and which other environments to target.
Old timeline
Timeline
| Phase | Timing | GitLab.com | Self-Managed | Validation Checkpoints |
|---|---|---|---|---|
| 18.4 | Early | Deploy archive infrastructure; enable cleanup for authentication_events | - | Monitor initial cleanup performance |
| 18.4 | Mid | Enable cleanup for OAuth tables | - | Verify no performance degradation |
| 18.4 | Late | - | - | Confirm cleanup is working as expected |
| 18.4 | Release | - | Provide archive infrastructure and enable cleanup for all tables (if GitLab.com stable) | |
| 18.5 | Early | Evaluate pg_repack viability | - | Validate pg_repack effectiveness |
| 18.5 | Early | Deploy authentication_events_clean table (if pg_repack not viable); begin backfill process and sync | - | Monitor backfill and sync performance |
| 18.5 | Mid | Deploy oauth_access_grants_clean and oauth_access_tokens_clean tables; begin backfill process and sync | - | Monitor backfill and sync performance |
| 18.5 | Late | Validation period for space reclamation; document approach | Complete preparation | Verify data consistency; confirm sync accuracy; verify no performance issues |
| 18.5 | Release | - | Provide clean tables infrastructure and begin backfill process (if pg_repack not viable) | |
| 18.6 | Early | Execute space reclamation (pg_repack or table swap) on authentication_events_clean | - | Monitor post-reclamation stability; verify successful space reclamation |
| 18.6 | Mid | Execute space reclamation (pg_repack or table swap) on oauth_access_grants_clean and oauth_access_tokens_clean | - | Monitor post-reclamation stability; verify successful space reclamation |
| 18.6 | Late | Archived tables cleanup and related code | - | Verify successful table drop |
| 18.6 | Release | - | Execute space reclamation (pg_repack or table swap) on all tables; provide archived tables cleanup | |
| 18.7 | Early | Refine process based on learnings | - | Review GitLab.com metrics |
| 18.7 | Mid | Extract key learnings; finalize documentation | - | Publish final documentation |
Success Factors
- Growth control achieved: Feature flags enabled and cleanup workers successfully running
- Zero authentication downtime: No disruption to authentication flows during operations
- Space reclaimed: Measurable reduction in table sizes after operations
- Self-managed compatibility: Solution works for both GitLab.com and self-managed instances
- Maintainable solution: Ongoing cleanup processes prevent future bloat
- Standardized solution: the approach is documented to benefit the entire company.
Monitoring & Validation
- Table size tracking: Monitor size reduction over time
- Worker performance: Track job duration and records processed
- Application impact: Monitor authentication error rates and performance
- Data consistency: Validate data integrity throughout the process
Under discussion:
- Space reclamation method: the Database team is working on pg_repack. If available and proven viable by end of 18.4, this could be used in Phase 2 (18.5) instead of introducing the clean tables to swap, significantly lowering the complexity.
- Table swap approach: if pg_repack is not viable, we need to determine if the replace_table helper will work for unpartitioned tables or if a custom swap approach is needed.
- Monitoring setup: as mentioned in Phase 1.
Useful resources
- Partition strategy: Using Time-range Partitioning
- Grafana dashboard to monitor table size: Explore
Previous details are available here
We considered a couple of other approaches:
- Create a new table and switch to using that. This would need us to:
- Create a new table
- Batch migrate non-revoked records from the old table to the new table
- Build fallbacks in the code to read from both tables. This means overriding a lot of functions defined in the gem.
- Eventually remove the fallback once all records are migrated
- Drop the old table.
- In parallel, add a cron job that periodically deletes newly revoked records.
- Use partitioning with a strategy other than date range. We haven't figured out the detailed steps for this approach yet; it may be possible with a variant of this approach.
Both of these approaches would involve significant effort and complexity, and therefore risk.
Additionally, we chose to use a feature-flag-backed Sidekiq worker (configured as a cron job) to delete the existing records instead of a batched background migration. Reasons:
- We can address the problem for gitlab.com first, which is where the problem is likely to be most severe
- We have the ability to quickly turn it off with the feature flag, if things go wrong.
- We can schedule the cron job to run at a time when the team is available for monitoring.
- Our self-managed customers can benefit later, when we enable the feature flag by default; they would have the option to disable the feature flag if things go wrong.
- We don't create duplicate code in a worker and a background migration
However, the batched background migration approach was recommended to us for deleting the existing records, for the following reasons:
- The cleanup would be available to self-managed customers at the same time as gitlab.com
- A batched background migration would significantly outperform this solution when the database is not busy, because it has a dynamic batch size and the worker approach doesn't have one
- We wouldn't need the runtime limiter in the worker if we used a BBM for the existing records
We would appreciate advice on the following questions:
- Is running a full vacuum to reclaim disk space on these tables an option, knowing that this will need a maintenance window? (What a full vacuum entails is sketched after this list.)
- Given that this issue will get resolved when the cells architecture is live and we eventually deprecate the legacy cell (this may be a few years out), can we live with the current table sizes and only prevent further growth?
- The new archival tables that we create will also potentially grow to a large size when we delete the existing batch of records that have never been cleaned up (on the order of hundreds of millions). Given that these tables won't be used for live user requests, is running a full vacuum an option for these tables?
- Which approach is recommended to delete the existing records, given we want to be able to turn off the deletion with a feature flag and we want to run it at a time when the authentication team is available for monitoring?
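For context on the full vacuum questions above: it rewrites the table and returns space to the operating system, but holds an ACCESS EXCLUSIVE lock for the whole rewrite, which is why it needs a maintenance window for live tables and is more palatable for archive tables that serve no user traffic.

```sql
-- Rewrites the table, reclaiming disk space; blocks reads and writes while running.
VACUUM (FULL, VERBOSE, ANALYZE) oauth_access_tokens_archived_records;
```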