# Unified Proposal: Tracking Registry Migrations, GC, and Database Health & Alerts in a Dedicated Admin Page
## Summary

This proposal unifies the requirements from [issue #444201](https://gitlab.com/gitlab-org/gitlab/-/issues/444201) and [epic #12534](https://gitlab.com/groups/gitlab-org/-/epics/12534) into a comprehensive Container Registry Health & Alerts admin page. The page gives administrators visibility into critical registry subsystems with **direct action capabilities** and **proactive alerting**: **all database migration types** (pre-deployment, post-deployment, and background migrations), **online garbage collection**, and **comprehensive database health** (including load balancing support).

**Key Capabilities**:

- **Direct Action**: Apply pending post-deployment migrations and control background migrations directly from the UI
- **Proactive Alerting**: Admin banner alerts for critical issues requiring immediate attention
- **Comprehensive Monitoring**: Unified view of migrations, GC, and database health
- **Upgrade Safety**: Prevent upgrade-related downtime with clear readiness indicators

## Background & Problem Statement

As Container Registry adoption grows with metadata database enablement, self-managed administrators face several operational challenges:

### 1. Migration Management & Alerting Gaps

**Post-deployment migrations**: These migrations are critical for registry stability and performance:

- **They must complete before the next upgrade** to prevent downtime and upgrade failures
- If pending post-deployment migrations exist when upgrading to a newer registry version, the upgrade can fail or cause extended downtime
- **Administrators need the ability to apply these migrations directly from the admin interface**
- **Critical pending migrations should trigger admin banner alerts**

**Background migrations**: Long-running data migrations need monitoring, control, and alerting:

- **There is no easy way to pause migrations during high-load periods**
- **Failed migrations require manual intervention, which means accessing registry infrastructure (e.g., database or pod nodes)**
- **Critical failures may go unnoticed until they block other operations**
- **Admins need proactive alerts when background migrations are stuck or must complete before an upgrade**

Currently, admins must rely on logs or, in most cases, connect to the server where the registry or its database runs and issue CLI commands or database queries to track migrations and migration behaviour manually. This leads to:

- Unexpected upgrade failures when post-deployment migrations are incomplete
- Extended downtime during upgrades, as migrations that should have run earlier are now blocking
- Background migrations failing silently without admin awareness
- Inability to respond quickly to migration issues

### 2. GC & Database Health Alerting Gaps

**Online Garbage Collection**: Critical issues like stuck queues, high failure rates, or complete GC stalls go undetected until users report problems or storage costs spike.
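To make the GC detection concrete: the stall condition described above (work still queued, but no successful deletion for hours) can be expressed as a small check. The following is an illustrative Python sketch under assumed names and thresholds, not the registry's actual implementation:

```python
from datetime import datetime, timedelta

# Illustrative threshold; the proposal's alerting rules may use different values.
STALL_THRESHOLD = timedelta(hours=4)

def gc_is_stalled(last_successful_deletion: datetime,
                  queued_tasks: int,
                  now: datetime) -> bool:
    """A GC stall means work is queued but nothing has been deleted
    for longer than the stall threshold."""
    if queued_tasks == 0:
        return False  # empty queue: GC is idle, not stalled
    return now - last_successful_deletion > STALL_THRESHOLD
```

A monitoring job would raise a critical alert whenever this check returns true for the configured window.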
**Database Load Balancing**: Replica lag and connection pool saturation are not surfaced to admins proactively, leading to:

- Stale data reads from lagging replicas
- Service degradation from pool exhaustion

**Need**: Proactive banner alerts in the Admin Area when critical thresholds are crossed.

## Objectives

### Primary Goals

1. **Enable direct migration management** from the Admin Area:
   - Apply pending post-deployment migrations with one click
   - Pause/resume/retry background migrations
   - Clear status feedback and progress tracking
2. **Implement proactive admin banner alerting** for:
   - Pending post-deployment migrations blocking upgrades
   - Critical background migration failures or completion requirements
   - Severe GC issues (stuck queues, high failure rates)
   - Critical database issues (replica lag, pool saturation, read-only mode)
3. **Prevent upgrade-related downtime** by alerting administrators to incomplete post-deployment migrations before they attempt upgrades
4. Provide visibility into **all three migration types**: pre-deployment, post-deployment, and background migrations
5. Track background migration progress and completion estimates
6. Monitor online garbage collection health and performance
7. Support database load balancing health checks (primary/replica status, replication lag, per-host metrics)
8. Surface critical database health metrics for both standalone and load-balanced configurations

## Proposed Solution

### Location & Integration

**Admin Area → Monitoring → Container Registry Health**

The page integrates into the existing Admin Area health check infrastructure alongside the GitLab Rails application's health checks, maintaining consistency with GitLab's monitoring patterns.

## Admin Banner Alerts

### Overview

Critical issues trigger persistent banner alerts at the top of the Admin Area, visible across all admin pages until resolved. Alerts provide clear status, impact, and direct actions to resolve.
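The banner behavior described above (persistent, prioritized, actionable) implies a small amount of ordering logic: the Alert Notification System section later in this proposal displays critical alerts before warnings and caps concurrent banners at three. A hedged Python sketch of that queueing rule (all names here are illustrative, not GitLab's implementation):

```python
from dataclasses import dataclass

MAX_BANNERS = 3  # cap from the proposal: at most 3 concurrent banners

@dataclass
class Alert:
    severity: str  # "critical" or "warning"
    message: str

def select_banners(alerts):
    """Order alerts critical-first (stable within each severity) and
    split them into displayed banners and a waiting queue."""
    ordered = sorted(alerts, key=lambda a: a.severity != "critical")
    return ordered[:MAX_BANNERS], ordered[MAX_BANNERS:]
```

Lower-priority alerts stay queued and surface only when a banner slot frees up, which keeps the Admin Area from drowning in simultaneous banners.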
### Alert Triggers

**Critical**:

- Post-deployment migrations pending for >7 days (upgrade risk)
- Background migration failed with retries exhausted (blocks operations)
- Primary database offline (complete service failure)
- GC completely stalled for >4 hours (storage issues imminent)

**Warning**:

- Post-deployment migrations pending (should complete before upgrade)
- Background migration must complete before next upgrade
- Background migration paused for >24 hours
- Replica lag >30 seconds (data consistency risk)
- Connection pool >90% utilized (performance degradation)
- GC queue growing rapidly (storage pressure)

### Banner Design

**Critical Alert Example**:

```
┌──────────────────────────────────────────────────────────────┐
│ ✗ Container Registry Critical Issue                          │
│                                                              │
│ Post-deployment migrations pending for 8 days - upgrade      │
│ blocked. These migrations must complete before upgrading     │
│ to prevent downtime.                                         │
│                                                              │
│ [Apply Migrations Now]  [View Details]  [Dismiss]            │
└──────────────────────────────────────────────────────────────┘
```

**Warning Alert Example**:

```
┌──────────────────────────────────────────────────────────────┐
│ ⚠ Container Registry Attention Required                      │
│                                                              │
│ Background migration "migrate_media_types" failed after 3    │
│ retries. Registry performance may degrade.                   │
│                                                              │
│ [Retry Migration]  [View Details]  [Dismiss]                 │
└──────────────────────────────────────────────────────────────┘
```

Alerts remain visible until the issue is resolved or an admin explicitly dismisses them.

## Component 1: Database Migrations Status & Control

### Direct Action Capabilities

**Post-Deployment Migrations**:

- **Apply**: One-click execution of pending post-deployment migrations

**Background Migrations**:

- **Pause**: Temporarily halt migration execution
- **Resume**: Restart paused migrations
- **Retry**: Manually retry failed migrations
- Real-time progress updates

### UI Design

**Healthy State with Actions**:

```
┌──────────────────────────────────────────────────────────────┐
│ Database Migrations                                     ✓ OK │
├──────────────────────────────────────────────────────────────┤
│ Schema Version: 20241115_add_tag_index                       │
│ Last Migration Applied: 2024-11-15 14:23:45 UTC              │
│ Next Upgrade Ready: Yes - all migrations current             │
│                                                              │
│ ┌─ Post-Deployment Migrations ───────────────────────────┐   │
│ │ Status: All current                                    │   │
│ │ Pending: 0                                             │   │
│ │ Last Applied: 20241115_add_gc_blob_index               │   │
│ │ Upgrade Safety: ✓ Safe to upgrade                      │   │
│ └────────────────────────────────────────────────────────┘   │
│                                                              │
│ ┌─ Background Migrations ────────────────────────────────┐   │
│ │ Status: None active                                    │   │
│ │ Active Jobs: 0 | Queued: 0 | Failed: 0                 │   │
│ └────────────────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────────────────┘
```

**Alert State - Post-Deployment with Actions**:

```
┌──────────────────────────────────────────────────────────────┐
│ Database Migrations                                   ⚠ WARN │
├──────────────────────────────────────────────────────────────┤
│ Schema Version: 20241110_baseline                            │
│ Next Upgrade Ready: NO - Complete migrations first           │
│                                                              │
│ ┌─ Post-Deployment Migrations ───────────────────────────┐   │
│ │ Status: ⚠ 2 pending - UPGRADE RISK                     │   │
│ │ Upgrade Safety: ✗ NOT safe to upgrade                  │   │
│ │                                                        │   │
│ │ ⚠ These migrations must complete before upgrading      │   │
```
```
│ │                                                        │   │
│ │ Pending Migrations:                                    │   │
│ │                                                        │   │
│ │ • 20241115_add_gc_blob_index (High Priority)           │   │
│ │   Impact: Required for v3.96.0 GC improvements         │   │
│ │                                                        │   │
│ │ • 20241118_add_manifest_references_idx (Medium)        │   │
│ │   Impact: Required for v3.96.0 manifest operations     │   │
│ │                                                        │   │
│ │ [Apply All Migrations]  [View Documentation]           │   │
│ └────────────────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────────────────┘
```

**Background Migrations - With Control Actions**:

```
┌─ Background Migrations ──────────────────────────────────────┐
│ Status: ⚠ In Progress with Issues                            │
│ Running: 1 | Pending: 1 | Failed: 1                          │
│                                                              │
│ v Running Jobs:                                              │
│                                                              │
│ • migrate_media_types_id_bigint                              │
│   Progress: [████░░░░░░░░░░░░░░] 2.5M / 150M rows (1.67%)    │
│   Rate: ~50K rows/min | ETA: 2 days, 3 hours                 │
│   Status: Running                                            │
│   [Pause]                                                    │
│                                                              │
│ v Pending Jobs:                                              │
│                                                              │
│ • backfill_repository_paths                                  │
│   Progress: [░░░░░░░░░░░░░░░░] 0K / 892K rows (0%)           │
│   ETA: 6 hours                                               │
│   Status: Active                                             │
│   [Pause]                                                    │
│                                                              │
│ v Failed Jobs:                                               │
│                                                              │
│ • backfill_repository_paths_2                                │
│   Progress: [░░░░░░░░░░░░░░░░] 2K / 892K rows (1%)           │
│   ETA: 6 hours                                               │
│   Status: Failed | Reason                                    │
│   [Retry]                                                    │
│                                                              │
│ [Pause All]                                                  │
└──────────────────────────────────────────────────────────────┘
```

**During Migration Execution**:

```
┌─ Post-Deployment Migration in Progress ──────────────────────┐
│                                                              │
│ Applying:                                                    │
│ - 20241115_add_gc_blob_index                                 │
│ - 20251115_add_gc_blob_index                                 │
│ Elapsed: 8m 32s                                              │
│                                                              │
│ ℹ The registry remains fully operational during this         │
│   migration. Do not close this page.                         │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```

## Component 2: Online Garbage Collection Health & Alerts

### Alert Triggers

**Critical Alerts** (trigger banner):

- GC completely stalled for >4 hours
- Delete failure rate >50% for >1 hour
- Queue size doubling every hour (runaway growth)
- Storage backend inaccessible

**Warning Alerts** (trigger banner):

- Overdue task count >10,000
- Delete failure rate >20% for >24 hours
- Queue growth rate concerning but not yet critical
- GC disabled when it should be running

### UI Design

**Healthy State**:

```
┌──────────────────────────────────────────────────────────────┐
│ Online Garbage Collection                               ✓ OK │
├──────────────────────────────────────────────────────────────┤
│ Status: Running                                              │
│ Last Successful Run: 3 minutes ago                           │
│                                                              │
│ Queue Status:                                                │
│   Blob Review Queue:     1,234 items (↓ decreasing)          │
│   Manifest Review Queue: 45 items (↓ decreasing)             │
│   Overdue Tasks:         0                                   │
│                                                              │
│ Performance (Last 24h):                                      │
│   Blobs Deleted:     45,189 (99.9% success)                  │
│   Manifests Deleted: 1,523                                   │
│   Storage Reclaimed: 124.5 GB                                │
│                                                              │
│ Health Indicators:                                           │
│   ✓ Review queues decreasing steadily                        │
│   ✓ No overdue tasks detected                                │
│   ✓ Delete success rate above 99%                            │
└──────────────────────────────────────────────────────────────┘
```

**Critical Alert State**:

```
┌──────────────────────────────────────────────────────────────┐
│ Online Garbage Collection                         ✗ CRITICAL │
├──────────────────────────────────────────────────────────────┤
│ Status: Stalled - No progress in 4.5 hours                   │
│                                                              │
│ ✗ CRITICAL ISSUE:                                            │
│   GC has stopped processing tasks. Storage will continue     │
│   growing until resolved.                                    │
```
```
│                                                              │
│ Queue Status:                                                │
│   Blob Review Queue:     156,892 items (↑ growing)           │
│   Manifest Review Queue: 12,450 items (↑ growing)            │
│   Overdue Tasks:         169,342 (ALL overdue)               │
│                                                              │
│ Last Successful Deletion: 4 hours, 32 minutes ago            │
│                                                              │
│ [Troubleshooting Guide]                                      │
└──────────────────────────────────────────────────────────────┘
```

## Component 3: Database Health Monitoring & Alerts

### Alert Triggers

**Critical Alerts** (trigger banner):

- Primary database offline
- Database in read-only mode (blocks all writes)
- Connection pool 100% saturated for >5 minutes
- All replicas offline or severely lagging (>60s)

**Warning Alerts** (trigger banner):

- Replica lag >30 seconds
- Connection pool >90% for >15 minutes
- Single replica offline in multi-replica setup

### UI Design

**Healthy Load Balanced State**:

```
┌──────────────────────────────────────────────────────────────┐
│ Database Health                                         ✓ OK │
├──────────────────────────────────────────────────────────────┤
│ Configuration: Load Balanced (1 primary + 2 replicas)        │
│ Database Version: PostgreSQL 14.10                           │
│                                                              │
│ ┌─ Primary Database ─────────────────────────────────────┐   │
│ │ Host: registry-db-primary.example.com:5432             │   │
│ │ Status: ✓ Healthy | Role: Read-Write                   │   │
│ │ Pool: 12/25 active (48%) | Query Time: 2.3ms           │   │
│ └────────────────────────────────────────────────────────┘   │
│                                                              │
│ ┌─ Replica 1 ────────────────────────────────────────────┐   │
│ │ Host: registry-db-replica1.example.com:5432            │   │
│ │ Status: ✓ Healthy | Lag: 0.12s (current)               │   │
│ │ Pool: 6/25 active (24%) | Reads: 12.3K/min             │   │
│ └────────────────────────────────────────────────────────┘   │
│                                                              │
│ ┌─ Replica 2 ────────────────────────────────────────────┐   │
│ │ Host: registry-db-replica2.example.com:5432            │   │
│ │ Status: ✓ Healthy | Lag: 0.09s (current)               │   │
│ │ Pool: 8/25 active (32%) | Reads: 15.7K/min             │   │
│ └────────────────────────────────────────────────────────┘   │
│                                                              │
│ Load Distribution: 84.5% reads load balanced                 │
│ Config: ✓ All hosts read-write enabled                       │
└──────────────────────────────────────────────────────────────┘
```

**Critical Alert State**:

```
┌──────────────────────────────────────────────────────────────┐
│ Database Health                                   ✗ CRITICAL │
├──────────────────────────────────────────────────────────────┤
│ Configuration: Standalone Database                           │
│                                                              │
│ ✗ CRITICAL ISSUE: Database in Read-Only Mode                 │
│                                                              │
│ Impact:                                                      │
│   • All write operations blocked                             │
│   • Migrations cannot run                                    │
│   • Online GC cannot function                                │
│   • New images cannot be pushed                              │
│                                                              │
│ [Troubleshooting Guide]                                      │
└──────────────────────────────────────────────────────────────┘
```

**Replica Lag Warning**:

```
┌──────────────────────────────────────────────────────────────┐
│ Database Health                                       ⚠ WARN │
├──────────────────────────────────────────────────────────────┤
│ Configuration: Load Balanced (1 primary + 2 replicas)        │
│                                                              │
│ ⚠ Replica Lag Detected                                       │
│                                                              │
│ ┌─ Replica 2 ──────────────────────────────────────────┐     │
│ │ Host: registry-db-replica2.example.com:5432          │     │
│ │ Status: ⚠ Lagging                                    │     │
│ │ Replication Lag: 35.8s (⚠ exceeds 30s threshold)     │     │
│ │ Load Balancer: Temporarily excluded from rotation    │     │
│ │                                                      │     │
│ │ Impact: Reduced read capacity, potential staleness   │     │
│ └──────────────────────────────────────────────────────┘     │
│                                                              │
│ [Troubleshoot]                                               │
└──────────────────────────────────────────────────────────────┘
```

## Alert Notification System

### Architecture

**Banner Alert Management**:

- Monitors health check results continuously
- Creates, updates, and resolves banner alerts

**Alert Priority Queue**:

- Multiple alerts displayed in priority order (critical first, then warning)
- Maximum of 3 concurrent banners to avoid alert fatigue
- Lower-priority alerts queued until space is available

**Notification Channels** (Future):

- Email notifications for critical alerts
- Webhook integration for external systems (PagerDuty, Slack)

### Alert Content Requirements

Each alert includes:

- **Clear Status**: What is wrong and why it matters
- **Impact Statement**: What breaks if it is not resolved
- **Action Buttons**: Direct links to resolution (apply migrations, view details)
- **Severity Level**: Visual indicator (red/yellow) with icon
- **Dismiss Options**: Permanent dismissal

### Integration Points

**Existing GitLab Features**:

- Uses the existing banner notification system
- Respects admin notification preferences
- Links to relevant Admin Area sections

## Technical Architecture

### High-Level Component Overview

**Frontend (Vue.js)**:

- Health dashboard with live updates and a management panel
- Migration control interface
- Alert banner component system

**GitLab Rails (Orchestration Layer)**:

- REST API endpoints exposing registry health to the UI
- Migration control endpoints (apply, pause, resume, retry)
- Health evaluation and alert management service
- Periodic UI updates
- Background job scheduler for periodic health checks
- Proxy/orchestrator for Container Registry operations

**Container Registry**:

- Migration execution/tracking engine
- Health check endpoints exposing status (across migrations, GC, and DB health)
- Internal APIs for status/control (across migrations, GC, and DB health)

**Data Flow**:

- GitLab Rails → Container Registry API (control operations) → Container Registry → Registry Database (migration execution/status, GC status, DB health)
- GitLab Rails ← Container Registry (health status, progress updates)
- Vue UI ← GitLab Rails (aggregated health data via WebSocket/REST)

### High-Level Execution Flow

**Post-Deployment Migration Control Flow**:

1. Admin clicks the "Apply Migration" button in the GitLab UI (Vue)
2. Request is sent to a GitLab Rails API endpoint
3. Rails validates admin permissions and registry connectivity
4. Rails queues a Sidekiq job to execute the migration
5. The Sidekiq job calls the **Container Registry API**: `POST /gitlab/v1/admin/database/post_deployment_migrations/apply`
6. The **Container Registry** starts executing the migration against its own database
7. The Registry returns a job ID or migration handle
8. Rails polls the Registry API: `GET /gitlab/v1/admin/database/post_deployment_migrations/:id/status`
9. The Registry returns progress status
10. Rails broadcasts progress updates to the UI via WebSocket
11. On completion, the same status endpoint returns the final success/failure status
12. Rails resolves the associated alert, if one exists, when the migration succeeds

**Background Migration Control Flow**:

1. Admin clicks a control button (pause/resume/retry) in the GitLab UI
2. Request is sent to a GitLab Rails control endpoint
3. Rails validates permissions
4. Rails calls the **Container Registry API**: `POST /gitlab/v1/admin/database/background_migrations/:id/pause`
5. The **Container Registry** updates the migration status in its own database (`batched_background_migrations` table)
6. Registry background workers detect the status change and respond accordingly
7. The Registry API returns the updated status immediately
8. Rails broadcasts state changes to the UI via WebSocket
9. Rails continues polling `GET /gitlab/v1/admin/database/background_migrations/:id` for progress/status updates (when needed)

**Health Check Flow**:

1. A GitLab Sidekiq job runs every 30 seconds (in the background)
2. The job calls the **Container Registry health API endpoints**:
   - `GET /gitlab/v1/admin/health` - overall health
   - `GET /gitlab/v1/admin/database/health` - database health & load balancing
   - `GET /gitlab/v1/admin/migrations/status` - overall migration status
   - `GET /gitlab/v1/admin/database/post_deployment_migrations/status` - post-deployment migration health
   - `GET /gitlab/v1/admin/database/background_migrations/status` - background migration health
   - `GET /gitlab/v1/admin/gc/health` - garbage collection health
3. The **Container Registry** queries its own database and returns status
4. Rails evaluates the responses against alert thresholds
5. Rails creates, updates, or resolves alerts in the GitLab database
6. Rails broadcasts alert changes to connected UI clients via WebSocket
7. The UI receives the updates and displays banner alerts

### Container Registry API Endpoints (New)

**Migration Management**:

- `POST /gitlab/v1/admin/database/post_deployment_migrations/apply` - Apply pending post-deployment migrations
- `GET /gitlab/v1/admin/database/migrations/:id/status` - Get migration execution progress
- `GET /gitlab/v1/admin/migrations/status` - Get status of all migrations (pre/post/background)

**Background Migration Control**:

- `POST /gitlab/v1/admin/database/background_migrations/:id/pause` - Pause a running migration
- `POST /gitlab/v1/admin/database/background_migrations/:id/resume` - Resume a paused migration
- `POST /gitlab/v1/admin/database/background_migrations/:id/retry` - Retry a failed migration
- `GET /gitlab/v1/admin/database/background_migrations` - List all background migrations
- `GET /gitlab/v1/admin/database/background_migrations/:id` - Get details for a specific migration

**Health Monitoring**:

- `GET /gitlab/v1/admin/health` - Overall registry health
- `GET /gitlab/v1/admin/database/health` - Database health with load balancing details
- `GET /gitlab/v1/admin/gc/health` - Garbage collection health and queue status

## Security Considerations

**Action Authorization**:

- Admin-only access for all control actions
- Confirmation required for destructive operations

## Success Criteria

### Quantitative Goals

- **90% of migration management** done via the UI (vs. CLI or direct database queries)
- **Zero upgrade failures** due to incomplete migrations after alert implementation
- A measurable **reduction** in registry support tickets related to GC or the database

### Qualitative Goals

- Admins feel confident managing registry health without CLI access
- Clear upgrade readiness status prevents surprise failures
- Background migration issues are resolved before impacting users
- GC and database problems are detected proactively
- Self-service troubleshooting reduces the support burden

## References

### Primary Sources

- [Issue #444201: Health Check Alert for Missing Post-Deployment Migrations](https://gitlab.com/gitlab-org/gitlab/-/issues/444201)
- [Epic #12534: Alert User when Online GC is not Working Properly](https://gitlab.com/groups/gitlab-org/-/epics/12534)

### Documentation

- [Container Registry Metadata Database](https://docs.gitlab.com/administration/packages/container_registry_metadata_database/)
- [Container Registry Database Load Balancing](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/spec/gitlab/database-load-balancing.md)
- [GitLab Database Load Balancing](https://docs.gitlab.com/administration/postgresql/database_load_balancing/)
- [Container Registry Configuration](https://docs.gitlab.com/charts/charts/registry/)