# Secure Data Consistency Worker
## Summary
Implement a modular data consistency framework for the Security domain that can detect and remediate data inconsistencies through:
1. **Application-wide cron schedule** - Periodic consistency checks across all data
2. **Project-triggered execution** - Targeted checks on projects with active pipelines (with cooldown cycle to prevent thrashing)
This is critical infrastructure for the Vulnerabilities Across Contexts (VAC) closed beta, providing an alternative to batched background migrations (BBMs) for ensuring data integrity during gradual feature rollout.
## Why This Matters for VAC
BBMs have been identified as a **critical delay risk** for VAC closed beta delivery. The consistency worker provides a path forward:
- **No migration dependency**: Can fix data incrementally without waiting for BBMs to complete
- **Ongoing integrity**: Continues to maintain data consistency as VAC rolls out to more customers
- **Tracked context safety net**: Ensures `security_project_tracked_contexts` relationships stay consistent with vulnerability data
- **Gradual remediation**: Fixes data at a sustainable pace rather than requiring large-scale migrations
---
## Architecture Decisions (2026-02-26)
Following a pairing session between @ghavenga and @bwill, we've refined the architecture:
### Key Realizations
1. **Initial BBO MRs (!223260, !223391) need rework** - They iterate individual tables, but should iterate the `projects` table and launch an orchestrator per project
2. **Two entry points, same check modules:**
- **Pipeline trigger** → `Orchestrator.ensure_consistency_on_project!(project)` (Brian's MR focus)
- **BBO sweep** → Iterates projects, checks recency, calls same `Orchestrator` per project (Gregory's follow-up)
### BBO Sweep Design
- Iterates the **projects** table, NOT individual security tables
- Checks Redis recency key before dispatching (skip recently-checked projects)
- **Dual health check**: Main DB (for project iteration) + Sec DB (for check pressure)
### Worker Separation Strategy
| Check Type | Approach |
|------------|----------|
| Small/simple (e.g., PS-FLAG) | Check + fix in same worker |
| Table enumeration (e.g., VR-VO) | Separate Check worker from Fix worker to avoid DB connection stickiness |
Note: The check and fix can still live in the same class architecturally; they just execute in different workers.
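The split above can be sketched as a single check class whose detection and remediation phases are invoked separately, one per worker. This is an illustrative sketch on in-memory rows, not the actual implementation — `VrVoValidityCheck` and its data shapes are hypothetical:

```ruby
# Sketch of a check class whose two phases run in different workers
# (all names and data shapes hypothetical).

class VrVoValidityCheck
  # Check phase (read-only, could run against a replica):
  # collect IDs of reads whose occurrence reference is invalid.
  def inconsistent_ids(reads, valid_occurrence_ids)
    reads.reject { |r| valid_occurrence_ids.include?(r[:vulnerability_occurrence_id]) }
         .map { |r| r[:id] }
  end

  # Fix phase (separate worker invocation, writes to the primary):
  # clear the invalid reference, which would be a batched UPDATE in practice.
  def fix!(reads, ids)
    reads.each { |r| r[:vulnerability_occurrence_id] = nil if ids.include?(r[:id]) }
  end
end
```

The check worker would enqueue the fix worker with the collected IDs, so no connection is held across both phases.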
### Table Iteration Efficiency
- If multiple checks need the same table (e.g., `vulnerability_reads`), enumerate once and run all relevant checks in that pass
- Reduces database thrashing
### Current MR Status
| MR | Purpose | Status |
|----|---------|--------|
| !224151 | @bwill's project-level orchestrator | **DRAFT** - focus on pipeline-trigger MVP |
| !223260 | VR-VO BBO check | **MERGED** - needs rework to new architecture |
| !223391 | TC-DUP BBO check | **OPEN** - needs rework to new architecture |
### Division of Work
- **@bwill**: Implement pipeline-triggered project-level orchestrator (!224151)
- **@ghavenga**: Rework !223260/!223391 to become check modules, then implement BBO sweep that calls orchestrator per project
---
## Unified Architecture: Project-Level + BBO-Level
### Two Complementary Execution Modes
The consistency worker framework operates at two levels that work together:
| Level | Trigger | Scope | Purpose |
|-------|---------|-------|---------|
| **Project-Level** | Post-ingestion (after security scan completes) | Single project | Immediate consistency after data changes |
| **BBO-Level** | Cron schedule (configurable interval, e.g., daily/weekly) | Application-wide sweep | Catch-all safety net, handles drift |
### Coordination: Skip Recently-Checked Projects
To avoid redundant work, the two levels coordinate via Redis:
1. **Project-level execution** sets a Redis key with TTL matching BBO sweep interval:
```
Key: consistency_check:project:{project_id}:last_run
Value: timestamp
TTL: Configurable (default: 24-48 hours)
```
2. **BBO sweep** checks for this key before processing each project:
- If key exists → skip project (recently checked)
- If key missing → run checks on project
3. **TTL is configurable** via application settings to tune check frequency as we learn optimal thresholds
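The coordination above can be sketched as follows. The key format and TTL semantics come from the description; the class itself is hypothetical, and Redis is stubbed with a hash (a real implementation would use a Redis `SET` with an `EX` expiry):

```ruby
# Sketch of the recency coordination between the two execution levels
# (Redis stubbed with a Hash; class name hypothetical).

class RecencyTracker
  DEFAULT_TTL = 24 * 60 * 60 # seconds; configurable in the real system

  def initialize(clock: -> { Time.now.to_i })
    @store = {} # key => expiry epoch
    @clock = clock
  end

  def key_for(project_id)
    "consistency_check:project:#{project_id}:last_run"
  end

  # Called by the project-level orchestrator after it runs.
  def record_run!(project_id, ttl: DEFAULT_TTL)
    @store[key_for(project_id)] = @clock.call + ttl
  end

  # Called by the BBO sweep before dispatching a project.
  def recently_checked?(project_id)
    expiry = @store[key_for(project_id)]
    !expiry.nil? && expiry > @clock.call
  end
end
```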
### Shared Check Implementation
Both levels use the same check modules under `Vulnerabilities::ConsistencyChecks`:
```
ee/lib/vulnerabilities/consistency_checks/
├── registry.rb # Shared check list
├── base_check.rb # Common interface: execute, consistent?, fix!
├── orchestrator.rb # Project-level orchestrator (post-ingestion)
├── sweep_orchestrator.rb # BBO-level sweep orchestrator
├── checks/
│ ├── vr_vo_validity_check.rb
│ ├── has_vulnerabilities_check.rb
│ ├── duplicate_default_contexts_check.rb
│ └── orphaned_context_refs_check.rb
ee/app/workers/vulnerabilities/
├── consistency_check_worker.rb # Per-project async worker
├── consistency_sweep_worker.rb # BBO cron worker
```
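The `base_check.rb` interface named in the tree (`execute`, `consistent?`, `fix!`) might look roughly like this. The method names come from the listing above; the body, the template-method pattern, and the `HasVulnerabilitiesCheck` subclass on an in-memory hash are illustrative assumptions:

```ruby
# Hypothetical sketch of the base_check.rb interface described above.
class BaseCheck
  def initialize(project)
    @project = project
  end

  # Template method: detect first, remediate only when needed.
  def execute
    return :consistent if consistent?

    fix!
    :fixed
  end

  def consistent?
    raise NotImplementedError
  end

  def fix!
    raise NotImplementedError
  end
end

# Example subclass: a PS-FLAG style check, here on an in-memory hash.
class HasVulnerabilitiesCheck < BaseCheck
  def consistent?
    @project[:has_vulnerabilities] == @project[:vulnerability_count].positive?
  end

  def fix!
    @project[:has_vulnerabilities] = @project[:vulnerability_count].positive?
  end
end
```

Both orchestrators would instantiate the same subclasses from `registry.rb`, differing only in how they select projects.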
### Related Work
- **!224151** - @bwill's project-level orchestrator framework (Draft)
- **!223260** - VR-VO consistency check using BBO framework (@ghavenga) - MERGED, needs rework
- **!223391** - Duplicate default contexts check (@ghavenga) - needs rework
---
## Phase 1: Consistency Checks for Closed Beta
### High Priority (Must Have)
| ID | Check | Remediation | Notes |
|----|-------|-------------|-------|
| **VR-VO** | `vulnerability_reads` has valid `vulnerability_occurrence_id` | Set from `vulnerabilities.finding_id` or clear if invalid | Critical - occurrences are becoming the true representative model for reads |
| **TC-DUP** | Duplicate default tracked contexts per project | Collapse into one matching project's actual default branch, update all refs | Known issue; !221673 has a BBM, but we need a faster fix |
| **TC-ORPHAN** | Orphaned `security_project_tracked_context_id` references | Clear the reference (set NULL) | Cleanup should have run first; no FK on these columns |
| **PS-FLAG** | `has_vulnerabilities` flag doesn't match actual vuln count | Update flag to match reality | Many BBMs and important code rely on this |
| **VO-ID-FILL** | Tables with `vulnerability_occurrence_id` column not fully populated | Backfill from `vulnerabilities.finding_id` | See table below |
### Medium Priority (Good to Have)
| ID | Check | Remediation | Notes |
|----|-------|-------------|-------|
| **VR-ORPHAN** | `vulnerability_reads` without corresponding `vulnerability_occurrence` | Delete orphaned read | Helps transition from DB trigger to app-level management |
| **VO-CONTEXT** | Finding project doesn't match tracked context project | Log and alert (data corruption) | Important but unlikely; security isolation check |
| **VR-CONTEXT** | Read project doesn't match tracked context project | Log and alert (data corruption) | Safety check, unlikely but good defense in depth |
### New Checks (2026-02-26)
| ID | Check | Remediation | Issue |
|----|-------|-------------|-------|
| **ES-REF** | ES vulnerability documents reference valid `vulnerability_reads` | Verify ES `_id` maps to existing `vulnerability_reads.id` | #591621 |
| **TRIGGER-PARITY** | DB trigger output matches app-level computation | Compare `insert_vulnerability_reads_from_vulnerability` trigger vs app logic for safe deprecation | #591623 |
### Deprioritized (Future)
| ID | Check | Notes |
|----|-------|-------|
| **V-FINDING** | Dangling `finding_id` on vulnerabilities | Column will be removed long-term; occurrences will belong to vulnerabilities |
---
## Tables with `vulnerability_occurrence_id` (VAC Additions)
These tables had `vulnerability_occurrence_id` added as part of VAC and need the consistency check:
| Table | Backfill Status | Notes |
|-------|-----------------|-------|
| `vulnerability_reads` | ✅ Complete | Unique index being added |
| `vulnerability_issue_links` | ✅ Complete | |
| `vulnerability_merge_request_links` | ✅ Complete | |
| `vulnerability_external_issue_links` | ✅ Complete | |
| `vulnerability_severity_overrides` | ✅ Complete | |
| `vulnerability_state_transitions` | ✅ Complete | |
| `vulnerability_representation_information` | ✅ Complete | |
| `vulnerability_user_mentions` | ❌ **NO BACKFILL** | Needs migration |
| `sbom_occurrences_vulnerabilities` | ❌ **NO BACKFILL** | Needs migration |
| `vulnerability_detection_transitions` | ✅ Complete | New table, NOT NULL constraint |
**Consistency check should verify:** All non-NULL `vulnerability_occurrence_id` values reference valid `vulnerability_occurrences` records.
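That verification can be sketched as a small predicate over in-memory rows — NULL values pass (they are the backfill's job), non-NULL values must resolve to an existing occurrence. The function name and data shapes are illustrative:

```ruby
require 'set'

# Sketch of the VO-ID-FILL style verification (hypothetical helper):
# every non-NULL vulnerability_occurrence_id must reference an
# existing vulnerability_occurrences record.
def invalid_occurrence_refs(rows, occurrence_ids)
  valid = occurrence_ids.to_set
  rows.select do |row|
    oid = row[:vulnerability_occurrence_id]
    !oid.nil? && !valid.include?(oid)
  end.map { |row| row[:id] }
end
```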
---
## Duplicate Default Context Remediation
Related: #582172, !221673 (BBM by @srushik)
### Detection
```sql
SELECT project_id, COUNT(*)
FROM security_project_tracked_contexts
WHERE is_default = true
GROUP BY project_id
HAVING COUNT(*) > 1;
```
### Remediation Approach
1. Find the context whose `context_name` matches project's actual `default_branch`
2. If none match, pick the oldest one and update its `context_name`
3. Update all references in large tables (`vulnerability_occurrences`, `vulnerability_reads`) using batched updates
4. Update references in small tables (`vulnerability_statistics`, `vulnerability_historical_statistics`) directly
5. Delete the duplicate context records
6. Set `is_default = true` only on the surviving context
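Steps 1, 2, 5, and 6 can be sketched on in-memory context rows. This is a simplification: the large-table reference updates (steps 3-4) are elided, and the function and row shapes are hypothetical:

```ruby
# Sketch of the duplicate-default remediation (illustrative only; the
# batched reference updates of steps 3-4 are elided).
def remediate_duplicate_defaults(contexts, default_branch)
  defaults = contexts.select { |c| c[:is_default] }
  return contexts if defaults.size <= 1

  # Steps 1-2: prefer the context matching the real default branch;
  # otherwise keep the oldest and rename it.
  survivor = defaults.find { |c| c[:context_name] == default_branch } ||
             defaults.min_by { |c| c[:id] }
  survivor[:context_name] = default_branch

  # Step 5: drop the duplicate defaults; step 6: only survivor stays default.
  contexts.reject { |c| c[:is_default] && !c.equal?(survivor) }
          .each { |c| c[:is_default] = c.equal?(survivor) }
end
```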
---
## Architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│ Consistency Worker Framework │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Triggers: │
│ ┌──────────────────┐ ┌─────────────────────────────────────┐ │
│ │ Cron Schedule │ │ Pipeline Completion (with cooldown) │ │
│ │ (application- │ │ (project-scoped, rate-limited) │ │
│ │ wide sweep) │ │ │ │
│ └────────┬─────────┘ └──────────────┬──────────────────────┘ │
│ │ │ │
│ └──────────────┬───────────────┘ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Check Modules │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Check A │ │ Check B │ │ Check C │ ... │ │
│ │ │ (bundled │ │ (bundled │ │ │ │ │
│ │ │ w/ Check D)│ │ w/ Check E)│ │ │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ │ │ │
│ │ Checks can be bundled to reduce repeated table iterations │ │
│ │ Each check logs: record IDs, ownership (project/namespace) │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │ │
│ │ Inconsistencies detected │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Remediation Queue │ │
│ │ (Separate worker to prevent DB connection stickiness) │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Remediation │ │ Remediation │ │ Remediation │ ... │ │
│ │ │ Module A │ │ Module B │ │ Module C │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ │ │ │
│ │ Each remediation logs: fix applied, record IDs, ownership │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
```
## Design Principles
1. **Modular checks**: Each consistency check is a self-contained module that can be enabled/disabled independently
2. **Bundled iterations**: Checks that scan the same tables can be bundled together to reduce redundant table scans
3. **Separate remediation workers**: Fixes run in separate workers to prevent check queries (which can use replicas) from holding connections that then write to the primary
4. **Cooldown cycles**: Project-triggered checks have rate limiting to prevent pipeline-heavy projects from overwhelming the system
5. **Read replica safe**: Check queries designed to run against read replicas where possible
## Logging Requirements
> **Note:** Detection and remediation logs could become very dense at scale. We may need to implement batched/summarized logging (e.g., "X records with inconsistency Y in project Z") rather than individual record logs. Design should accommodate this from the start.
### Consistency Checks
When a check detects inconsistent records, it must log:
- Record IDs (primary key)
- Ownership context (project_id, namespace_id)
- Check module name
- Description of the inconsistency detected
- Timestamp
Example (batched):
```
[SecureDataConsistency] Check=TrackedContextConsistency inconsistencies_detected
record_type=VulnerabilityRead count=847 project_id=278964 namespace_id=9970
record_ids=[12345,12346,12347,...] issue="security_project_tracked_context_id references non-existent context"
```
### Remediations
When a remediation fixes records, it must log:
- Record IDs affected
- Ownership context (project_id, namespace_id)
- Remediation module name
- Fix applied (before/after or action taken)
- Timestamp
Example (batched):
```
[SecureDataConsistency] Remediation=TrackedContextConsistency fixes_applied
record_type=VulnerabilityRead count=847 project_id=278964 namespace_id=9970
record_ids=[12345,12346,12347,...] action="cleared orphaned security_project_tracked_context_id"
```
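A helper producing the batched format above might look like this. The field layout follows the two examples; the helper itself is a sketch (in practice this would go through a structured JSON logger), and the truncation of `record_ids` to a bounded prefix is an assumption to keep log lines from growing unboundedly:

```ruby
# Sketch of a batched consistency log line (hypothetical helper; the
# real implementation would use structured logging).
def batched_consistency_log(kind:, check:, record_type:, record_ids:,
                            project_id:, namespace_id:, detail:)
  "[SecureDataConsistency] #{kind}=#{check} " \
    "record_type=#{record_type} count=#{record_ids.size} " \
    "project_id=#{project_id} namespace_id=#{namespace_id} " \
    "record_ids=#{record_ids.first(100).inspect} " \
    "detail=#{detail.inspect}" # cap IDs at 100 per line to bound log size
end
```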
## Feature Flag Strategy
### Framework-Level Flag
| Flag | Purpose | Initial State |
|------|---------|---------------|
| `secure_data_consistency_worker` | Master switch for entire framework | Disabled |
| `secure_data_consistency_worker_actor` | Actor-based targeting for closed beta projects | Disabled |
### Per-Check Flags
| Flag | Check Module | Initial State |
|------|--------------|---------------|
| `secure_consistency_check_vr_occurrence` | VulnerabilityReadsOccurrence | Disabled |
| `secure_consistency_check_duplicate_contexts` | DuplicateDefaultContexts | Disabled |
| `secure_consistency_check_orphaned_contexts` | OrphanedTrackedContextRefs | Disabled |
| `secure_consistency_check_has_vulnerabilities` | HasVulnerabilitiesFlag | Disabled |
| `secure_consistency_check_occurrence_id_fill` | OccurrenceIdBackfill | Disabled |
### Per-Remediation Flags
| Flag | Remediation Module | Initial State |
|------|-------------------|---------------|
| `secure_consistency_fix_vr_occurrence` | VulnerabilityReadsOccurrence | Disabled |
| `secure_consistency_fix_duplicate_contexts` | DuplicateDefaultContexts | Disabled |
| `secure_consistency_fix_orphaned_contexts` | OrphanedTrackedContextRefs | Disabled |
| `secure_consistency_fix_has_vulnerabilities` | HasVulnerabilitiesFlag | Disabled |
| `secure_consistency_fix_occurrence_id_fill` | OccurrenceIdBackfill | Disabled |
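The three-level gating (framework flag, then per-check flag, then per-remediation flag) can be sketched as a small decision function. Flag names come from the tables above; the `GatedCheckRunner` class is hypothetical, with the flag store stubbed as a list standing in for `Feature.enabled?` checks:

```ruby
# Sketch of the framework -> check -> remediation flag cascade
# (hypothetical class; flag lookups stubbed with a list).
class GatedCheckRunner
  def initialize(enabled_flags)
    @flags = enabled_flags
  end

  def on?(flag)
    @flags.include?(flag)
  end

  # What the runner would do for a given check/fix flag pair.
  def action_for(check_flag, fix_flag)
    return :skip unless on?(:secure_data_consistency_worker) && on?(check_flag)
    return :check_only unless on?(fix_flag)

    :check_and_fix
  end
end
```

The `:check_only` path is what Phases 1-2 of the rollout rely on: detection and logging run while every remediation flag stays disabled.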
### Rollout Strategy
1. **Phase 1 - Closed Beta**: Enable framework + checks only (no remediations) on targeted projects via actor flag
2. **Phase 2 - Validate Logging**: Review logs to confirm checks are detecting expected inconsistencies
3. **Phase 3 - Enable Remediations**: Enable remediation flags one at a time on closed beta projects
4. **Phase 4 - Gradual Rollout**: Increase actor percentage, eventually enable globally
## Execution Modes
### 1. Cron Schedule (Application-Wide)
- Runs on low-priority Sidekiq queue
- Iterates through all records in batches (BBM-style cursor tracking)
- Designed for overnight/low-traffic execution
- Full sweep of all check modules
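The BBM-style cursor tracking mentioned above can be sketched as follows: process IDs in fixed-size batches and persist the high-water mark between runs, so an interrupted sweep resumes where it left off. The class is hypothetical and the ID scan is simulated with an array:

```ruby
# Sketch of cursor-tracked batch iteration for the cron sweep
# (hypothetical class; `project_ids` stands in for an ordered,
# indexed ID scan, and the cursor would be persisted between runs).
class SweepCursor
  attr_reader :cursor

  def initialize(batch_size: 2, cursor: 0)
    @batch_size = batch_size
    @cursor = cursor
  end

  def next_batch(project_ids)
    batch = project_ids.select { |id| id > @cursor }.first(@batch_size)
    @cursor = batch.last unless batch.empty?
    batch
  end
end
```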
### 2. Pipeline-Triggered (Project-Scoped)
- Triggered after security scan ingestion completes
- Scoped to single project
- Cooldown period (e.g., max once per hour per project)
- Only runs VAC-critical checks
- Ensures data consistency immediately after ingestion
## Success Criteria
- [ ] Framework supports adding new check modules without code changes to core worker
- [ ] Checks and remediations are decoupled (checks don't hold DB connections during fixes)
- [ ] Feature flags at framework, check, and remediation levels
- [ ] Actor-based targeting for closed beta projects
- [ ] Comprehensive logging with record IDs and ownership for debugging
- [ ] Batched logging support for high-volume scenarios
- [ ] Metrics for inconsistencies detected and fixed
- [ ] No impact on Apdex when running during normal traffic
## Historical Context
This was originally proposed in https://gitlab.com/gitlab-org/gitlab/-/issues/494257 (September 2024) as a general data consistency safety net. At that time, the consensus was that it was lower priority than improving underlying system reliability.
The context has changed significantly with VAC:
- BBMs are now a critical path blocker for closed beta
- We need ongoing data consistency guarantees as VAC rolls out incrementally
- Tracked context relationships introduce new consistency requirements that benefit from continuous validation