Bug: Investigate Apdex SLO violation for manifest read requests
Problem
The Apdex SLO was violated for manifest read (HEAD or GET) requests in the registry, with Apdex dropping as low as 97.9%.
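For readers unfamiliar with the metric, here is a minimal sketch of the standard Apdex formula, (satisfied + tolerating / 2) / total. The thresholds and the sample latencies are hypothetical, chosen only to illustrate how a score of 97.9% can arise. They are not the registry's configured thresholds or measured data, and the registry SLI may be computed differently.

```python
def apdex(durations_s, satisfied_s, tolerated_s):
    """Standard Apdex score: (satisfied + tolerating / 2) / total."""
    if not durations_s:
        raise ValueError("at least one sample is required")
    # Requests at or below the satisfied threshold count fully.
    satisfied = sum(1 for d in durations_s if d <= satisfied_s)
    # Requests between the two thresholds count as half.
    tolerating = sum(1 for d in durations_s if satisfied_s < d <= tolerated_s)
    return (satisfied + tolerating / 2) / len(durations_s)


# Hypothetical sample: 970 fast, 18 tolerable, and 12 slow requests.
samples = [0.05] * 970 + [0.8] * 18 + [3.0] * 12
# Thresholds are assumptions for illustration only.
score = apdex(samples, satisfied_s=0.5, tolerated_s=2.0)
print(f"Apdex: {score:.3f}")
```

In this illustrative case only 3% of requests miss the satisfied threshold, which shows how a small share of slow manifest reads is enough to breach a tight objective.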
Impact
Manifest read requests for the container registry were slower than expected, temporarily breaching the service level objective for user experience.
Key Discovery
The issue resolves immediately upon deployment, indicating this is an application-level bug rather than infrastructure-related. Analysis of the 30-day dashboard view shows:
- SLI starts degrading rapidly when regular deployments are paused (e.g., weekends)
- SLI recovers as soon as deployments resume
- This pattern strongly suggests an ongoing source code bug, introduced around the start of December, whose symptoms improve drastically (though possibly not completely) whenever the application restarts
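One way to check the pattern above against exported dashboard data is to bucket Apdex samples by time since the last deployment and see whether the mean falls from bucket to bucket. This is a sketch only: the function names, the 12-hour bucket size, and the sample values are all hypothetical and do not come from the dashboards.

```python
from collections import defaultdict


def apdex_by_uptime(samples, bucket_hours=12):
    """Group (hours_since_deploy, apdex) pairs into uptime buckets.

    Returns {bucket_start_hour: mean_apdex}, ordered by bucket start.
    """
    buckets = defaultdict(list)
    for hours, score in samples:
        start = int(hours // bucket_hours) * bucket_hours
        buckets[start].append(score)
    return {start: sum(v) / len(v) for start, v in sorted(buckets.items())}


def degrades_with_uptime(bucketed):
    """True if mean Apdex strictly decreases from each bucket to the next."""
    means = list(bucketed.values())
    return len(means) >= 2 and all(a > b for a, b in zip(means, means[1:]))


# Hypothetical samples: (hours since last deployment, Apdex).
data = [(1, 0.999), (6, 0.998), (14, 0.995), (20, 0.993), (30, 0.985), (40, 0.980)]
bucketed = apdex_by_uptime(data)
print(bucketed, degrades_with_uptime(bucketed))
```

A consistent decline across buckets would support the theory that degradation is tied to process uptime rather than to traffic or infrastructure. A flat or noisy result would point elsewhere.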
Investigation Details
- Service: registry
- SLI: server_route_manifest_reads
- Alert Name: RegistryServiceServerRouteManifestReadsApdexSLOViolationRegional
- Observed Symptoms: Two periods of increased database connection saturation and queuing
- Performance Recovery: Automatic recovery without manual intervention (upon deployment)
Related Resources
- Grafana Dashboard: https://dashboards.gitlab.net/d/registry-main/r...
- 30-day Dashboard View: https://dashboards.gitlab.net/goto/af7urma0c8w00c?orgId=1
- Slack Discussion: https://gitlab.slack.com/archives/CRD4A8HG8/p1766061850388009
- Previously Investigated Issue (likely separate): #1785 (closed)
Next Steps
- Review commits and changes introduced in early December
- Analyze database connection handling and query patterns
- Investigate potential memory leaks or resource accumulation over time
- Check for any changes to manifest read request handling logic
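For the resource accumulation step, one simple approach is to fit a least-squares line to a resource metric sampled between restarts and inspect the slope. The sketch below assumes the metric (for example, open database connections) has been exported as (hours since restart, value) pairs. The function name and the data points are hypothetical.

```python
def growth_per_hour(points):
    """Least-squares slope of (hours_since_restart, value) points."""
    n = len(points)
    if n < 2:
        raise ValueError("at least two points are required")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all points share the same timestamp")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Hypothetical samples: (hours since restart, open connections).
connections = [(0, 10), (12, 22), (24, 34), (36, 46)]
slope = growth_per_hour(connections)
print(f"growth: {slope:.2f} connections/hour")
```

A slope that stays positive across several restart cycles and resets after each deployment would be consistent with a leak or accumulation. A slope near zero would suggest looking at query patterns instead.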
Edited by João Pereira