Bug: Investigate Apdex SLO violation for manifest read requests

Problem

The Apdex SLO for manifest read (HEAD or GET) requests in the registry was violated, with Apdex dropping as low as 97.9%.

Impact

Manifest read requests for the container registry were slower than expected, temporarily breaching the service level objective for user experience.

Key Discovery

The issue resolves immediately after each deployment, which points to an application-level bug rather than an infrastructure problem. Analysis of the 30-day dashboard view shows:

  • SLI starts degrading rapidly when regular deployments are paused (e.g., weekends)
  • SLI recovers as soon as deployments resume
  • This pattern strongly suggests a source code bug, introduced around the start of December, whose symptoms clear (or at least improve drastically) whenever the application restarts

Investigation Details

  • Service: registry
  • SLI: server_route_manifest_reads
  • Alert Name: RegistryServiceServerRouteManifestReadsApdexSLOViolationRegional
  • Observed Symptoms: Two periods of increased database connection saturation and queuing
  • Performance Recovery: Automatic recovery without manual intervention (upon deployment)

Next Steps

  1. Review commits and changes introduced in early December
  2. Analyze database connection handling and query patterns
  3. Investigate potential memory leaks or resource accumulation over time
  4. Check for any changes to manifest read request handling logic
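For step 3, a rough first check is to compare goroutine counts and heap usage shortly after startup with the same values later in the process lifetime. The sketch below uses only the Go `runtime` package. The `snapshot` type, the `grew` helper, and the growth factor are hypothetical, and the simulated leak in `main` is not taken from the registry code.

```go
package main

import (
	"fmt"
	"runtime"
)

// snapshot captures two coarse indicators of resource accumulation.
type snapshot struct {
	goroutines int
	heapInuse  uint64 // bytes in in-use heap spans
}

// take records the current goroutine count and heap usage.
func take() snapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return snapshot{goroutines: runtime.NumGoroutine(), heapInuse: m.HeapInuse}
}

// grew reports whether cur exceeds base by more than factor on either measure.
func grew(base, cur snapshot, factor float64) bool {
	return float64(cur.goroutines) > factor*float64(base.goroutines) ||
		float64(cur.heapInuse) > factor*float64(base.heapInuse)
}

func main() {
	base := take()

	// Simulate a leak: goroutines blocked forever on a channel nobody closes.
	block := make(chan struct{})
	for i := 0; i < 100; i++ {
		go func() { <-block }()
	}

	cur := take()
	fmt.Printf("goroutines: %d -> %d, suspicious growth: %v\n",
		base.goroutines, cur.goroutines, grew(base, cur, 2.0))
}
```

In a real investigation, heap and goroutine profiles from a long-running process and a freshly deployed one would give a more precise picture than these two counters.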
Edited by João Pereira