Add Prometheus metrics for Secret Detection partner token verification
What does this MR do and why?
This MR adds comprehensive Prometheus metrics for the Secret Detection partner token verification system to improve observability and enable proactive monitoring of external API integrations.
Problem: Currently, when GitLab verifies tokens with external partner APIs (AWS, GCP, Postman), we have limited visibility into:
- API response times and performance degradation
- Error rates and failure patterns
- Network connectivity issues
- Rate limiting behavior
This lack of observability makes it difficult to:
- Detect and diagnose issues before they impact users
- Understand which partners have reliability problems
- Optimize rate limiting configurations
- Provide SLOs for the feature
Solution: Implement four Prometheus metrics that track:
- API Duration - Response time histogram to identify latency issues
- API Requests - Success/failure counters with error classification
- Network Errors - Detailed error tracking by type
- Rate Limit Hits - Project-level rate limit monitoring
Related Issues
https://gitlab.com/gitlab-org/gitlab/-/issues/567735
Implementation Details
New Metrics Module
Created the Gitlab::Metrics::SecretDetection::PartnerTokens module with four metrics:
validity_check_partner_api_duration_seconds (Histogram)
Labels: partner
Buckets: [0.1, 0.25, 0.5, 1, 2, 5, 10]
validity_check_partner_api_requests_total (Counter)
Labels: partner, status, error_type
validity_check_network_errors_total (Counter)
Labels: partner, error_class
validity_check_rate_limit_hits_total (Counter)
Labels: limit_type, project_id
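The metric and label layout above can be sketched with in-memory stand-ins. This is only an illustration: SketchCounter, SketchHistogram, and PartnerTokenMetricsSketch are hypothetical names, and the real module presumably builds on GitLab's existing Prometheus wrappers rather than these classes.

```ruby
# Illustrative in-memory stand-ins; not the classes used in this MR.
class SketchCounter
  attr_reader :values

  def initialize(label_names)
    @label_names = label_names.sort
    @values = Hash.new(0)
  end

  # Increment the series identified by the given label set.
  def increment(labels)
    raise ArgumentError, "expected labels #{@label_names.inspect}" unless labels.keys.sort == @label_names

    @values[labels] += 1
  end
end

class SketchHistogram
  attr_reader :bucket_counts, :sums, :counts

  def initialize(label_names, buckets)
    @label_names = label_names.sort
    @buckets = buckets + [Float::INFINITY]
    @bucket_counts = Hash.new { |hash, key| hash[key] = Hash.new(0) }
    @sums = Hash.new(0.0)
    @counts = Hash.new(0)
  end

  # Record one observation; buckets are cumulative, as in Prometheus.
  def observe(labels, value)
    raise ArgumentError, "expected labels #{@label_names.inspect}" unless labels.keys.sort == @label_names

    @buckets.each { |le| @bucket_counts[labels][le] += 1 if value <= le }
    @sums[labels] += value
    @counts[labels] += 1
  end
end

module PartnerTokenMetricsSketch
  DURATION_BUCKETS = [0.1, 0.25, 0.5, 1, 2, 5, 10].freeze

  API_DURATION = SketchHistogram.new(%i[partner], DURATION_BUCKETS)
  API_REQUESTS = SketchCounter.new(%i[partner status error_type])
  NETWORK_ERRORS = SketchCounter.new(%i[partner error_class])
  # The sample output in this description also carries project_id on this metric.
  RATE_LIMIT_HITS = SketchCounter.new(%i[limit_type project_id])
end

PartnerTokenMetricsSketch::API_DURATION.observe({ partner: 'aws' }, 0.3)
PartnerTokenMetricsSketch::API_REQUESTS.increment({ partner: 'aws', status: 'success', error_type: 'none' })
```

Rejecting unexpected label sets in the sketch mirrors the intent of keeping each metric's label set fixed, which is what keeps cardinality predictable.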
Integration Points
- BaseClient - Records metrics for all partner API calls
  - Duration tracking for complete verification cycle
  - Success/failure tracking with error classification
  - Network error categorization
- PartnerTokensClient - Records rate limit hits
  - Per-project rate limit tracking
  - Detailed rate limit type identification
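The BaseClient instrumentation described above could look roughly like the following wrapper. It is a sketch under assumptions, not the code in this MR: with_partner_metrics, error_type_for, RateLimitedSketchError, and ListRecorder are hypothetical names, and the choice of which exception classes count as network errors is illustrative.

```ruby
require 'socket'
require 'timeout'

# Hypothetical error raised when a partner answers with a rate-limit response.
class RateLimitedSketchError < StandardError; end

# Assumed set of exceptions treated as network errors in this sketch.
NETWORK_ERROR_CLASSES = [Timeout::Error, SocketError, Errno::ECONNREFUSED, Errno::ECONNRESET].freeze

# Map an exception to one of the error_type label values listed in this MR.
def error_type_for(error)
  case error
  when *NETWORK_ERROR_CLASSES then 'network_error'
  when RateLimitedSketchError then 'rate_limit'
  else 'response_error' # fallback is an assumption of this sketch
  end
end

# Wrap one verification call and record duration, outcome, and network error class.
# `recorder` is any object responding to request, network_error, and duration.
def with_partner_metrics(partner, recorder)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  recorder.request(partner: partner, status: 'success', error_type: 'none')
  result
rescue StandardError => e
  type = error_type_for(e)
  recorder.request(partner: partner, status: 'failure', error_type: type)
  recorder.network_error(partner: partner, error_class: e.class.name) if type == 'network_error'
  raise
ensure
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  recorder.duration(partner: partner, seconds: elapsed)
end

# Tiny recorder used only to demonstrate the wrapper.
class ListRecorder
  attr_reader :events

  def initialize
    @events = []
  end

  def request(**labels)
    @events << [:request, labels]
  end

  def network_error(**labels)
    @events << [:network_error, labels]
  end

  def duration(partner:, seconds:)
    @events << [:duration, { partner: partner, seconds: seconds }]
  end
end

recorder = ListRecorder.new
with_partner_metrics('aws', recorder) { :verified }
```

Recording the duration in an ensure clause means failed calls contribute to the latency histogram too, which matches the "complete verification cycle" wording above. A monotonic clock is used so that wall-clock adjustments cannot produce negative durations.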
Metric Label Design
Partner values:
- aws - Amazon Web Services
- gcp - Google Cloud Platform
- postman - Postman API
Status values:
- success - Verification completed successfully
- failure - Verification failed (see error_type)
Error type values:
- none - No error (success case)
- network_error - Connection/timeout issues
- rate_limit - Rate limit exceeded
- response_error - Invalid/unparseable response
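To make the cardinality of these labels concrete, the sketch below enumerates the label sets the requests counter could emit. It assumes that success always pairs with error_type "none" and that failure always pairs with one of the other three error types; request_label_sets is a hypothetical helper, not part of this MR.

```ruby
PARTNERS = %w[aws gcp postman].freeze
FAILURE_ERROR_TYPES = %w[network_error rate_limit response_error].freeze

# Every label set the requests counter can emit under the pairing assumption above.
def request_label_sets
  PARTNERS.flat_map do |partner|
    success = { partner: partner, status: 'success', error_type: 'none' }
    failures = FAILURE_ERROR_TYPES.map do |type|
      { partner: partner, status: 'failure', error_type: type }
    end
    [success] + failures
  end
end

puts request_label_sets.size # 3 partners x (1 success + 3 failure) = 12 series
```

Under that assumption, validity_check_partner_api_requests_total is bounded at 12 series per process, regardless of traffic.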
How to set up and validate locally
1. Enable the feature
# In rails console
Feature.enable(:secret_detection_partner_token_verification)
2. Configure a test project with Secret Detection
project = Project.find_by_full_path('your-namespace/your-project')
project.security_setting.update!(validity_checks_enabled: true)
3. Trigger token verification
Push a commit with a test AWS/GCP/Postman token to trigger the verification flow.
4. View metrics
Navigate to http://localhost:3000/-/metrics and search for validity_check_:
# HELP validity_check_partner_api_duration_seconds Partner API response time in seconds
# TYPE validity_check_partner_api_duration_seconds histogram
validity_check_partner_api_duration_seconds_bucket{partner="aws",le="0.1"} 0
validity_check_partner_api_duration_seconds_bucket{partner="aws",le="0.25"} 1
validity_check_partner_api_duration_seconds_bucket{partner="aws",le="0.5"} 3
validity_check_partner_api_duration_seconds_sum{partner="aws"} 1.234
validity_check_partner_api_duration_seconds_count{partner="aws"} 5
# HELP validity_check_partner_api_requests_total Total partner API verification requests
# TYPE validity_check_partner_api_requests_total counter
validity_check_partner_api_requests_total{partner="aws",status="success",error_type="none"} 4
validity_check_partner_api_requests_total{partner="aws",status="failure",error_type="network_error"} 1
# HELP validity_check_network_errors_total Total network errors during partner API calls
# TYPE validity_check_network_errors_total counter
validity_check_network_errors_total{partner="aws",error_class="Timeout"} 1
# HELP validity_check_rate_limit_hits_total Total rate limit hits during token verification
# TYPE validity_check_rate_limit_hits_total counter
validity_check_rate_limit_hits_total{limit_type="partner_aws_api",project_id="123"} 2
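Instead of searching the page by hand, the relevant lines can be filtered out of the endpoint's text output. The helper below is a hypothetical convenience, not part of this MR; the commented usage assumes a GDK instance listening on localhost:3000 with the metrics endpoint reachable.

```ruby
# Keep only the sample lines and HELP/TYPE comments for one metric prefix.
def select_metric_lines(exposition_text, prefix = 'validity_check_')
  comment_pattern = /\A# (HELP|TYPE) #{Regexp.escape(prefix)}/

  exposition_text.each_line.map(&:chomp).select do |line|
    line.start_with?(prefix) || line.match?(comment_pattern)
  end
end

# Usage against a local GDK (needs a running instance, so it is left commented out):
#   require 'net/http'
#   body = Net::HTTP.get(URI('http://localhost:3000/-/metrics'))
#   puts select_metric_lines(body)
```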
Testing
Unit Tests
# Run metrics module specs
bundle exec rspec ee/spec/lib/gitlab/metrics/secret_detection/partner_tokens_spec.rb
# Run base client specs with metrics
bundle exec rspec ee/spec/lib/security/secret_detection/partner_tokens/base_client_spec.rb
# Run partner tokens client specs
bundle exec rspec ee/spec/lib/security/secret_detection/partner_tokens_client_spec.rb
Documentation
- Updated doc/administration/monitoring/prometheus/gitlab_metrics.md with new metrics
- Added dedicated section for Secret Detection partner token verification metrics
- Documented all labels and their possible values
- Provided alert threshold recommendations
- Added example alert rules reference
Metrics in Prometheus
Example metrics output from the /-/metrics endpoint:
# HELP validity_check_partner_api_duration_seconds Partner API response time in seconds
# TYPE validity_check_partner_api_duration_seconds histogram
validity_check_partner_api_duration_seconds_bucket{partner="aws",le="0.1"} 45
validity_check_partner_api_duration_seconds_bucket{partner="aws",le="0.25"} 89
validity_check_partner_api_duration_seconds_bucket{partner="aws",le="0.5"} 142
validity_check_partner_api_duration_seconds_bucket{partner="aws",le="1"} 178
validity_check_partner_api_duration_seconds_bucket{partner="aws",le="2"} 185
validity_check_partner_api_duration_seconds_bucket{partner="aws",le="5"} 187
validity_check_partner_api_duration_seconds_bucket{partner="aws",le="10"} 187
validity_check_partner_api_duration_seconds_bucket{partner="aws",le="+Inf"} 187
validity_check_partner_api_duration_seconds_sum{partner="aws"} 89.234
validity_check_partner_api_duration_seconds_count{partner="aws"} 187
validity_check_partner_api_duration_seconds_bucket{partner="gcp",le="0.1"} 12
validity_check_partner_api_duration_seconds_bucket{partner="gcp",le="0.25"} 34
...
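As a worked reading of the AWS histogram above, the mean latency is the sum divided by the count, and the share of calls at or under one second is the le="1" bucket divided by the count. The sketch below reproduces that arithmetic from three of the sample lines; sample_value is a hypothetical helper.

```ruby
SAMPLE = <<~METRICS
  validity_check_partner_api_duration_seconds_bucket{partner="aws",le="1"} 178
  validity_check_partner_api_duration_seconds_sum{partner="aws"} 89.234
  validity_check_partner_api_duration_seconds_count{partner="aws"} 187
METRICS

# Read the value of the first sample line whose name-and-labels part matches exactly.
def sample_value(text, series)
  line = text.each_line.find { |candidate| candidate.start_with?("#{series} ") }
  line && Float(line.split.last)
end

base = 'validity_check_partner_api_duration_seconds'
sum = sample_value(SAMPLE, %(#{base}_sum{partner="aws"}))
count = sample_value(SAMPLE, %(#{base}_count{partner="aws"}))
within_1s = sample_value(SAMPLE, %(#{base}_bucket{partner="aws",le="1"}))

mean_seconds = (sum / count).round(3)          # 89.234 / 187
share_within_1s = (within_1s / count).round(3) # 178 / 187

puts "mean=#{mean_seconds}s share<=1s=#{share_within_1s}"
# prints "mean=0.477s share<=1s=0.952"
```

In this sample, roughly 95% of AWS verifications finish within one second, and the 5 and 10 second buckets hold the same count, so no call exceeded five seconds.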
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Availability and Testing
- Feature flag added: Not required - metrics collection has minimal overhead
- Covered with tests (unit and integration)
- Tested in GDK environment
- Documentation updated
Performance
- Evaluated metric cardinality - all labels are low/constant cardinality
- Overhead measured - < 1ms per verification
- No high-cardinality labels (project_id only used in rare rate limit cases)
Security
- No sensitive data in metric labels
- No PII in metric values
- Token values never logged or exposed in metrics
Monitoring
- Example alert rules provided
- Runbook considerations documented in metrics docs
- Labels designed for effective alerting and debugging
Merge Request Checklist
- Assign to reviewer: @reviewer-username
- Assign to maintainer: @maintainer-username
- Add ~"workflow::ready for review" label when ready
- Request review from Secure team: @gitlab-org/secure/secret-detection
- Request review from monitoring expert: @gitlab-org/maintainers/observability