
Add Prometheus metrics for Secret Detection partner token verification

What does this MR do and why?

This MR adds four Prometheus metrics for the Secret Detection partner token verification system, to improve observability and enable proactive monitoring of the external partner API integrations.

Problem: Currently, when GitLab verifies tokens with external partner APIs (AWS, GCP, Postman), we have limited visibility into:

  • API response times and performance degradation
  • Error rates and failure patterns
  • Network connectivity issues
  • Rate limiting behavior

This lack of observability makes it difficult to:

  • Detect and diagnose issues before they impact users
  • Understand which partners have reliability problems
  • Optimize rate limiting configurations
  • Provide SLOs for the feature

Solution: Implement four Prometheus metrics that track:

  1. API Duration - Response time histogram to identify latency issues
  2. API Requests - Success/failure counters with error classification
  3. Network Errors - Detailed error tracking by type
  4. Rate Limit Hits - Project-level rate limit monitoring

Related Issues

https://gitlab.com/gitlab-org/gitlab/-/issues/567735

Implementation Details

New Metrics Module

Created Gitlab::Metrics::SecretDetection::PartnerTokens module with four metrics:

validity_check_partner_api_duration_seconds (Histogram)
  Labels: partner
  Buckets: [0.1, 0.25, 0.5, 1, 2, 5, 10]
  
validity_check_partner_api_requests_total (Counter)
  Labels: partner, status, error_type
  
validity_check_network_errors_total (Counter)
  Labels: partner, error_class
  
validity_check_rate_limit_hits_total (Counter)
  Labels: limit_type, project_id
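
To illustrate how the duration histogram's cumulative buckets behave, here is a minimal plain-Ruby sketch. It is not the module's code: SketchHistogram is a hypothetical stand-in for the Prometheus client, and only the bucket boundaries are taken from the definition above.

```ruby
# Hypothetical stand-in for a Prometheus histogram; not GitLab's implementation.
class SketchHistogram
  # Bucket upper bounds (le) from the metric definition in this MR.
  BUCKETS = [0.1, 0.25, 0.5, 1, 2, 5, 10].freeze

  attr_reader :sum, :count

  def initialize
    @bucket_counts = Hash.new(0)
    @sum = 0.0
    @count = 0
  end

  # Record one observation. Prometheus buckets are cumulative, so a value
  # increments every bucket whose upper bound is >= the value.
  def observe(seconds)
    BUCKETS.each { |le| @bucket_counts[le] += 1 if seconds <= le }
    @sum += seconds
    @count += 1
  end

  # Number of observations at or below the given bucket bound.
  def bucket(le)
    @bucket_counts[le]
  end
end

histogram = SketchHistogram.new
[0.08, 0.3, 0.3, 1.5, 12.0].each { |duration| histogram.observe(duration) }

puts histogram.bucket(0.1) # observations at or below 0.1s
puts histogram.bucket(0.5) # observations at or below 0.5s
puts histogram.count       # all observations (the +Inf bucket)
```

The 12.0s observation lands in no finite bucket and is only visible in the count, which is why requests slower than 10s would show up only in the +Inf bucket.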

Integration Points

  1. BaseClient - Records metrics for all partner API calls
    • Duration tracking for complete verification cycle
    • Success/failure tracking with error classification
    • Network error categorization
  2. PartnerTokensClient - Records rate limit hits
    • Per-project rate limit tracking
    • Detailed rate limit type identification
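
The BaseClient integration can be pictured as a wrapper around each partner call. The sketch below is an assumption about the general shape, not the code in this MR: SketchRecorder and with_metrics are hypothetical names, and every StandardError is treated as a network error for brevity.

```ruby
# Hypothetical in-memory recorder standing in for the metrics module.
class SketchRecorder
  attr_reader :durations, :requests

  def initialize
    @durations = []
    @requests = Hash.new(0)
  end

  def observe_duration(partner, seconds)
    @durations << [partner, seconds]
  end

  def increment_request(partner, status, error_type)
    @requests[[partner, status, error_type]] += 1
  end
end

# Times the block with a monotonic clock and records the outcome.
# Assumption: any StandardError counts as a network error in this sketch.
def with_metrics(recorder, partner)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  recorder.increment_request(partner, 'success', 'none')
  result
rescue StandardError
  recorder.increment_request(partner, 'failure', 'network_error')
  raise
ensure
  # Runs on both paths, so failed calls also contribute to the histogram.
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  recorder.observe_duration(partner, elapsed)
end

recorder = SketchRecorder.new
with_metrics(recorder, 'aws') { :verified }

begin
  with_metrics(recorder, 'aws') { raise IOError, 'connection reset' }
rescue IOError
  # The wrapper re-raises after recording the failure.
end

puts recorder.requests.inspect
puts recorder.durations.size
```

Recording the duration in the ensure clause keeps slow failures, such as timeouts, in the latency data instead of hiding them.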

Metric Label Design

Partner values:

  • aws - Amazon Web Services
  • gcp - Google Cloud Platform
  • postman - Postman API

Status values:

  • success - Verification completed successfully
  • failure - Verification failed (see error_type)

Error type values:

  • none - No error (success case)
  • network_error - Connection/timeout issues
  • rate_limit - Rate limit exceeded
  • response_error - Invalid/unparseable response
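
One way to derive the status and error_type labels from an outcome is a single classification function. This is a hedged sketch: SketchRateLimitError, NETWORK_ERRORS, and labels_for are hypothetical, the list of network exception classes is an assumption, and so is the choice to file unknown errors under response_error.

```ruby
require 'json'
require 'socket'
require 'timeout'

# Hypothetical error raised when a rate limit is hit; not a GitLab class.
class SketchRateLimitError < StandardError; end

# Assumed set of exceptions treated as connection/timeout issues.
NETWORK_ERRORS = [Timeout::Error, SocketError, Errno::ECONNREFUSED, Errno::ECONNRESET].freeze

# Returns the [status, error_type] label pair for an outcome.
# Pass nil for a successful verification.
def labels_for(error)
  case error
  when nil then %w[success none]
  when SketchRateLimitError then %w[failure rate_limit]
  when JSON::ParserError then %w[failure response_error]
  when *NETWORK_ERRORS then %w[failure network_error]
  else %w[failure response_error] # assumption: fallback for unknown errors
  end
end

puts labels_for(nil).inspect
puts labels_for(Timeout::Error.new).inspect
puts labels_for(SketchRateLimitError.new).inspect
```

Keeping the mapping in one place means the label set stays closed, so a new exception class cannot introduce a new label value by accident.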

How to set up and validate locally

1. Enable the feature

# In rails console
Feature.enable(:secret_detection_partner_token_verification)

2. Configure a test project with Secret Detection

project = Project.find_by_full_path('your-namespace/your-project')
project.security_setting.update!(validity_checks_enabled: true)

3. Trigger token verification

Push a commit with a test AWS/GCP/Postman token to trigger the verification flow.

4. View metrics

Navigate to http://localhost:3000/-/metrics and search for validity_check_:

# HELP validity_check_partner_api_duration_seconds Partner API response time in seconds
# TYPE validity_check_partner_api_duration_seconds histogram
validity_check_partner_api_duration_seconds_bucket{partner="aws",le="0.1"} 0
validity_check_partner_api_duration_seconds_bucket{partner="aws",le="0.25"} 1
validity_check_partner_api_duration_seconds_bucket{partner="aws",le="0.5"} 3
validity_check_partner_api_duration_seconds_sum{partner="aws"} 1.234
validity_check_partner_api_duration_seconds_count{partner="aws"} 5

# HELP validity_check_partner_api_requests_total Total partner API verification requests
# TYPE validity_check_partner_api_requests_total counter
validity_check_partner_api_requests_total{partner="aws",status="success",error_type="none"} 4
validity_check_partner_api_requests_total{partner="aws",status="failure",error_type="network_error"} 1

# HELP validity_check_network_errors_total Total network errors during partner API calls
# TYPE validity_check_network_errors_total counter
validity_check_network_errors_total{partner="aws",error_class="Timeout"} 1

# HELP validity_check_rate_limit_hits_total Total rate limit hits during token verification
# TYPE validity_check_rate_limit_hits_total counter
validity_check_rate_limit_hits_total{limit_type="partner_aws_api",project_id="123"} 2
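
Instead of searching the page by hand, the output can be filtered with a few lines of Ruby. This is a sketch under assumptions: validity_check_lines and fetch_metrics are hypothetical helpers, and whether the endpoint is reachable without extra configuration depends on your local setup.

```ruby
require 'net/http'
require 'uri'

# Keeps only exposition lines that mention the validity_check_ prefix,
# including their # HELP and # TYPE comment lines.
def validity_check_lines(metrics_text)
  metrics_text.each_line.map(&:chomp).select { |line| line.include?('validity_check_') }
end

# Fetches the metrics page from the local address used in step 4.
# Not called below, so the sketch runs without a GDK instance.
def fetch_metrics(url = 'http://localhost:3000/-/metrics')
  Net::HTTP.get(URI(url))
end

sample = <<~TEXT
  # TYPE validity_check_network_errors_total counter
  validity_check_network_errors_total{partner="aws",error_class="Timeout"} 1
  http_requests_total{method="get"} 10
TEXT

puts validity_check_lines(sample)
```

Against a running GDK, `puts validity_check_lines(fetch_metrics)` would print the same kind of filtered list.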

Testing

Unit Tests

# Run metrics module specs
bundle exec rspec ee/spec/lib/gitlab/metrics/secret_detection/partner_tokens_spec.rb

# Run base client specs with metrics
bundle exec rspec ee/spec/lib/security/secret_detection/partner_tokens/base_client_spec.rb

# Run partner tokens client specs
bundle exec rspec ee/spec/lib/security/secret_detection/partner_tokens_client_spec.rb

Documentation

  • Updated doc/administration/monitoring/prometheus/gitlab_metrics.md with new metrics
  • Added dedicated section for Secret Detection partner token verification metrics
  • Documented all labels and their possible values
  • Provided alert threshold recommendations
  • Added example alert rules reference

Metrics in Prometheus

Example metrics output from the /-/metrics endpoint:
# HELP validity_check_partner_api_duration_seconds Partner API response time in seconds
# TYPE validity_check_partner_api_duration_seconds histogram
validity_check_partner_api_duration_seconds_bucket{partner="aws",le="0.1"} 45
validity_check_partner_api_duration_seconds_bucket{partner="aws",le="0.25"} 89
validity_check_partner_api_duration_seconds_bucket{partner="aws",le="0.5"} 142
validity_check_partner_api_duration_seconds_bucket{partner="aws",le="1"} 178
validity_check_partner_api_duration_seconds_bucket{partner="aws",le="2"} 185
validity_check_partner_api_duration_seconds_bucket{partner="aws",le="5"} 187
validity_check_partner_api_duration_seconds_bucket{partner="aws",le="10"} 187
validity_check_partner_api_duration_seconds_bucket{partner="aws",le="+Inf"} 187
validity_check_partner_api_duration_seconds_sum{partner="aws"} 89.234
validity_check_partner_api_duration_seconds_count{partner="aws"} 187

validity_check_partner_api_duration_seconds_bucket{partner="gcp",le="0.1"} 12
validity_check_partner_api_duration_seconds_bucket{partner="gcp",le="0.25"} 34
...
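
As a worked example of reading this output, the sketch below derives a few summary figures from the partner="aws" series. The numbers are copied from the sample above; the variable names are illustrative, and the percentile is only an upper bound at bucket resolution, not an interpolated estimate.

```ruby
# Values copied from the partner="aws" sample output above.
sum_seconds = 89.234
count = 187
buckets = { 0.1 => 45, 0.25 => 89, 0.5 => 142, 1 => 178, 2 => 185, 5 => 187, 10 => 187 }

# Mean latency is the histogram's sum divided by its count.
average = sum_seconds / count

# Cumulative buckets give the share of requests at or below a bound directly.
share_under_half_second = buckets[0.5].to_f / count

# Smallest bucket bound that covers at least 95% of requests.
p95_bound = buckets.find { |_le, cumulative| cumulative >= 0.95 * count }.first

puts format('average: %.3fs', average)
puts format('<= 0.5s: %.1f%%', share_under_half_second * 100)
puts "p95 upper bound: #{p95_bound}s"
```

In this sample, 178 of 187 requests fall at or below the 1s bucket, which is the first bucket covering 95% of requests.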

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Availability and Testing

  • Feature flag added: Not required - metrics collection has minimal overhead
  • Covered with tests (unit and integration)
  • Tested in GDK environment
  • Documentation updated

Performance

  • Evaluated metric cardinality - all labels except project_id are low/constant cardinality
  • Overhead measured - < 1ms per verification
  • project_id is the only potentially high-cardinality label, and it is only set on rate limit hits, which are rare

Security

  • No sensitive data in metric labels
  • No PII in metric values
  • Token values never logged or exposed in metrics

Monitoring

  • Example alert rules provided
  • Runbook considerations documented in metrics docs
  • Labels designed for effective alerting and debugging

Merge Request Checklist

  • Assign to reviewer: @reviewer-username
  • Assign to maintainer: @maintainer-username
  • Add ~"workflow::ready for review" label when ready
  • Request review from Secure team: @gitlab-org/secure/secret-detection
  • Request review from monitoring expert: @gitlab-org/maintainers/observability
Edited by Aditya Tiwari
