High volume of Redis calls in /api/:version/internal/kubernetes/usage_metrics endpoint causing performance degradation

Summary

The /api/:version/internal/kubernetes/usage_metrics endpoint is generating an excessive number of Redis calls, leading to degraded server performance. This is likely due to an inefficient implementation of the increment_count_events method in the AgentHelpers module.

Steps to reproduce

  1. Send a POST request to the /api/:version/internal/kubernetes/usage_metrics endpoint
  2. Include a payload with multiple counters and high increment values
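
To make step 2 concrete, here is a rough reproduction sketch in Ruby. The counter names, the increment values, and the Gitlab-Kas-Api-Request JWT placeholder are assumptions for illustration; in real traffic the request comes from gitlab-kas with a KAS-signed token.

  # Hypothetical reproduction sketch: posts a counters payload with large
  # increment values. Counter names and the auth token are illustrative only.
  require 'net/http'
  require 'uri'
  require 'json'

  uri = URI('https://gitlab.example.com/api/v4/internal/kubernetes/usage_metrics')

  request = Net::HTTP::Post.new(uri, 'Content-Type' => 'application/json')
  # In real traffic this header carries a KAS-signed JWT (placeholder here).
  request['Gitlab-Kas-Api-Request'] = '<kas-signed-jwt>'

  # Several counters with high increment values; on the Rails side each single
  # increment currently turns into its own Redis call.
  request.body = {
    counters: {
      'k8s_api_proxy_request' => 100_000,            # hypothetical counter name
      'flux_git_push_notifications_total' => 100_000 # hypothetical counter name
    }
  }.to_json

  response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) { |http| http.request(request) }
  puts "#{response.code} #{response.body}"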

What is the current bug behavior?

The endpoint makes an extremely high number of Redis calls (227,255 observed in a single request for one customer). Server performance is degraded, with high load and long response times.

What is the expected correct behavior?

  • The endpoint should process the metrics efficiently with a minimal number of Redis calls
  • Server response time should be within acceptable limits
  • Redis operations should be batched or optimized to reduce overall Redis duration
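
As a rough illustration of what "batched or optimized" could look like at the Redis level, here is a minimal redis-rb sketch (not GitLab code; key names and values are hypothetical): many logical increments collapse into one pipelined round trip.

  # Minimal sketch: batch counter increments with redis-rb pipelining so that
  # N logical increments cost a handful of Redis commands instead of N calls.
  require 'redis'

  redis = Redis.new(url: ENV.fetch('REDIS_URL', 'redis://localhost:6379'))

  # Hypothetical counter keys and increments.
  counters = {
    'usage:k8s_api_proxy_request' => 100_000,
    'usage:flux_git_push_notifications_total' => 50_000
  }

  redis.pipelined do |pipeline|
    counters.each do |key, increment|
      # INCRBY applies the whole increment at once, giving the same final value
      # as issuing `increment` individual INCR calls.
      pipeline.incrby(key, increment)
    end
  end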

Relevant logs and/or screenshots

Shared with us by a customer (sensitive data omitted):

  "severity": "INFO",
  "duration_s": 46.00495,
  "db_duration_s": 0.00028,
  "view_duration_s": 46.00467,
  "route": "/api/:version/internal/kubernetes/usage_metrics",
  "queue_duration_s": 0.006702,
  "redis_calls": 227255,
  "redis_duration_s": 12.201167,
  "redis_read_bytes": 1969920,
  "redis_write_bytes": 12688412,
  "redis_feature_flag_calls": 1,
  "redis_feature_flag_duration_s": 0.000092,
  "redis_feature_flag_read_bytes": 404,
  "redis_feature_flag_write_bytes": 35,
  "redis_shared_state_calls": 227254,
  "redis_shared_state_duration_s": 12.201075,
  "redis_shared_state_read_bytes": 1969516,
  "redis_shared_state_write_bytes": 12688377,

gitlab-kas times out at 20s and logs the following:

  {"time":"<timestamp>","level":"ERROR","msg":"Failed to send usage data","mod_name":"usage_metrics","error":"Post \"https://<hostname>/api/v4/internal/kubernetes/usage_metrics\": http2: timeout awaiting response headers"}

Output of checks

This was reported after an upgrade to v17.4.1-ee.

Possible fixes

I think the issue is occurring here: https://gitlab.com/gitlab-org/gitlab/-/blob/17-4-stable-ee/lib/api/internal/kubernetes.rb?ref_type=heads#L139-148.

This method:

  • Iterates over each counter in the counters parameter.
  • For each counter, it calls Gitlab::InternalEvents.track_event(event) incr times.
  • This means that for each non-zero counter, it makes incr individual calls to track_event, each of which writes to Redis (see the sketch after this list).
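
A self-contained sketch of that pattern (Gitlab::InternalEvents is stubbed here; this is not the upstream implementation, and the split between the two counters is purely illustrative):

  # Stub that only counts calls; each call stands in for one Redis write.
  module Gitlab
    module InternalEvents
      class << self
        attr_reader :calls

        def track_event(_event)
          @calls = (@calls || 0) + 1
        end
      end
    end
  end

  # Illustrative counters whose increments add up to the call count in the log.
  counters = { 'k8s_api_proxy_request' => 150_000, 'flux_git_push_notifications_total' => 77_254 }

  counters.each do |event, incr|
    next if incr.zero?

    # incr separate track_event calls per counter, each hitting Redis in the real code.
    incr.times { Gitlab::InternalEvents.track_event(event) }
  end

  puts Gitlab::InternalEvents.calls # => 227254, matching the redis_shared_state_calls seen above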

This was introduced with 2d5c55e4.

I think we should aggregate the events into a hash and then make a single call to a new track_events method.
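
A hedged sketch of that shape (track_events does not exist upstream; it is a hypothetical method, stubbed here only to show how aggregation collapses the call count):

  # Hypothetical batched API: takes a { event_name => increment } hash so the
  # backend can apply each increment with a single INCRBY or one pipelined batch.
  module Gitlab
    module InternalEvents
      class << self
        attr_reader :calls

        def track_events(events)
          @calls = (@calls || 0) + 1
          events.each { |event, incr| puts "would apply #{event} += #{incr} in one Redis write" }
        end
      end
    end
  end

  counters = { 'k8s_api_proxy_request' => 150_000, 'flux_git_push_notifications_total' => 77_254 }

  # Aggregate the non-zero counters into a hash, then make a single call.
  events = counters.reject { |_event, incr| incr.to_i <= 0 }
  Gitlab::InternalEvents.track_events(events) unless events.empty?

  puts Gitlab::InternalEvents.calls # => 1 tracking call instead of hundreds of thousands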

Workaround

Setting gitlab_kas['metrics_usage_reporting_period'] = 0 followed by a reconfigure disables usage reporting and thus works around the issue.
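
For reference, on a Linux package (Omnibus) installation the workaround looks like this (assuming gitlab-kas is configured via the same gitlab.rb):

  # /etc/gitlab/gitlab.rb
  gitlab_kas['metrics_usage_reporting_period'] = 0

  # Apply the change:
  #   sudo gitlab-ctl reconfigure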