Make Geo metrics collection robust by handling query timeouts

What does this MR do and why?

Context from issue #471105 (closed)

Geo::MetricsUpdateWorker is inherently unreliable in its current form. Its design is a root cause of many support tickets, especially for deployments with a lot of data.

Every time a Geo count query fails, it blocks all Geo metrics collection and observability (example).

In this case, another team modified a relation, which led to a batch-count-related query timing out for a customer (I think they have tens of millions of job artifact rows). It is predictable that this class of problem will happen from time to time, so the metrics collection design should take it into account.

This MR addresses the above by:

  • Making Geo metrics collection more robust by:
    • handling query timeouts so that the remaining count/metric queries can continue (these queries live in our GeoNodeStatus model)
    • checking before each metric query whether we're approaching the job TTL, and raising an error to abort if so
  • Adding test coverage for the new GeoNodeStatus behaviour
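The per-metric rescue pattern can be sketched roughly as follows. This is a minimal stand-alone illustration, not the actual GitLab code: `QueryCanceled` stands in for `ActiveRecord::QueryCanceled`, and `MetricsCollector`/`track_exception` are hypothetical stand-ins for `GeoNodeStatus` and `Gitlab::ErrorTracking.track_exception`:

```ruby
# Sketch of the per-metric rescue pattern (illustrative names, not GitLab code).
class QueryCanceled < StandardError; end # stand-in for ActiveRecord::QueryCanceled

class MetricsCollector
  attr_reader :tracked, :collected

  def initialize
    @tracked = []
    @collected = {}
  end

  # Stand-in for Gitlab::ErrorTracking.track_exception
  def track_exception(error, extra)
    @tracked << { error: error, extra: extra }
  end

  # Each metric query is wrapped individually, so a single timeout
  # is recorded and skipped instead of aborting the whole run.
  def collect_metric(name)
    @collected[name] = yield
  rescue QueryCanceled => e
    track_exception(e, metric: name)
  end
end

collector = MetricsCollector.new
collector.collect_metric(:job_artifacts_count) { raise QueryCanceled, "statement timeout" }
collector.collect_metric(:job_artifacts_checksummed_count) { 42 }
```

The first metric's timeout is tracked, and the second metric is still collected, which is the behaviour this MR is after.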

Related to #471105 (closed)

How to set up and validate locally


  1. Ensure you have Geo set up on your GitLab instance
  2. Manually simulate 55 minutes having elapsed by editing the started_at attribute in geo_node_status.rb:

```ruby
def self.current_node_status(timeout: nil)
  current_node = Gitlab::Geo.current_node
  return unless current_node

  status = current_node.find_or_build_status

  if timeout
    status.started_at = 55.minutes.ago # simulate a job that started 55 minutes ago
    # ... rest of the method unchanged
```
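The TTL guard triggered by this simulation can be sketched as follows. JobTimeoutApproachingError and abort_if_job_ttl_approaching! are names from this MR, but the method body and the 5-minute buffer below are assumptions for illustration, not necessarily what the MR implements:

```ruby
# Sketch of the job TTL guard (assumed implementation details).
class JobTimeoutApproachingError < StandardError; end

TTL_BUFFER = 5 * 60 # assumed safety buffer in seconds

# Raise before the Sidekiq job's TTL is reached, so the worker can stop
# collecting further metrics and still persist what it has so far.
def abort_if_job_ttl_approaching!(started_at:, timeout:, now: Time.now)
  elapsed = now - started_at
  return if elapsed < (timeout - TTL_BUFFER)

  raise JobTimeoutApproachingError,
        "#{elapsed.round}s elapsed of #{timeout}s timeout; aborting metric collection"
end
```

With started_at set to 55 minutes ago and a 1-hour timeout, the elapsed time crosses the buffered threshold, so the guard raises, matching the behaviour the validation steps below are checking for.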
  3. Test the job TTL logic:

```shell
bundle exec rails c
```

```ruby
# Get current node status with a timeout that should trigger the TTL check
status = GeoNodeStatus.current_node_status(
  timeout: 1.hour
)
# Should see JobTimeoutApproachingError
```
  4. Test the ActiveRecord::QueryCanceled handling by updating the collect_metric method in geo_node_status.rb:

```ruby
def collect_metric(replicator, metric_name)
  abort_if_job_ttl_approaching! if timeout # If timeout is present, we know this was called from the metrics worker

  field_name = "#{replicator.replicable_name_plural}_#{metric_name}"

  # Manually trigger the error for the first metric (count)
  if metric_name == :count
    raise ActiveRecord::QueryCanceled.new("ERROR: canceling statement due to statement timeout")
  end

  # Add debugging for the second metric (checksummed_count)
  if metric_name == :checksummed_count
    binding.pry # or use Logger / puts to output something custom
  end

  value = yield
  public_send("#{field_name}=", value) # rubocop:disable GitlabSecurity/PublicSend
rescue ActiveRecord::QueryCanceled => e
  Gitlab::ErrorTracking.track_exception(e, extra: { metric: field_name })
end
```

Then run:

```shell
bundle exec rails c
```

```ruby
status = GeoNodeStatus.current_node_status(
  timeout: 1.hour
)
```

When you run it:

  • The first metric (count) should fail with ActiveRecord::QueryCanceled, which is tracked instead of aborting the run
  • You'll hit the pry debugger on the second metric (checksummed_count), showing that metrics collection continues
Edited by Scott Murray
