Make Geo metrics collection robust by handling query timeouts
What does this MR do and why?
Context from issue #471105 (closed)
`Geo::MetricsUpdateWorker` is inherently unreliable in its current form. Its design is a root cause of many support tickets, especially for deployments with a lot of data. Every time a Geo count query fails, it blocks all Geo metrics collection and observability (example). In this case, another team modified a relation, which led to a batch-count-related query timing out for a customer (I think they have 10s of millions of job artifact rows). It is predictable that this class of problem will happen from time to time, so the metrics collection design should take it into account.
This MR addresses the above by:

- Making Geo metrics collection more robust by:
  - handling query timeouts so that other count/metric queries can continue (these queries occur in our `GeoNodeStatus` model)
  - checking before each metric query whether we're approaching the job TTL, and raising an error to abort if so
- Adding test coverage for the new `GeoNodeStatus` behaviour
Related to #471105 (closed)
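The rescue-and-continue plus TTL-abort pattern described above can be sketched in isolation. This is an illustrative, self-contained sketch only: the class, the `TTL_BUFFER_SECONDS` value, and the error/metric bookkeeping here are simplified stand-ins for the real `GeoNodeStatus` code, and `QueryCanceled` stands in for `ActiveRecord::QueryCanceled`.

```ruby
class JobTimeoutApproachingError < StandardError; end
class QueryCanceled < StandardError; end # stand-in for ActiveRecord::QueryCanceled

class MetricsCollectorSketch
  TTL_BUFFER_SECONDS = 300 # assumed safety margin before the job TTL

  attr_reader :metrics, :failed_metrics

  def initialize(started_at:, timeout:)
    @started_at = started_at
    @timeout = timeout
    @metrics = {}
    @failed_metrics = []
  end

  def collect_metric(name)
    abort_if_job_ttl_approaching!

    @metrics[name] = yield
  rescue QueryCanceled
    # A single timed-out count query no longer blocks the remaining metrics;
    # record the failure and carry on with the next metric.
    @failed_metrics << name
  end

  private

  def abort_if_job_ttl_approaching!
    elapsed = Time.now - @started_at
    raise JobTimeoutApproachingError if elapsed > @timeout - TTL_BUFFER_SECONDS
  end
end

collector = MetricsCollectorSketch.new(started_at: Time.now, timeout: 3600)
collector.collect_metric(:count) { raise QueryCanceled }  # simulate a statement timeout
collector.collect_metric(:checksummed_count) { 42 }       # still collected

collector.failed_metrics # => [:count]
collector.metrics        # => { checksummed_count: 42 }
```

Note that `JobTimeoutApproachingError` is raised (rather than rescued) so the whole job aborts cleanly instead of starting a query it cannot finish, while `QueryCanceled` is rescued per metric so one slow count only loses that single data point.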
How to set up and validate locally
- Ensure you have Geo set up on your GitLab instance
- Manually simulate 55 minutes having elapsed by editing the `started_at` attribute in `geo_node_status.rb`:
  ```ruby
  def self.current_node_status(timeout: nil)
    current_node = Gitlab::Geo.current_node
    return unless current_node

    status = current_node.find_or_build_status

    if timeout
      status.started_at = 55.minutes.ago
  ```
- Test the query timeout logic:

  ```shell
  bundle exec rails c
  ```

  ```ruby
  # Get current node status with timing that should trigger the TTL check
  status = GeoNodeStatus.current_node_status(timeout: 1.hour)

  # Should see JobTimeoutApproachingError
  ```
- Test the `ActiveRecord::QueryCanceled` logic by updating the `collect_metric` method in `geo_node_status.rb`:
  ```ruby
  def collect_metric(replicator, metric_name)
    abort_if_job_ttl_approaching! if timeout # If timeout is present, we know this was called from the metrics worker

    field_name = "#{replicator.replicable_name_plural}_#{metric_name}"

    # Manually trigger the error for the first metric (count)
    if metric_name == :count
      raise ActiveRecord::QueryCanceled.new("ERROR: canceling statement due to statement timeout")
    end

    # Add debugging for the second metric (checksummed_count)
    if metric_name == :checksummed_count
      binding.pry # or use Logger / puts to output something custom
    end

    value = yield

    public_send("#{field_name}=", value) # rubocop:disable GitlabSecurity/PublicSend
  rescue ActiveRecord::QueryCanceled => e
    Gitlab::ErrorTracking.track_exception(e, extra: { metric: field_name })
  end
  ```
  Then run:

  ```shell
  bundle exec rails c
  ```

  ```ruby
  status = GeoNodeStatus.current_node_status(timeout: 1.hour)
  ```
  You should observe that:

  - The first metric fails with `ActiveRecord::QueryCanceled`
  - You hit the pry debugger on the second metric, which shows that metrics collection continues past the failed query