Platform: Add metrics and logs to the query API

Problem to solve

We need to monitor the query API for:

  • API request/response times
  • API uptime
  • Feature usage
  • Resource usage
  • Slow or expensive queries
  • Errors
  • Abuse and rate limiting

Technical considerations

Per the design document, we must meet the following requirements:

  • All logs should be sent to Kibana.
  • All feature usage metrics should be sent to our Internal Analytics.
    • Note: It's unclear how feasible this is from the platform. We may need to use Service Ping or the Events Tracking API instead.
  • All other metrics should be sent to Prometheus.
  • Logs must not contain unnecessary personally identifiable information (PII), secrets, or keys. Every new logging call must be reviewed to confirm it does not leak information that should not be logged.
  • ClickHouse queries must be logged in redacted form, with each placeholder replaced by a ?. Each query log entry must include a hash of the redacted query, so that identical or similar queries are easy to compare and find.
  • Errors must count toward our error budget, so we can track improvements over time.
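A minimal Python sketch of the redaction and hashing requirement, assuming the query API builds its ClickHouse queries with ClickHouse's {name:Type} query-parameter syntax. The function names, the log field names, the choice of SHA-256, and the whitespace normalization step are illustrative assumptions, not an agreed design. The sketch only redacts parameter placeholders: any literal values interpolated directly into the SQL string would still reach the log.

```python
import hashlib
import json
import re

# Assumption: queries use ClickHouse query parameters such as {project_id:UInt64}.
# [^{}]+ also covers types with parentheses, e.g. {ids:Array(UInt64)}.
_PLACEHOLDER = re.compile(r"\{\s*\w+\s*:\s*[^{}]+\}")
_WHITESPACE = re.compile(r"\s+")


def redact_query(sql: str) -> str:
    """Replace every query parameter with ? and collapse runs of whitespace."""
    redacted = _PLACEHOLDER.sub("?", sql)
    # Normalizing whitespace (an assumption) makes differently formatted copies
    # of the same query produce the same hash.
    return _WHITESPACE.sub(" ", redacted).strip()


def query_hash(redacted_sql: str) -> str:
    """Hex SHA-256 digest of the redacted query (hash choice is an assumption)."""
    return hashlib.sha256(redacted_sql.encode("utf-8")).hexdigest()


def build_query_log_entry(sql: str, params: dict, duration_s: float) -> dict:
    """Hypothetical structured log entry; parameter values are never logged."""
    redacted = redact_query(sql)
    return {
        "query": redacted,
        "query_hash": query_hash(redacted),
        "param_count": len(params),
        "duration_s": round(duration_s, 3),
    }


# Usage example with a hypothetical query and parameters.
sql = """SELECT count() FROM events
WHERE project_id = {project_id:UInt64}
  AND created_at >= {from:DateTime}"""
entry = build_query_log_entry(sql, {"project_id": 42, "from": "2024-01-01"}, 0.1234)
print(json.dumps(entry))
```

Because the hash is computed over the redacted text, the same query issued for different projects or time ranges shares one query_hash, which is what makes grouping similar queries in Kibana straightforward.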

We may also need to add the query API to the metrics catalog.

Edited by Robert Hunt