Log duration and context from a random sample of database queries
GitLab currently has good instrumentation for identifying which queries cause load on the database, in the form of pg_stat_statements, pg_stat_kcache, and other database utilities.
However, a single query can be issued from several places in the application, and these utilities cannot distinguish between callsites.
Add low-frequency sampled logging to all query executions in Rails, capturing enough information to determine what proportion of a query's calls each callsite is responsible for:
- a stacktrace
- the current Sidekiq worker, if any
- the current web request path, if any
- the duration of the query
- a fingerprint for the query
This is far too much data to log for every query, so log only a small random sample. The average GitLab page makes fewer than 100 database queries, so a sampling ratio of 1/10,000 yields roughly one logged query per 100 requests; start there, then raise the ratio to at most 1/100 depending on log volume.
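A minimal sketch of the sampling and payload-building logic, in plain Ruby. The class and constant names (`QuerySampler`, `DEFAULT_RATIO`) are illustrative, not existing GitLab code, and the fingerprint here is a crude literal-stripping hash rather than a real query fingerprinter:

```ruby
require 'digest'

# Hypothetical sampler: decides whether to log a given query execution and
# builds the log payload listed above. Names are illustrative only.
class QuerySampler
  DEFAULT_RATIO = 1.0 / 10_000 # start low; raise to at most 1/100 later

  def initialize(ratio: DEFAULT_RATIO, rng: Random.new)
    @ratio = ratio
    @rng = rng
  end

  # True for roughly `ratio` of calls.
  def sample?
    @rng.rand < @ratio
  end

  # Log payload for one sampled query execution.
  def payload(sql:, duration_ms:, worker: nil, request_path: nil)
    {
      'sql_fingerprint' => fingerprint(sql),
      'duration_ms'     => duration_ms,
      'sidekiq_worker'  => worker,
      'request_path'    => request_path,
      'backtrace'       => caller
    }
  end

  private

  # Crude fingerprint: strip numeric and string literals, hash the remaining
  # query shape so identical queries with different values match.
  def fingerprint(sql)
    normalized = sql.gsub(/'[^']*'/, '?').gsub(/\b\d+\b/, '?')
    Digest::SHA256.hexdigest(normalized)[0, 16]
  end
end
```

In a Rails app this would hang off an `ActiveSupport::Notifications` subscriber for the `sql.active_record` event, calling `sample?` per query and emitting `payload` to structured logs when it returns true.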
We can then correlate these sampled stacktraces, keyed by query fingerprint, with data from pg_stat_statements to attribute a percentage of total database load to each web endpoint and Sidekiq worker.
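The attribution step could be sketched as follows: group the sampled log entries by (fingerprint, callsite), compute each callsite's share of samples, and multiply that share by the query's total load from pg_stat_statements. The entry shape and `callsite_shares` helper are assumptions for illustration:

```ruby
# Hypothetical aggregation over parsed sampled log entries. Each entry carries
# the query fingerprint plus the Sidekiq worker or request path it came from.
# Returns each (fingerprint, callsite) pair's fraction of the samples; that
# fraction times the query's total_exec_time in pg_stat_statements estimates
# the callsite's contribution to database load.
def callsite_shares(entries)
  grouped = entries.group_by { |e| [e[:fingerprint], e[:worker] || e[:path]] }
  grouped.transform_values { |group| group.size.to_f / entries.size }
end

sampled = [
  { fingerprint: 'a1b2', worker: 'PostReceive' },
  { fingerprint: 'a1b2', worker: 'PostReceive' },
  { fingerprint: 'a1b2', path: '/api/v4/projects' }
]

# PostReceive accounts for two thirds of the samples for fingerprint "a1b2",
# the API endpoint for the remaining third.
callsite_shares(sampled)
```

This only estimates call counts; weighting each sample by its logged duration would estimate time share instead, which is closer to actual load.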