Discussion: Share slow database query information with stage groups
Background
Many of the corrective measures being put in place to address the database cpu saturation warnings are focusing on the problem from the Infrastructure perspective. There are some issues raised to look at SQL optimization and specific queries, but these have been done by a central team with limited time. There may be other optimizations to be had from the Application perspective.
The problem is that the database is a shared resouce but the onus is on a central team to monitor, optimize and scale that service. It may be more efficient for some of these tasks to be shared out and handled by the stage group teams who hold the domain knowledge about how they are using the database. However, stage group teams don't know where to look for information and they don't know what information they need to act on.
What already exists?
- We already have the mapping between stage groups and feature categories, and we have the mapping between feature categories and endpoints.
- The endpoint_id is stored on each request and can be found in the logs.
- More information about mapping postgres statements is available here in the runbooks.
Proposal
For a first iteration, give stage groups access to information about slow queries that their feature categories are responsible for. At this stage it's unclear where the best place for this information would be.
Questions
I'd like us to discuss how practical it is to solve this problem and how beneficial this might be so that we can decide if this should be a project and who is best placed to implement this if we go head. Some questions to get the discussion started:
- What would we expect people to do with this information?
- Some people might ask why this isn't the same as the error budget information for the endpoint - how do we address that? Can they be the same?
- How much effort is the simplest thing?