Specific training for SREs: how to interpret slow queries and connect them to upstream traffic, controllers and endpoints

Following incident production#2885 (closed), we realised we could be faster and more effective at interpreting slow queries and specific query patterns running in our Postgres DB cluster, so that we can take mitigating action as soon as possible.

More specifically, this training should include:

  • Identifying network traffic coming from upstream (outside of the DB), for both the leader and the replicas.
  • Identifying and interpreting slow queries on the leader and the replicas quickly.
  • Checking the logs for errors and correlating them with the above.
  • Correlating slow DB queries with the application queries or controllers causing them.
  • (complete list)
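
As an illustration of the last point, here is a minimal sketch of how slow-query log lines could be grouped by the controller and action that issued them. It assumes two things that are not stated in this issue: that the log lines have the `duration: <ms> ms  statement: <sql>` shape PostgreSQL emits when `log_min_duration_statement` is set, and that the application appends a SQL comment of comma-separated `key:value` pairs (for example `controller` and `action`) to each query. The function names and the threshold are hypothetical and only for illustration.

```python
import re
from collections import Counter

# Assumed log shape: "duration: <ms> ms  statement: <sql>" (or "execute <name>: <sql>").
DURATION_RE = re.compile(
    r"duration: (?P<ms>\d+(?:\.\d+)?) ms\s+(?:statement|execute [^:]*): (?P<sql>.*)"
)
# Assumed annotation: a SQL comment such as /*controller:projects,action:show*/.
COMMENT_RE = re.compile(r"/\*(?P<body>[^*]*)\*/")


def parse_slow_query(line):
    """Return duration, annotation tags and SQL for one log line, or None."""
    match = DURATION_RE.search(line)
    if match is None:
        return None
    sql = match.group("sql")
    tags = {}
    comment = COMMENT_RE.search(sql)
    if comment:
        for pair in comment.group("body").split(","):
            key, sep, value = pair.partition(":")
            if sep:
                tags[key.strip()] = value.strip()
    return {
        "ms": float(match.group("ms")),
        "tags": tags,
        "sql": COMMENT_RE.sub("", sql).strip(),
    }


def slowest_endpoints(lines, threshold_ms=1000.0):
    """Sum time spent in slow queries per controller#action, slowest first."""
    totals = Counter()
    for line in lines:
        query = parse_slow_query(line)
        if query is None or query["ms"] < threshold_ms:
            continue
        endpoint = "{}#{}".format(
            query["tags"].get("controller", "unknown"),
            query["tags"].get("action", "unknown"),
        )
        totals[endpoint] += query["ms"]
    return totals.most_common()
```

Calling `slowest_endpoints(open(path))` on a log file would return a list of `(endpoint, total_ms)` pairs. Queries without an annotation land under `unknown#unknown`, which is itself a useful signal that the source of the traffic has to be traced another way.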

We already have runbooks covering some of these topics, so we should start from those and then address the remaining gaps.

Edited by Henri Philipps