Make it easier for developers to solve SQL performance problems
Solving SQL performance problems can be quite tricky. For one, you need good knowledge of SQL, but you also need things such as:
- Realistic (ish) testing data
- An environment that matches production
- Good tools for obtaining and reading explain plans, profiler output, etc
While GitLab has a variety of tools available today (Grafana, URL profilers, the performance bar, etc), I think knowledge is still somewhat lacking. For example, of our 229 engineers (not all of whom are backend engineers), I'm fairly confident that fewer than 20 know how to read query plans, and fewer still know how to use that knowledge to improve queries.
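For reference, "reading a query plan" means interpreting PostgreSQL `EXPLAIN` output. A minimal sketch, using a hypothetical `users` table with made-up cost and timing numbers:

```sql
-- Hypothetical query: EXPLAIN ANALYZE runs it and shows how PostgreSQL executes it.
EXPLAIN ANALYZE
SELECT *
FROM users
WHERE username = 'alice';

-- Example output (numbers are made up). A "Seq Scan" means the whole table is
-- read and filtered row by row; an index on "username" would typically turn
-- this into an "Index Scan":
--
--  Seq Scan on users  (cost=0.00..4.50 rows=1 width=120)
--                     (actual time=0.020..0.080 rows=1 loops=1)
--    Filter: ((username)::text = 'alice'::text)
--    Rows Removed by Filter: 199
--  Planning time: 0.100 ms
--  Execution time: 0.120 ms
```

Knowing which plan nodes are expensive (and why) is the skill that's currently concentrated in a handful of people.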
One of the first steps we took to spread this knowledge was documenting how to read EXPLAIN plans. While helpful, this only covers explain plans. Some other ideas for spreading knowledge further:
- Training specific individuals, then having them train others (I can't possibly train 100-something backend engineers)
- Recording 20 minute screencasts that show how I would solve a particular problem
- Compiling a list of "approved" resources (e.g. books), and reimbursing developers for the cost of acquiring those resources
- Changing hiring requirements to require knowledge of SQL and experience with MySQL/PostgreSQL (the exact requirements would vary between junior/intermediate/senior/etc)
- Training 1-2 people per team, who then become that team's SQL/DB specialists
Since my opinion on these ideas changes daily, I think the first step is to figure out how many developers need training. Based on that, we can decide which approach works best.
Problems identified so far
- Lack of visibility for documentation: we have a lot of documentation, but it's not always easy to find the right page.
- MR https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/21797 cleans up some titles a little bit
- Lack of awareness/visibility about the monitoring tools we have, and how to use them
- Lack of sufficient test data in local development environments
- Issues don't always contain enough data
- Our performance requirements are ill-defined. For the DB we had the unwritten goal of at most 100 ms of SQL time per request, but for many other things we use SiteSpeed and lord knows what else.
- Documentation titles are confusing (e.g. Bullet docs are hidden under "Profiling")
- Reviewers need some kind of quick checklist they can use when reviewing changes
- It's not always clear when to use Arel versus raw SQL