Reduce SQL statement timeouts from 30 seconds to 15 seconds on February 1st, 2018
In https://gitlab.com/gitlab-com/infrastructure/issues/2216 we investigated dropping SQL statement timeouts from 60 to 30 seconds, then from 30 to 15 seconds. Using a timeout of 15 seconds led to a few useful pages timing out, hence we increased it again to 30 seconds.
Because 30 seconds is beyond ludicrous, I want to permanently set the timeout back to 15 seconds on February 1st, 2018. This gives developers 2 months to fix the pages that we know timed out, which should be plenty of time. The only exception would be for important pages timing out for lots of people, or something else equally serious.
This may come across as harsh, but letting these queries run for up to 30 seconds indefinitely wastes resources on them, which in turn can make other queries perform worse. We've delayed setting a stricter timeout for far too long, so I really want to put an end to this.
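For reference, a minimal sketch of how the new limit could be applied and verified on a PostgreSQL database (on GitLab.com the actual change goes through our configuration management, which isn't shown here):

```sql
-- Check the statement timeout currently in effect for this session.
SHOW statement_timeout;

-- Apply a 15 second statement timeout to all new connections to the
-- production database. Existing sessions keep their old setting until
-- they reconnect.
ALTER DATABASE gitlabhq_production SET statement_timeout = '15s';

-- Any statement exceeding the limit is then cancelled with:
--   ERROR: canceling statement due to statement timeout
```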
When
February 1st, 2018 around 12:00 UTC.
Who
At least the following people:
- @yorickpeterse
- @abrandl
- @_stark (if he's around / available due to FOSDEM)
Checklist / Steps
- Collect the graphs that will help us understand the impact of this change, add links into this issue:
  - https://log.gitlap.com/goto/65714b407532a3fe50daaa596c79a46b
  - https://prometheus.gitlab.com/graph?g0.range_input=6h&g0.expr=sum(irate(pg_stat_database_xact_commit%7Bdatname%3D%22gitlabhq_production%22%2C%20environment%3D%22prd%22%2C%20tier%3D%22db%22%2C%20type%3D%22postgres%22%7D%5B1m%5D))%20&g0.tab=0&g1.range_input=6h&g1.expr=sum(irate(pg_stat_database_xact_rollback%7Bdatname%3D%22gitlabhq_production%22%2C%20environment%3D%22prd%22%2C%20tier%3D%22db%22%2C%20type%3D%22postgres%22%7D%5B1m%5D))&g1.tab=0
- Update the secondaries one by one, with a bit of time between them.
- Assuming all secondaries are behaving nicely, update the primary: done around 13:05 UTC.
- Monitor the graphs collected in step 1 to see the impact (see the query sketch after this list).
- Decide if we can keep the timeout or need to (temporarily) roll back.
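The Prometheus graphs above track commit and rollback rates from `pg_stat_database`. The same counters can also be inspected directly on the database; a sketch (these are cumulative counters, so sample twice and diff to get a rate, like `irate()` does in Prometheus):

```sql
-- Cumulative commit/rollback counters for the production database.
-- A timed-out statement aborts its transaction, so a spike in timeouts
-- shows up as a rising rollback rate.
SELECT datname,
       xact_commit,
       xact_rollback
FROM pg_stat_database
WHERE datname = 'gitlabhq_production';
```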
Current Timeouts
Per Grafana we hover between 200 and 500 slow queries (those taking longer than 5 seconds) per hour, while the total number of queries per hour is around 130 million (at peak). This means that at peak roughly 0.0004% of queries (500 out of 130 million) may time out per hour. Kibana/logstash shows that in the last 7 days the highest number of timeouts on a single day was 632.
In short: very few queries are timing out at the moment.
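For anyone who wants to see which queries sit close to the limit, a sketch using the `pg_stat_statements` extension (assuming it is enabled; this may not be where the Grafana numbers come from):

```sql
-- Statements whose average runtime is high, i.e. the ones counted as
-- "slow" above and the first candidates to hit a 15 second statement
-- timeout. Times are reported in milliseconds.
SELECT calls,
       round(mean_time::numeric, 1) AS mean_ms,
       round(max_time::numeric, 1)  AS max_ms,
       left(query, 80)              AS query_start
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 20;
```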
FAQ
Why February 1st?
Because when we decided upon a date for the deadline, this gave us roughly 2 months (2 months and 1 week, to be precise) to solve these issues, which means 2 releases. Further, 2 months should be more than enough.
Will you reconsider the deadline?
Only if GitLab.com's availability degrades as a result (e.g. 30% of all requests start timing out).
Why will you not reconsider the deadline?
Because a deadline is useless if you just push it back the moment people realise they're not going to make it. On top of that, 2 months should be more than enough, and what can't be fixed in 2 months most likely isn't going to be fixed in 6 months (or even more) either, for a simple reason: it means the work isn't getting prioritised / picked up / etc.
Why does this deadline exist?
Part of 2018's DB plan is to change how GitLab interacts with the database and how we solve database issues. In the past we were very passive: we'd basically wait for a problem, then try to fix it while everybody else continued business as usual. Starting with this deadline we'll be moving to a setup where the database team (and to a certain extent the production team) defines the rules / requirements for interacting with the database and, most importantly of all, enforces them. This is all done to ensure that GitLab.com actually becomes fast in 2018, as previous workflows / attempts were not as successful as they should have been.
When/how was this announced?
This plan was shared with the CI, Platform, and Discussion team leads when the issue was created (see comments below); these teams were responsible for the work, while other teams such as e.g. Frontend simply didn't have any timeout issues to deal with. These leads were also notified one month ahead of time, and again a little while after that.
Are there any follow-up plans?
Eventually we want to drop the statement timeout even further, e.g. to 5 seconds. There's no exact date for this yet.
What about Sidekiq?
Sidekiq will have to follow the same rules, as otherwise it could still harm the system. Giving Sidekiq a dedicated timeout is also tricky because it connects to the database via pgbouncer. Since this is basically cheating ("it's in the background so it's OK to be slow as a snail and potentially harm the system") we won't bother with this unless absolutely necessary.
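For illustration, PostgreSQL can attach a different `statement_timeout` to a dedicated role, which is roughly what a Sidekiq-specific timeout would look like. The role name below is hypothetical, and because Sidekiq's connections go through pgbouncer (presumably sharing pools with other clients), such an override would be hard to apply to Sidekiq's statements alone, which is the trickiness mentioned above:

```sql
-- Hypothetical: give a dedicated Sidekiq role a more lenient timeout.
-- The 'gitlab_sidekiq' role is made up for illustration; it is not how
-- GitLab.com is actually configured.
ALTER ROLE gitlab_sidekiq SET statement_timeout = '60s';

-- Per-role settings are applied when a backend starts for that role.
-- If all clients share the same pgbouncer pools and database user,
-- there is no separate role to attach this setting to.
```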