Database queries appear to be slow as the result of what appears to be an unreliable network connection
When poking around Grafana as part of https://gitlab.com/gitlab-org/gitlab-ce/issues/36878 I noticed that method calls to User.find_by_any_email
consistently take around 100 milliseconds:
This caught my attention because I knew the method's query is supposed to be very fast (I wrote it). Running these queries (using different Emails) and looking at the output of EXPLAIN (ANALYZE, BUFFERS)
confirmed that the queries themselves are fine. In psql
they would always complete in a millisecond or two, but when running those same queries from the deploy node they would sometimes take as much as 100 milliseconds to complete.
To narrow down the problem I wrote the following code:
def measure(email)
sql = User.by_any_email(email).select('users.*, now() AS current_time').to_sql
start = Time.zone.now
row = ActiveRecord::Base.connection.execute(sql).first
stop = Time.zone.now
if row
row_time = Time.parse(row['current_time'])
send_time = (row_time - start) * 1000.0
receive_time = (stop - row_time) * 1000.0
total_time = (stop - start) * 1000.0
return if total_time <= 10.0
puts "Send time: #{send_time} ms"
puts "Receive time: #{receive_time} ms"
puts "Total time: #{total_time} ms"
puts
end
end
This method attempts to get a rough estimate of the time spent in sending the query, executing it, and receiving the results. This doesn't include overhead of parsing rows into Rails models as I found that adding this doesn't really influence the timings.
With this method I ran my test as follows
1000.times { measure('sytse@gitlab.com') }
This then produces:
Send time: 15.220011 ms
Receive time: 2.027763 ms
Total time: 17.247774 ms
Send time: 0.77491 ms
Receive time: 9.964436000000001 ms
Total time: 10.739346000000001 ms
Send time: 6.514159 ms
Receive time: 4.4296880000000005 ms
Total time: 10.943847 ms
Send time: 4.665090999999999 ms
Receive time: 6.047655 ms
Total time: 10.712746000000001 ms
Send time: 0.610641 ms
Receive time: 102.216798 ms
Total time: 102.82743900000001 ms
Send time: 0.731703 ms
Receive time: 198.225447 ms
Total time: 198.95715 ms
Send time: 2.616959 ms
Receive time: 299.19583 ms
Total time: 301.812789 ms
Send time: 0.747465 ms
Receive time: 10.587084 ms
Total time: 11.334548999999999 ms
Here we can see that in all cases there's very little time between sending the query and it getting executed. However, in the worst case there can be up to 300 milliseconds spent in receiving the result and parsing it into a basic Ruby Hash (without any type casting).
This leads me to believe this query (and possibly many others) are slowed down by one of the following factors:
- Network overhead (e.g. perhaps our network is not powerful enough)
-
pgbouncer not having enough connections (this seems unlikely since the primary connections hardly ever go above 20, but it's possible)- deploy node connects to the DB directly
- PostgreSQL somehow not being able to send the results back fast enough
- Slow disks for the primary, delaying the time it takes to fetch data
Unfortunately this means there's still a lot of possible factors that may lead to slower queries, though at least in this case it does not appear to be the query itself.