DB high load on May 25 - outage
What happened
- At 9:58 UTC we got an alert for high load on DB4 (over 300)
- We ssh'd into the DB server
- We opened a hangout; @jacobvosmaer-gitlab, @jnijhof, and I joined in.
- We gathered data and queries (lots of queries)
- We asked @yorickpeterse to join the hangout as he was also ssh'd into the server.
- After checking with @yorickpeterse through Slack whether he was already taking action or we could go ahead, we decided to restart the DB.
- The restart did not decrease the load; after a bit we saw high load again. (~10:14 UTC)
- We thought the issue could be that the issues table had not been vacuumed (all the queries were against the issues table)
- We started a manual vacuum on the issues table. (~10:22 UTC)
- After a while the vacuum was not making progress and the load was still high.
- We bounced the DB again and started the vacuum right away.
- After vacuuming just 2 indexes, the process again stopped making progress.
- We decided to bring up the deploy page to cut requests for a while (~10:49 UTC)
- The vacuuming progressed and finished fine.
- We took the deploy page down (~10:54 UTC)
- Since then the load has been normal; we monitored for a couple of minutes. (~10:59 UTC)
- We then re-enabled auto-vacuum along with auto-vacuum logging so we can monitor it and rule it out completely (~11:06 UTC); see the sketch after this list for roughly what the vacuum and auto-vacuum commands look like.
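
For reference, a minimal sketch of what the manual vacuum and the auto-vacuum re-enable look like on PostgreSQL. This is illustrative, not a transcript of the exact commands we ran; `ALTER SYSTEM` assumes PostgreSQL 9.4+, on older versions the same settings go in `postgresql.conf` followed by a reload:

```sql
-- Manual vacuum of the issues table; VERBOSE prints per-index progress,
-- which is how a stall after two indexes becomes visible.
VACUUM (VERBOSE, ANALYZE) issues;

-- Re-enable auto-vacuum and log every auto-vacuum run so we can
-- correlate it with load (ALTER SYSTEM requires PostgreSQL 9.4+).
ALTER SYSTEM SET autovacuum = on;
ALTER SYSTEM SET log_autovacuum_min_duration = 0;
SELECT pg_reload_conf();
```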
Assumptions & Guesses
- We can't blame auto-vacuum as the source of the problem because it had been manually disabled.
- There was a spike in requests at the load balancers; after a brief discussion we are assuming it is a consequence of the contention happening in the database (requests are being queued)
- We suspect there may be a set of queries locking the database, but we don't have strong enough evidence of that yet; one way to gather that evidence is sketched below, after this list.
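
A hedged sketch of one way to gather that evidence: pair each session that is waiting on a lock with a session holding a conflicting lock on the same object. This is a simplified version of the usual `pg_locks` monitoring query and assumes PostgreSQL 9.2+ (where `pg_stat_activity` exposes `pid` and `query`):

```sql
-- Sessions waiting on a lock, paired with the sessions holding it.
SELECT waiting.pid          AS waiting_pid,
       waiting_act.query    AS waiting_query,
       holding.pid          AS holding_pid,
       holding.mode         AS holding_mode,
       holding_act.query    AS holding_query
FROM pg_locks waiting
JOIN pg_stat_activity waiting_act ON waiting_act.pid = waiting.pid
JOIN pg_locks holding
  ON  holding.locktype      = waiting.locktype
  AND holding.database      IS NOT DISTINCT FROM waiting.database
  AND holding.relation      IS NOT DISTINCT FROM waiting.relation
  AND holding.transactionid IS NOT DISTINCT FROM waiting.transactionid
  AND holding.pid          <> waiting.pid
  AND holding.granted
JOIN pg_stat_activity holding_act ON holding_act.pid = holding.pid
WHERE NOT waiting.granted;
```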
Things that worked well
- The deploy page cut the load right away and allowed the database to finish whatever was locking it. We should do this sooner.
- Having a script in the root partition to dump slow in-flight queries helped; we should have more scripts like this and document them (a sketch of the kind of query such a script runs follows this list).
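
A minimal sketch of the kind of query such a script can run (the actual script on the server may differ); assumes PostgreSQL 9.2+ for the `state` and `query` columns:

```sql
-- Dump non-idle queries that have been running for more than 5 seconds,
-- longest-running first.
SELECT pid,
       now() - query_start AS runtime,
       state,
       query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '5 seconds'
ORDER BY runtime DESC;
```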
Things that we can improve
- We didn't have a quick way of setting a rate limit; we discussed it but abandoned the idea.
- We don't have a good way of telling whether we are being hammered by a bot or whether all the requests are coming from the same IP. We need to improve our monitoring of the path between the load balancers and the application: right now we have to check in way too many places to understand what's going on.
Conversation in infrastructure channel
https://gitlab.slack.com/archives/infrastructure/p1464170095000697
Graphs
Systems metrics
- DB4 load spike
- LB requests sample
Performance metrics
- Global requests served
- Rails requests served
- Rails timings
- API requests served
- API timings
- Sidekiq transactions
cc/ @stanhu