Large numbers of enqueued PagesDomainVerificationWorker jobs
"Sequential scans on pages_domains"
A fix for the first has been written and is going through review. It should help with the second by allowing a subset of domains to become verified, reducing the frequency with which they're scheduled from every 15 minutes to every 4 days or so.
EXPLAIN for: SELECT "pages_domains".* FROM "pages_domains" WHERE ("pages_domains"."verified_at" IS NULL OR ("pages_domains"."enabled_until" IS NULL OR "pages_domains"."enabled_until" < '2018-03-30 10:21:47.547930'))

                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on pages_domains  (cost=0.00..689.21 rows=14411 width=137)
   Filter: ((verified_at IS NULL) OR (enabled_until IS NULL) OR (enabled_until < '2018-03-30 10:21:47.54793+00'::timestamp with time zone))
(2 rows)
(find_each just calls this scope repeatedly, batching over the primary key with an id-based cursor and LIMIT).
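For context, here is a minimal sketch of what the scope behind that SQL presumably looks like; the structure is an assumption based on the query above, not the actual definition in app/models/pages_domain.rb:

```ruby
class PagesDomain < ActiveRecord::Base
  # Hypothetical reconstruction of the scope that produces the query above.
  scope :needs_verification, -> do
    where('verified_at IS NULL OR (enabled_until IS NULL OR enabled_until < ?)', Time.current)
  end
end
```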
So.... why's it doing a seq scan? What's missing? Adding a bare enabled_until index doesn't change things.
For the large number of enqueued jobs, I think we need to introduce some form of backoff. Retrying every 15 minutes forever isn't a good solution. I suggest:
Retry every 15 minutes for the first 6 hours
Go to hourly retries for the next 18 hours
Go to daily retries after that
We can reset this every time a state change happens (e.g., verified -> unverified, enabled -> disabled, or vice-versa); see the sketch below.
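Assuming we track when the domain last changed state, the schedule could be computed roughly like this (the method and attribute names are hypothetical, not existing GitLab code):

```ruby
# Hypothetical backoff calculation for re-enqueueing PagesDomainVerificationWorker.
# `state_changed_at` is assumed to be reset whenever the domain flips between
# verified/unverified or enabled/disabled.
def verification_retry_interval(state_changed_at, now = Time.current)
  elapsed = now - state_changed_at

  if elapsed < 6.hours
    15.minutes # every 15 minutes for the first 6 hours
  elsif elapsed < 24.hours
    1.hour     # hourly retries for the next 18 hours
  else
    1.day      # daily retries after that
  end
end

# Hypothetical usage when scheduling the next attempt:
# PagesDomainVerificationWorker.perform_in(
#   verification_retry_interval(domain.state_changed_at), domain.id)
```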
Once verified, a domain is exempted from retries for four straight days, but it seems we have large numbers of domains that are disabled, and there's no incentive to remove them from the database at present.
So that could be an alternative - a cleanup worker that removes them if they're disabled for more than a week.
We currently have 11,759 domains in the needs_verification scope. 10,438 of them have enabled_until: nil, indicating that they're currently disabled...
Since there are < 20,000 rows in the pages_domains table, PagesDomain.needs_verification is super-fast for me. Perhaps the planner simply disdains the indexes because of the size of the dataset?
The cleanup worker has a pleasing characteristic in that if someone tries to take control of a domain they don't own, it prevents them from keeping control of it (in the sense of it being associated with their GitLab project) indefinitely. Removal would be associated with an email and a log entry, as disablement currently is; we'd probably have to add a remove_at field to drive it.
@nick.thomas It's probably doing a sequence scan because WHERE x IS NULL can't use an index (unless you have a partial index on WHERE x IS NULL). enabled_until is at the end of a composite index, which also means PostgreSQL can't use it.
The verified_at part is something I don't see us being able to optimise because there are no other conditionals to AND on top of it. The enabled_until column is also hard to optimise. If we end up always using a WHERE on id for pagination purposes you could add an index like id WHERE verified_at IS NULL, but I'm not sure how beneficial this would be in this case.
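For illustration, such a partial index might be added with a migration along these lines; the migration and index names are assumptions, not an agreed change:

```ruby
# Hypothetical migration sketching the "id WHERE verified_at IS NULL" index
# suggested above; not an actual migration in the codebase.
class AddPartialIndexOnPagesDomainsUnverified < ActiveRecord::Migration
  disable_ddl_transaction!

  def change
    # Partial index covering only unverified domains, keyed by id so that
    # id-based pagination over the needs_verification scope can use it.
    add_index :pages_domains, :id,
              where: 'verified_at IS NULL',
              name: 'index_pages_domains_on_id_where_unverified',
              algorithm: :concurrently
  end
end
```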
In both cases a simple "hack" would be to use a date far in the past (e.g. 01-01-1900) instead of NULL values. This would allow us to re-use the regular indexes.
I don't think this is a case where UNION is actually helpful. Postgres is clever enough to use two different indexes on the same table and union the results using a BitmapOr plan node. (It's not clever enough to do a join for one side of the union but in this case all the information it needs is in indexes on this table).
I think there just isn't an index available that can handle the second half of that OR at all. Or, if there is, Postgres is deciding that too large a portion of the table is going to be read and it'll be quicker to do a sequential scan.
It's probably doing a sequence scan because WHERE x IS NULL can't use an index (unless you have a partial index on WHERE x IS NULL).
This is not true. Postgres can use indexes to satisfy IS NULL and will if it thinks it's a small enough portion of the table. It was true once (sometime back around 7.3 or so) but it definitely isn't any more:
stark=# explain select * from words where t is null;
┌──────────────────────────────────────────────────────────────────┐
│                            QUERY PLAN                             │
├──────────────────────────────────────────────────────────────────┤
│ Index Scan using wt on words  (cost=0.42..4.44 rows=1 width=14)   │
│   Index Cond: (t IS NULL)                                          │
└──────────────────────────────────────────────────────────────────┘
(2 rows)
enabled_until is at the end of a composite index, which also means PostgreSQL can't use it.
I think this is the real problem.
It's also possible that there is an index it can use, but due to the other problems with the code, Postgres just thinks too large a portion of the table needs to be read and it will be quicker to read it in a sequential scan. If that's a common situation then planning to page through all the ids in the table may be a more scalable approach, in which case you don't need any indexes other than id.
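A rough sketch of that id-only paging approach; the batch size and the worker call are illustrative assumptions:

```ruby
# Hypothetical id-only pagination: walk pages_domains in primary-key slices so
# only the primary-key index is needed, then filter each slice with the scope.
batch_size = 1_000
last_id = 0

loop do
  ids = PagesDomain.where('id > ?', last_id).order(:id).limit(batch_size).pluck(:id)
  break if ids.empty?

  PagesDomain.needs_verification.where(id: ids).each do |domain|
    PagesDomainVerificationWorker.perform_async(domain.id)
  end

  last_id = ids.last
end
```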
@_stark adding a separate index just on enabled_until didn't cause PostgreSQL to switch away from the seq scan, so I suspect the query planner is simply happy with the table size.
I'm all for not bothering to address it at present and focusing on the second (non-database-related) bullet point if you're happy with the current queries and plans?
The cleanup worker has a pleasing characteristic in that if someone tries to take control of a domain they don't own, it prevents them from keeping control of it (in the sense of it being associated with their GitLab project) indefinitely. Removal would be associated with an email and a log entry, as disablement currently is; we'd probably have to add a remove_at field to drive it.
@nick.thomas it sounds good to me; we can set this to a reasonable value and avoid "infinite" waits.
@bikebilly awesome, can this be scheduled via the usual means? I've got a couple of ~Geo issues that really need my immediate attention. I'm not sure how much pain the current behaviour is causing GitLab.com at present - can you comment, @northrup?
This item has been automatically determined to have 4 or fewer upvotes and is, for now, being marked as awaiting further demand. If you feel this is incorrect, please comment on the issue and I'll be happy to take a look.
@jlenny one thing noticed during a GCP Migration production dry run is that the Sidekiq jobs for verification can take up to 10 minutes to execute. This is much longer than I ever expected, and probably part of the problem.
Just checking the differences between the two environments - we mostly use 8.8.8.8 and 8.8.4.4 on both environments (there's one other IP that differs between the two) so there shouldn't be a change.
I noticed this when looking into various performance/latency issues with Unicorn.
This cron job, with no randomized delay or rate limiting, causes spikes of 5-10k Sidekiq jobs in a 30-second window. This is mostly hitting the best-effort queue, so it's not a big problem, but it would make sense to spread this work out over time.
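One way to spread it out, sketched under the assumption that the cron worker enqueues one verification job per domain (the worker names follow the discussion above; the jitter window is an assumption):

```ruby
# Hypothetical cron worker change: instead of enqueueing every verification job
# at once, give each one a random delay so the work is smeared across the window.
class PagesDomainVerificationCronWorker
  include Sidekiq::Worker

  SPREAD = 15.minutes # assumed cron interval, used as the jitter window

  def perform
    PagesDomain.needs_verification.find_each do |domain|
      PagesDomainVerificationWorker.perform_in(rand(SPREAD.to_i), domain.id)
    end
  end
end
```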
Once verified, a domain is exempted from retries for four straight days, but it seems we have large numbers of domains that are disabled, and there's no incentive to remove them from the database at present.
So that could be an alternative - a cleanup worker that removes them if they're disabled for more than a week.
We currently have 11,759 domains in the needs_verification scope. 10,438 of them have enabled_until: nil, indicating that they're currently disabled...
Deleting custom domains that match the second query when they've not been updated for over a week should be quite simple to add to the existing worker, and will reduce the volume of verification attempts to somewhere much closer to zero.
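A hedged sketch of what that could look like; the one-week cutoff comes from the discussion above, while the scope, column names, and worker shape are assumptions:

```ruby
# Hypothetical cleanup pass: remove domains that are disabled (enabled_until is
# NULL) and haven't been updated for more than a week. This could live in the
# existing cron worker; it's shown standalone here for clarity.
class PagesDomainRemovalCronWorker
  include Sidekiq::Worker

  def perform
    PagesDomain
      .where(enabled_until: nil)
      .where('updated_at < ?', 1.week.ago)
      .find_each do |domain|
        domain.destroy! # removal would also send the email / write the log entry
      end
  end
end
```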
We could remove them directly on GitLab.com if this is causing us capacity or other issues.
This doesn't feel like a large amount of work. @jlenny what are the product/support implications of deleting custom domains that users have added? It seems like if a domain can't be verified for a week then the user doesn't care about it anymore, but I could be wrong.
We should deprecate it ASAP. I'm going to add it to the 11.8 blog post, mentioning we'll be deprecating this in %11.10. Please ping me if I should not do that.
FYI: this is mostly merged. We need to wait until the release gets to production, check that domains are being marked for removal properly, enable the feature flag, and then remove the feature flag.