Improve the performance of the geo node status endpoint
Calculating geo node status is a slow operation, consisting of a large number of slow SQL queries running sequentially.
These SQL queries are slow because they have to cross-reference data in the "main" SQL database with data in the ~Geo tracking database. This necessarily involves transferring hundreds of thousands or millions of rows in every query with the current approach.
There are long-term solutions to the problem, but I think there's a short-term optimization we can implement in the meantime.
Currently, we're pulling the interesting IDs via FDW once per query. Instead, we could create a snapshot of the data (using a materialized view, a temporary table, or similar) at the start of the status check, perform all the queries against that snapshot, and then remove the snapshot at the end.
For instance, we're currently using the list of projects with disabled wikis in multiple queries: every query that looks at the status of wikis (synced, unsynced, failed, and verified, at first thought) needs this data, which is around 160K project IDs.
Assuming these snapshot tables work normally within transactions, we could do something like:
```sql
BEGIN;

-- FDW case
CREATE TEMPORARY TABLE disabled_wikis_xxx
ON COMMIT DROP
AS (
  SELECT project_id
  FROM gitlab_secondary.project_features
  WHERE wiki_access_level = 0 OR wiki_access_level IS NULL
);
-- In the legacy case, use the cross-database pluck to populate this table instead

-- These are all the same whether using legacy or FDW queries \o/

-- synced
SELECT count(*) FROM project_registry ...WHERE NOT EXISTS (SELECT 1 FROM disabled_wikis_xxx WHERE disabled_wikis_xxx.project_id = project_registry.project_id);

-- unsynced
SELECT count(*) FROM project_registry ...WHERE NOT EXISTS (SELECT 1 FROM disabled_wikis_xxx WHERE disabled_wikis_xxx.project_id = project_registry.project_id);

-- failed
SELECT count(*) FROM project_registry ...WHERE NOT EXISTS (SELECT 1 FROM disabled_wikis_xxx WHERE disabled_wikis_xxx.project_id = project_registry.project_id);

-- verified
SELECT count(*) FROM project_registry ...WHERE NOT EXISTS (SELECT 1 FROM disabled_wikis_xxx WHERE disabled_wikis_xxx.project_id = project_registry.project_id);

-- COMMIT drops the temporary table
COMMIT;
```
...and so on.
We might be able to move from `WHERE NOT EXISTS` to a more conventional join with this approach, at least for the wiki case.
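As a sketch, the join form might look like this anti-join (assuming the same hypothetical `disabled_wikis_xxx` snapshot table; the elided registry conditions would still apply):

```sql
-- Count registry rows with no matching disabled-wiki entry
SELECT count(*)
FROM project_registry
LEFT JOIN disabled_wikis_xxx
  ON disabled_wikis_xxx.project_id = project_registry.project_id
WHERE disabled_wikis_xxx.project_id IS NULL;
```

Whether the planner prefers this over `NOT EXISTS` against a small temporary table would need checking with `EXPLAIN ANALYZE`.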
The general form should be valid for all the other count queries we run during a status check. So we create a few temporary tables, perform a bunch of SELECTs with different conditions, then throw the tables away.
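For the legacy (non-FDW) case mentioned in the comment above, a hypothetical population step could define the table explicitly and insert the plucked IDs in batches (the IDs shown are placeholders):

```sql
-- Legacy case: IDs are plucked from the main database in Ruby,
-- then inserted into the snapshot table in batches
CREATE TEMPORARY TABLE disabled_wikis_xxx (project_id integer) ON COMMIT DROP;
INSERT INTO disabled_wikis_xxx (project_id) VALUES (1), (2), (3);
```

The count queries that follow would be identical in both modes; only this setup step differs.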
All this means that status counts become:
- More consistent, since we're operating on a snapshot of the data for all queries
- Faster, since we only have to pull each item of data once
- More consistent between legacy and FDW, since we can use the same queries in the selects, and just create the lookup table differently
Throwing away the table after every status query means we don't have to address the invalidation questions raised in https://gitlab.com/gitlab-org/gitlab-ee/issues/5398. Improving the performance of the "find projects or files to work on" queries can be solved separately, at which point we just remove the temporary tables from these queries in favour of the newly-added ones.