### Automatic reindexing of standard indexes
Due to Postgres' MVCC, physical database storage tends to get bloated over time. On GitLab.com, we see index bloat build up over time, whereas table bloat is usually not a problem. Below we can see index and table bloat in GB over a period of 3 months.
During this period, index bloat roughly doubled and currently totals about 500 GB. This is 500 GB we don't need and, worse, it impacts performance: pages have to be read and kept in memory unnecessarily, and indexes are much larger than they need to be.

A countermeasure is to run index repacking regularly. This employs `pg_repack` (we have a bit of automation around it with [`gitlab-pgrepack`](https://gitlab.com/gitlab-com/gl-infra/gitlab-pgrepack)), which is able to repack indexes (including primary key/unique indexes) and tables in the background. It typically needs operational supervision because it can cause lock build-ups, so it should be watched closely (although we haven't had a bad situation coming from it yet). Organizing and scheduling these repacking sessions doesn't come for free, and they add regular operational overhead, too.
#### Proposal
The proposal here is to implement automatic background reindexing in the application. There are two options for this:
1. `PG < 12`: Use concurrent index rebuilds (`CREATE INDEX CONCURRENTLY`) and target non-unique/non-primary-key indexes only.
1. `PG >= 12`: Concurrent reindexing (`REINDEX CONCURRENTLY`) is supported, so we can target all indexes.
Since we're not on PG >= 12 yet, we can only start with (1) and exclude unique/primary key indexes here. Taking a look at [bloat estimation results](https://gitlab.com/snippets/1955728), we can see that only about 21% of bloat is attributed to unique/primary key indexes:
```
 is_unique |  bloat_size  |   real_size
-----------+--------------+---------------
 f         | 458754277376 | 1302555254784
 t         | 122086989824 |  486675537920
(2 rows)
```
By concentrating on regular indexes only, we can start doing index rebuilds in the background today and still target about 80% of the problem.
##### Rebuild process
We propose a background job that is executed at regular intervals (cronjob). This job performs bloat estimation (similar to https://gitlab.com/snippets/1955728) and schedules index rebuilds for selected indexes (e.g. the top 10 by bloat ratio or bloat size).
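As a size-based stand-in for the full bloat estimation, candidate selection could be sketched as follows; the `LIMIT` of 10 and the restriction to non-unique indexes are illustrative assumptions:

```sql
-- Pick the 10 largest non-unique user indexes as rebuild candidates.
SELECT s.indexrelid::regclass AS index_name,
       pg_relation_size(s.indexrelid) AS index_size
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE NOT i.indisunique
ORDER BY index_size DESC
LIMIT 10;
```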
A single index rebuild for an index `index` on table `table` takes the following steps (in a single background job):
1. `CREATE INDEX CONCURRENTLY` to rebuild the index in the background under a temporary name `temp`
1. Execute the script below in a transaction to swap the indexes
```sql
BEGIN;
LOCK TABLE table IN ACCESS EXCLUSIVE MODE;
ALTER INDEX index RENAME TO index_old;
ALTER INDEX temp RENAME TO index;
DROP INDEX index_old;
COMMIT;
```
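Step (1) itself is a plain concurrent build under the temporary name. The real job would derive the index definition from the catalog (e.g. `pg_get_indexdef()`); the column list here is a hypothetical example:

```sql
-- Step (1): build a fresh copy of the index in the background.
-- Column list is illustrative; the actual definition comes from the catalog.
CREATE INDEX CONCURRENTLY temp ON table (column);
```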
Locking: The DDL steps in the swap script (2) require `AccessExclusive` locks on the indexes and on the table owning the index. This conflicts with any reads, writes, and other DDL on the table. Once the lock is granted, though, the transaction finishes quickly and releases the lock.
In order to safeguard against the effects of the exclusive lock, we would employ a retry strategy with increasing levels of `lock_timeout` ([`WithLockRetries`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/gitlab/database/with_lock_retries.rb#L5)).
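A minimal sketch of such a retry strategy, modeled loosely on `WithLockRetries` (the timeout ladder and the error class are illustrative assumptions, not the actual implementation):

```ruby
# Illustrative sketch: attempt the index swap with a short lock_timeout
# first, then retry with increasingly generous timeouts.
class LockRetrier
  LOCK_TIMEOUTS = [0.1, 0.5, 1.0, 5.0].freeze # seconds per attempt (illustrative)

  class LockTimeoutError < StandardError; end

  # Yields once per attempt with the current lock_timeout (which the real
  # job would apply via `SET LOCAL lock_timeout`). Re-raises after the
  # final attempt so the caller can clean up (e.g. drop the new index).
  def run
    LOCK_TIMEOUTS.each_with_index do |timeout, attempt|
      begin
        return yield(timeout)
      rescue LockTimeoutError
        raise if attempt == LOCK_TIMEOUTS.size - 1
      end
    end
  end
end
```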
##### Scheduling
Index rebuilds generate load on the database and shouldn't all be done at once. For starters, we would run the scheduling once a day and perform a small number of index rebuilds spread out across the day. We use an exclusive Redis lock to guard against concurrently running index rebuilds.
##### Reliability
1. `CREATE INDEX CONCURRENTLY` can fail at any time and leave behind an `INVALID` index. This should be detected and recovered from (by dropping the invalid index).
1. If we fail to swap the indexes due to lock timeouts, we clean up by dropping the newly created index again.
1. Bloat estimation is a rather heavy process and it's unlikely we can use the [existing method](https://gitlab.com/snippets/1955728) directly. Perhaps this can be done iteratively, estimating each index's bloat separately. Another approach (for starters) is to skip the estimation and pick large indexes at random to rebuild. This lowers index bloat levels in the long run, too, though perhaps not as efficiently as a choice based on bloat estimates.
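Leftover invalid indexes from failed concurrent builds can be detected via the catalog, for example:

```sql
-- Find indexes left INVALID by a failed CREATE INDEX CONCURRENTLY;
-- these can be dropped and the rebuild retried.
SELECT c.relname AS index_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE NOT i.indisvalid;
```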
##### Dogfooding
We have used `pg_repack` from time to time on GitLab.com. That approach hasn't been automated, and it isn't directly available to self-hosted installations. With this proposal, self-hosted installations benefit from this feature as well, which increases database efficiency for everybody.
#### Outlook
PostgreSQL 12 not only adds the ability to concurrently reindex existing indexes, but also brings a lot of improvements to index storage design that are expected to have a significant effect on index bloat. It remains to be seen whether this solves our problems with index bloat automatically or whether we still need automatic index rebuilds from time to time.
Additionally, we may be able to tune autovacuum settings better for GitLab.com. However, it's unlikely this would fix all bloat-related issues, and such tuning wouldn't ship with GitLab directly, so self-hosted installations wouldn't benefit from it.
## Iteration plan
1. Implement a Rake task that recreates the top-N worst indexes (based on estimation)
1. For starters, run it manually on GitLab.com
1. Optional follow-up: Execute it automatically after every `db:migrate` (on deploy)
1. Follow-up: For Omnibus, set up a regular cronjob to execute it