Run ANALYZE as part of database restore

Release notes

Problem to solve

When the GitLab database is restored from a backup, the restored tables have no planner statistics, so PostgreSQL cannot calculate accurate query plans.

PostgreSQL documentation recommends running ANALYZE following a database restore.

This would ensure the database is performant when customers start using the restored instance; it might also speed up the restore of other components that run after the database step.

Intended users

Administrators supporting self-managed GitLab instances.

User experience goal

Proposal

Further details

PostgreSQL performance can degrade significantly on customers' self-managed instances if accurate statistics are not available to determine the best query plan.

We know this from customers on AWS (example ticket, internal link) who performed a major upgrade to PostgreSQL and missed this step.

Whether certain tables had statistics was checked as part of an investigation with a customer running the Charts deployment (internal link):

  • database restore took 20 minutes
  • repository restore took 3+ hours
  • statistics were queried with SELECT schemaname, relname, last_analyze FROM pg_stat_all_tables WHERE relname IN ('ci_builds', 'ci_build_trace_chunks', 'ci_job_artifacts'); (a runnable version of this check is shown below)
  • no entries for those tables
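
The check can be reproduced on a restored instance. This is a minimal sketch assuming an Omnibus installation where the gitlab-psql wrapper is available; last_autoanalyze is included to show whether autovacuum has caught up:

    # Check whether the planner has any statistics for the large CI tables;
    # empty last_analyze / last_autoanalyze values mean ANALYZE has never run on them
    gitlab-psql -c "SELECT schemaname, relname, last_analyze, last_autoanalyze
                    FROM pg_stat_all_tables
                    WHERE relname IN ('ci_builds', 'ci_build_trace_chunks', 'ci_job_artifacts');"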

This is likely a side effect of gitlab:db:drop_tables (called by both the Chart backup-utility script and the rake task).

The criteria autovacuum uses to recreate statistics do not seem to trigger in a timely fashion; from the v12 docs:

The autovacuum daemon, if enabled, will automatically issue ANALYZE commands whenever the content of a table has changed sufficiently. However, administrators might prefer to rely on manually-scheduled ANALYZE operations, particularly if it is known that update activity on a table will not affect the statistics of “interesting” columns. The daemon schedules ANALYZE strictly as a function of the number of rows inserted or updated; it has no knowledge of whether that will lead to meaningful statistical changes.
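
For context, the thresholds that govern when autovacuum issues an automatic ANALYZE can be inspected directly. A minimal sketch, again assuming the Omnibus gitlab-psql wrapper; autovacuum analyzes a table once the number of changed rows exceeds autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * reltuples:

    # Show the settings that control when autovacuum issues an automatic ANALYZE
    gitlab-psql -c "SELECT name, setting FROM pg_settings
                    WHERE name IN ('autovacuum_analyze_threshold', 'autovacuum_analyze_scale_factor');"

    # Show how many rows have changed on each CI table since it was last analyzed
    gitlab-psql -c "SELECT relname, n_mod_since_analyze FROM pg_stat_all_tables
                    WHERE relname IN ('ci_builds', 'ci_build_trace_chunks', 'ci_job_artifacts');"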

The PostgreSQL documentation recommends running ANALYZE after a restore and similar bulk operations:

v12 pg_restore

Once restored, it is wise to run ANALYZE on each restored table so the optimizer has useful statistics

v12 populate database docs

Whenever you have significantly altered the distribution of data within a table, running ANALYZE is strongly recommended. This includes bulk loading large amounts of data into the table. Running ANALYZE (or VACUUM ANALYZE) ensures that the planner has up-to-date statistics about the table. With no statistics or obsolete statistics, the planner might make poor decisions during query planning, leading to poor performance on any tables with inaccurate or nonexistent statistics. Note that if the autovacuum daemon is enabled, it might run ANALYZE automatically
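
Applied to the tables above, that recommendation amounts to the following. A minimal sketch using the Omnibus gitlab-psql wrapper, with ci_builds as the example table; any psql session would work equally well:

    # Rebuild statistics for one of the bulk-loaded tables
    gitlab-psql -c "ANALYZE ci_builds;"

    # Or reclaim dead space and rebuild statistics in a single pass
    gitlab-psql -c "VACUUM ANALYZE ci_builds;"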

There is more than one way to achieve this goal:

  • SQL ANALYZE is single-threaded, so it will take longer
  • vacuumdb --analyze would be preferable (see the sketch after this list)
    • it can multithread using -j (e.g. during a maintenance window, -j $CPU_COUNT)
    • --analyze-in-stages can be used to create rough statistics quickly, permitting processing to start with reduced performance
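
A minimal sketch of the vacuumdb approach, assuming the Omnibus default database name gitlabhq_production and a locally reachable PostgreSQL; connection options and the job count would need adjusting per deployment:

    # Rebuild statistics using several parallel jobs (suited to a maintenance window)
    vacuumdb --analyze -j "$(nproc)" --dbname gitlabhq_production

    # Alternatively, produce rough statistics first and refine them in later passes,
    # so queries can start running sooner, at reduced performance
    vacuumdb --analyze-in-stages --dbname gitlabhq_production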

Performing the ANALYZE in gitlab:backup:db:restore, by whichever method is easiest as a first iteration, would ensure that both the rake task and the Charts would run the code.
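
Until that is implemented, the same effect can be achieved manually once the restore completes. A sketch for an Omnibus installation, where the BACKUP value is a placeholder for the real backup timestamp and gitlab-psql connects to the GitLab database with sufficient privileges:

    # Existing restore step
    gitlab-rake gitlab:backup:restore BACKUP=<timestamp>

    # Rebuild planner statistics for every table in the restored database
    gitlab-psql -c "ANALYZE;"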

Permissions and Security

Documentation

Availability & Testing

What does success look like, and how can we measure that?

What is the type of buyer?

Is this a cross-stage feature?

Links / references
