PostgreSQL Locks - Luminus Alabi

module-name: "PostgreSQL locks"
area: "Product Knowledge"
gitlab-group: "Database"
maintainers:
  - bprescott_

Training name and description
Prerequisites
Create Your Module
Introduction
- Where to ask for help
Prepare test system
Theory
Locks block a reindexing operation.
Locks block a GitLab database migration
Be careful with locks
Completion

Training name and description

Goal: You understand how PostgreSQL uses locks, and why operations GitLab performs during upgrades such as creating indexes might stall because of lock acquisition.

Length: 1-2 hours

Objectives: At the end of this module, you should be able to:

Investigate what sessions are live on the database.
Find out what locks are in place.
Find out what locks have been requested.
Spot the messages about advisory locks when migrations are run.

Prerequisites

A single node Omnibus installation: if your test environment is scaled out or deployed in K8S, that will work as well, but you will need to figure out how to achieve some steps.

Ideally, this should be running GitLab 17.x: a specific migration is used as part of this training to remove and add an index.

This migration (and index) needs to be present in the code base for all the training to work.
At some point the migration will be removed from the code base, and this module will need to be updated with an alternative.
Any version of Gitlab will be fine for the first and third activities.

Create Your Module

Create an issue using this template by making the Issue Title: PostgreSQL locks training - <your name>.
Add yourself as the assignee.

Introduction

A number of customers upgrading to GitLab releases around 16.11 and 17.0 ran into issues with long running database migrations.

A very common root cause was that autovacuum had selected a table for work, and this was preventing GitLab migrations from making modifications to that table.

During normal operations of GitLab, locks can cause issues for instances, such as

Performance issues as queries are delayed while a lock is obtained.
Sidekiq runs out of worker threads because lots of the configured threads are queuing for a lock.

Where to ask for help

If at any time, you have questions, ask in slack: #spt_pod_database.

Prepare test system

These steps can be done in parallel with the theory section below.

Boot up your test machine.
Shut down Puma, Sidekiq, Gitlab exporter, PostgreSQL exporter.
- The main tasks in this module are performed with no other clients on the database. It'll be easier to see what's going on.
- To run migrations, Rails will need all backend components.
- On a single node install, this is what's needed:
```
gitlab-ctl stop
gitlab-ctl start postgresql
gitlab-ctl start gitaly
gitlab-ctl start redis
```
Start three database console sessions.
```
sudo gitlab-psql
```

Theory

Read about vacuuming in PostgreSQL
- Introduction, and the Section: 'Vacuuming Basics'.
- Section: 'Recovering Disk Space'.
- Section: 'Updating Planner Statistics'.
- The next three sections are not requied reading: 'Updating the Visibility Map', 'Preventing Transaction ID Wraparound Failures', and 'Multixacts and Wraparound'.
- Section: 'The Autovacuum Daemon'.
  - Skim over the middle of this section - whether a table will get checked, calculations for vacuum threshold etc.
  - TL;DR: autovacuum triggers because lots of records get deleted or updated.
  - The last six paragraphs are useful.
Learn about the different sorts of locks PostgreSQL uses:
- Locks apply at different levels (rows, tables) and there's different sorts of locks - some are more co-operative than others.
- 'Table-Level Locks'
- 'Row-Level Locks'
- 'Advisory Locks'
  - GitLab migrations use these. More later.
There's a list of activities that can run alongside vacuum on a table, and those which cannot. Note the statement under 'Commands conflicting with VACUUM on the table':
- If transaction 1 does a VACUUM on the table, transaction 2 is blocked from doing any of the commands listed, such as REINDEX or DROP INDEX.
- If transaction 1 does any of the following commands on the same table, transaction 2 is blocked from executing VACUUM.
- More comprehensive locks do not trump lighter locks: it's a first-come first-served model. The locks just block each other.

Locks block a reindexing operation.

Select a table to reindex such as users, projects, merge_requests or issues.
- The examples use projects (so it continues to work: it's unlikely to get partitioned)
View the table definition. Check it has indexes; they're the first main section below the column definitions for the table:
```
\d+ projects
```

In one session, start a transaction and then take a SHARE UPDATE EXCLUSIVE lock, which is the sort taken by vacuum.

-- check what the idle transaction timeout is
--
SHOW idle_in_transaction_session_timeout;
--
-- set it temporarily to indefinite, and ensure statements don't time out as well
--
SET idle_in_transaction_session_timeout = 0;
SET statement_timeout = 0;
--
-- start a transaction
--
BEGIN;
--
-- what's my process id?
--
select pg_backend_pid();
LOCK projects IN SHARE UPDATE EXCLUSIVE MODE;

In a second session, investigate the lock using the system view pg_locks:

-- take a look at the definition of pg_locks
--
\d+ pg_locks
--
-- see all the locks; GitLab is down to reduce the row count here!
--
SELECT * from pg_locks;
--
-- not all the columns are interesting, and the table name (relation) is not human readable
--
SELECT pid, locktype, mode, granted, relation::regclass FROM pg_locks;

You should see locks, including the ShareUpdateExclusiveLock you took out in the other session, plus AccessShareLock for the SELECT.

In the third session, let's try and reindex the table.

SET statement_timeout = 0;
select pg_backend_pid();
REINDEX TABLE CONCURRENTLY projects;

In the second session, look at locks
```
SELECT pid, locktype, mode, granted, relation::regclass FROM pg_locks;
```
- You'll see locks for both of your other two sessions: one granted: t, the other granted: f
Put the first and third sessions side by side, and then end the transaction that has the table lock:
```
END;
```
- The reindex will then execute and complete.

Locks block a GitLab database migration

In this section, the effect of locks on a more common customer situation is explored.

If a customer's environment had a long running autovacuum running over a table, they'd be unable to create indexes on that table.

The test system should be prepared as above
Only two database sessions are needed: the third session is used to run rake tasks.
Remove an index from the vulnerability_reads table, by rolling back a migration from 17.0.
```
sudo gitlab-rake db:migrate:down VERSION=20240410104838
```

Verify it's down

sudo gitlab-rake db:migrate:status | grep ' down '

Take a lock on the vulnerability_reads table.

This lock type is the one taken by autovacuum.

SET idle_in_transaction_session_timeout = 0;
SET statement_timeout = 0;
BEGIN;
LOCK vulnerability_reads IN SHARE UPDATE EXCLUSIVE MODE;

Look at what locks are held:

SELECT pid, locktype, mode, granted, relation::regclass FROM pg_locks;

Run the migration rake task, simulating a GitLab upgrade:
```
sudo gitlab-rake db:migrate
```
When the rake task is run interactively, output indicates what the current operation is. You won't always see this output in customer environments. During RHEL upgrades, for example, migration output is captured by yum/dnf and isn't displayed at all.

Expect to start from scratch in a customer situation, with no idea what operation is taking a long time.

Find out what long running queries there are:
```
SELECT pid,usename,application_name,query_start,wait_event_type,wait_event,state,query 
FROM pg_stat_activity
WHERE state <> 'idle'
  and query_start < now() - interval '5 seconds';
```
- autovacuum would show as a long running operation. The table name will be in either the application_name or query fields.
- In this example, instead you'll see idle in transaction | LOCK vulnerability_reads IN SHARE UPDATE EXCLUSIVE MODE;
- You should also see the rake session trying to CREATE INDEX CONCURRENTLY .. ON "vulnerability_reads"
Investigate locks:
```
SELECT pid, locktype, mode, granted, relation::regclass FROM pg_locks;
```
- Are all the locks granted?
- Compare the pid fields with the pg_stat_activity output; does CREATE INDEX CONCURRENTLY have a granted lock?
Look also for the advisory lock that the migrations have created.
```
 pid  |  locktype  |           mode           | granted |      relation       
------+------------+--------------------------+---------+---------------------
 1657 | advisory   | ExclusiveLock            | t       | 
```
You'll see matching information in the Rake task output.
```
main: == [advisory_lock_connection] object_id: 54880, pg_backend_pid: 1657
```
- This lock is tied to a specific session.
- If migrations run through PgBouncer, the operations may get moved to a different session, since PgBouncer has a pool of them available.
- The lock is then not on the same session that the rake task is using.
- Review issue 33010.
Tile the sessions so you can see what happens to the reindex when you remove the lock.

End the transaction that is holding the with the lock:
```
END;
```
If autovacuum is triggering for a table, simply cancelling it won't be enough - it'll come back automatically.
- See the workaround that involves manually vacuuming the table so it no longer requires the attention of autovacuum.

Be careful with locks

Certain operations are not performed by GitLab code because they take an intrusive lock that would prevent changes by Rails, or worse even SELECT is blocked:

ACCESS SHARE (AccessShareLock)

Conflicts with the ACCESS EXCLUSIVE lock mode only.

The SELECT command acquires a lock of this mode on referenced tables. In general, any query that only reads a table and does not modify it will acquire this lock mode.

(source)

One operation which would do this is REINDEX without CONCURRENTLY. Reindexing an index or a whole table is a change we might ask a customer to do.

Start GitLab
```
gitlab-ctl start
```

In a database console, take an ACCESS EXCLUSIVE lock on a table that a specific part of the product uses. For example, issues:

SET idle_in_transaction_session_timeout = 0;
SET statement_timeout = 0;
BEGIN;
LOCK issues IN ACCESS EXCLUSIVE MODE;

See what happens in the UI.
- Navigate to the issues view in a project.
- You'll find the page get stuck loading.
- You might run into other issues, for example populating issue counts for the menu bar.
- After some time, you'll get a 500 error, with a correlation ID.

While the page is trying to load, on a second console session:

-- long running sessions
--
SELECT pid,usename,application_name,query_start,wait_event_type,wait_event,state,query 
FROM pg_stat_activity
WHERE state <> 'idle'
  and query_start < now() - interval '5 seconds';
--
-- locks
--
SELECT pid, locktype, mode, granted, relation::regclass FROM pg_locks;

You might see more than one session from Puma failing to get a lock.

GitLab queries are prefixed with marginalia that includes the correlation ID:

/*application:web,correlation_id:01J8EZ7T6VFM60D85J561GP7WM,endpoint_id
                  :Projects::IssuesController#index,db_config_name:main*/

Complete the transaction
- The 500 errors should stop now.
```
END;
```

Completion

Close the issue

Edited Nov 18, 2025 by Luminus Alabi