PostgreSQL Locks - Luminus Alabi
module-name: "PostgreSQL locks"
area: "Product Knowledge"
gitlab-group: "Database"
maintainers:
- bprescott_
- Training name and description
- Prerequisites
- Create Your Module
- Introduction
- Prepare test system
- Theory
- Locks block a reindexing operation.
- Locks block a GitLab database migration
- Be careful with locks
- Completion
Training name and description
Goal: You understand how PostgreSQL uses locks, and why operations GitLab performs during upgrades such as creating indexes might stall because of lock acquisition.
Length: 1-2 hours
Objectives: At the end of this module, you should be able to:
- Investigate what sessions are live on the database.
- Find out what locks are in place.
- Find out what locks have been requested.
- Spot the messages about advisory locks when migrations are run.
Prerequisites
A single node Omnibus installation: if your test environment is scaled out or deployed in K8S, that will work as well, but you will need to figure out how to achieve some steps.
Ideally, this should be running GitLab 17.x: a specific migration is used as part of this training to remove and add an index.
- This migration (and index) needs to be present in the code base for all the training to work.
- At some point the migration will be removed from the code base, and this module will need to be updated with an alternative.
- Any version of Gitlab will be fine for the first and third activities.
Create Your Module
-
Create an issue using this template by making the Issue Title: PostgreSQL locks training - <your name>. -
Add yourself as the assignee.
Introduction
A number of customers upgrading to GitLab releases around 16.11 and 17.0 ran into issues with long running database migrations.
A very common root cause was that autovacuum had selected a table for work, and this was preventing GitLab migrations from making modifications to that table.
During normal operations of GitLab, locks can cause issues for instances, such as
- Performance issues as queries are delayed while a lock is obtained.
- Sidekiq runs out of worker threads because lots of the configured threads are queuing for a lock.
Where to ask for help
If at any time, you have questions, ask in slack: #spt_pod_database.
Prepare test system
These steps can be done in parallel with the theory section below.
-
Boot up your test machine. -
Shut down Puma, Sidekiq, Gitlab exporter, PostgreSQL exporter. - The main tasks in this module are performed with no other clients on the database. It'll be easier to see what's going on.
- To run migrations, Rails will need all backend components.
- On a single node install, this is what's needed:
gitlab-ctl stop gitlab-ctl start postgresql gitlab-ctl start gitaly gitlab-ctl start redis -
Start three database console sessions. sudo gitlab-psql
Theory
-
Read about vacuuming in PostgreSQL
-
Introduction, and the Section: 'Vacuuming Basics'. -
Section: 'Recovering Disk Space'. -
Section: 'Updating Planner Statistics'. - The next three sections are not requied reading: 'Updating the Visibility Map', 'Preventing Transaction ID Wraparound Failures', and 'Multixacts and Wraparound'.
-
Section: 'The Autovacuum Daemon'. - Skim over the middle of this section - whether a table will get checked, calculations for
vacuum thresholdetc. - TL;DR: autovacuum triggers because lots of records get deleted or updated.
- The last six paragraphs are useful.
- Skim over the middle of this section - whether a table will get checked, calculations for
-
- Learn about the different sorts of locks PostgreSQL uses:
- Locks apply at different levels (rows, tables) and there's different sorts of locks - some are more co-operative than others.
-
'Table-Level Locks' -
'Row-Level Locks' -
'Advisory Locks' - GitLab migrations use these. More later.
-
There's a list of activities that can run alongside vacuum on a table, and those which cannot. Note the statement under 'Commands conflicting with VACUUM on the table': - If transaction 1 does a VACUUM on the table, transaction 2 is blocked from doing any of the commands listed, such as
REINDEXorDROP INDEX. - If transaction 1 does any of the following commands on the same table, transaction 2 is blocked from executing VACUUM.
- More comprehensive locks do not trump lighter locks: it's a first-come first-served model. The locks just block each other.
- If transaction 1 does a VACUUM on the table, transaction 2 is blocked from doing any of the commands listed, such as
Locks block a reindexing operation.
-
Select a table to reindex such as users,projects,merge_requestsorissues.- The examples use
projects(so it continues to work: it's unlikely to get partitioned)
- The examples use
-
View the table definition. Check it has indexes; they're the first main section below the column definitions for the table: \d+ projects -
In one session, start a transaction and then take a SHARE UPDATE EXCLUSIVElock, which is the sort taken byvacuum.-- check what the idle transaction timeout is -- SHOW idle_in_transaction_session_timeout; -- -- set it temporarily to indefinite, and ensure statements don't time out as well -- SET idle_in_transaction_session_timeout = 0; SET statement_timeout = 0; -- -- start a transaction -- BEGIN; -- -- what's my process id? -- select pg_backend_pid(); LOCK projects IN SHARE UPDATE EXCLUSIVE MODE; -
In a second session, investigate the lock using the system view pg_locks:-- take a look at the definition of pg_locks -- \d+ pg_locks -- -- see all the locks; GitLab is down to reduce the row count here! -- SELECT * from pg_locks; -- -- not all the columns are interesting, and the table name (relation) is not human readable -- SELECT pid, locktype, mode, granted, relation::regclass FROM pg_locks;- You should see locks, including the
ShareUpdateExclusiveLockyou took out in the other session, plusAccessShareLockfor theSELECT.
- You should see locks, including the
-
In the third session, let's try and reindex the table. SET statement_timeout = 0; select pg_backend_pid(); REINDEX TABLE CONCURRENTLY projects; -
In the second session, look at locks SELECT pid, locktype, mode, granted, relation::regclass FROM pg_locks;- You'll see locks for both of your other two sessions: one
granted: t, the othergranted: f
- You'll see locks for both of your other two sessions: one
-
Put the first and third sessions side by side, and then end the transaction that has the table lock: END;- The reindex will then execute and complete.
Locks block a GitLab database migration
In this section, the effect of locks on a more common customer situation is explored.
If a customer's environment had a long running autovacuum running over a table, they'd be
unable to create indexes on that table.
-
The test system should be prepared as above -
Only two database sessions are needed: the third session is used to run rake tasks. -
Remove an index from the vulnerability_readstable, by rolling back a migration from 17.0.sudo gitlab-rake db:migrate:down VERSION=20240410104838 -
Verify it's downsudo gitlab-rake db:migrate:status | grep ' down ' -
Take a lock on the vulnerability_readstable.- This lock type is the one taken by
autovacuum.
SET idle_in_transaction_session_timeout = 0; SET statement_timeout = 0; BEGIN; LOCK vulnerability_reads IN SHARE UPDATE EXCLUSIVE MODE; - This lock type is the one taken by
-
Look at what locks are held: SELECT pid, locktype, mode, granted, relation::regclass FROM pg_locks; -
Run the migration rake task, simulating a GitLab upgrade: sudo gitlab-rake db:migrate -
When the rake task is run interactively, output indicates what the current operation is. You won't always see this output in customer environments. During RHEL upgrades, for example, migration output is captured by yum/dnfand isn't displayed at all.Expect to start from scratch in a customer situation, with no idea what operation is taking a long time.
Find out what long running queries there are:
SELECT pid,usename,application_name,query_start,wait_event_type,wait_event,state,query FROM pg_stat_activity WHERE state <> 'idle' and query_start < now() - interval '5 seconds';-
autovacuumwould show as a long running operation. The table name will be in either theapplication_nameorqueryfields. - In this example, instead you'll see
idle in transaction | LOCK vulnerability_reads IN SHARE UPDATE EXCLUSIVE MODE; - You should also see the
rakesession trying toCREATE INDEX CONCURRENTLY .. ON "vulnerability_reads"
-
-
Investigate locks: SELECT pid, locktype, mode, granted, relation::regclass FROM pg_locks;- Are all the locks granted?
- Compare the
pidfields with thepg_stat_activityoutput; doesCREATE INDEX CONCURRENTLYhave a granted lock?
-
Look also for the advisory lock that the migrations have created. pid | locktype | mode | granted | relation ------+------------+--------------------------+---------+--------------------- 1657 | advisory | ExclusiveLock | t |You'll see matching information in the Rake task output.
main: == [advisory_lock_connection] object_id: 54880, pg_backend_pid: 1657- This lock is tied to a specific session.
- If migrations run through PgBouncer, the operations may get moved to a different session, since PgBouncer has a pool of them available.
- The lock is then not on the same session that the rake task is using.
-
Review issue 33010.
-
Tile the sessions so you can see what happens to the reindex when you remove the lock. End the transaction that is holding the with the lock:
END; -
If
autovacuumis triggering for a table, simply cancelling it won't be enough - it'll come back automatically.-
See the workaround that involves manually vacuuming the table so it no longer requires the attention of autovacuum.
-
Be careful with locks
Certain operations are not performed by GitLab code because they take an intrusive lock that would prevent
changes by Rails, or worse even SELECT is blocked:
ACCESS SHARE(AccessShareLock)Conflicts with the
ACCESS EXCLUSIVElock mode only.The
SELECTcommand acquires a lock of this mode on referenced tables. In general, any query that only reads a table and does not modify it will acquire this lock mode.
(source)
One operation which would do this is REINDEX without CONCURRENTLY. Reindexing an index or a whole table is
a change we might ask a customer to do.
-
Start GitLab gitlab-ctl start -
In a database console, take an ACCESS EXCLUSIVElock on a table that a specific part of the product uses. For example, issues:SET idle_in_transaction_session_timeout = 0; SET statement_timeout = 0; BEGIN; LOCK issues IN ACCESS EXCLUSIVE MODE; -
See what happens in the UI. - Navigate to the issues view in a project.
- You'll find the page get stuck loading.
- You might run into other issues, for example populating issue counts for the menu bar.
- After some time, you'll get a 500 error, with a correlation ID.
-
While the page is trying to load, on a second console session: -- long running sessions -- SELECT pid,usename,application_name,query_start,wait_event_type,wait_event,state,query FROM pg_stat_activity WHERE state <> 'idle' and query_start < now() - interval '5 seconds'; -- -- locks -- SELECT pid, locktype, mode, granted, relation::regclass FROM pg_locks;-
You might see more than one session from Puma failing to get a lock.
-
GitLab queries are prefixed with marginalia that includes the correlation ID:
/*application:web,correlation_id:01J8EZ7T6VFM60D85J561GP7WM,endpoint_id :Projects::IssuesController#index,db_config_name:main*/
-
-
Complete the transaction - The 500 errors should stop now.
END;
Completion
-
Close the issue