
gdk reset-data --fast fails and leaves Postgres/runit in a difficult-to-recover state

I first saw this in gdk reset-data --fast fails with 'relation "fea... (#578920 - closed) ("left my database in an Interesting™ state"), but that issue tracked the more prominent feature_gates-related problem that remained after manual recovery.

When verifying the fix for that issue, the problem with Postgres being left in a weird state recurred:

kivikakk@rebane /V/g/gdk ((no description set) rowyq 12e6c (empty))> gdk reset-data --fast
⚠️  WARNING: We're about to remove _all_ (GitLab and praefect) PostgreSQL data, Rails uploads and git repository data.
⚠️  WARNING: Backups will be made in '/Volumes/g/gdk/.backups', just in case!
Are you sure? [y/N]: y
=> Retrying stop (1/3)
=> Retrying stop (1/3)
=> Moving PostgreSQL data from '/Volumes/g/gdk/postgresql/data' to '/Volumes/g/gdk/.backups/postgresql/data.2025-11-05_14.16.18/'
=> Moving redis dump.rdb from '/Volumes/g/gdk/redis/dump.rdb' to '/Volumes/g/gdk/.backups/redis/dump.rdb.2025-11-05_14.16.18/'
=> Moving Rails uploads from '/Volumes/g/gdk/gitlab/public/uploads' to '/Volumes/g/gdk/.backups/gitlab/public/uploads.2025-11-05_14.16.18/'
=> Moving git repository data from '/Volumes/g/gdk/repositories' to '/Volumes/g/gdk/.backups/repositories.2025-11-05_14.16.18/'
ok: down: /Volumes/g/gdk/services/gitlab-http-router: 0s
ok: down: /Volumes/g/gdk/services/gitlab-topology-service: 0s
ok: down: /Volumes/g/gdk/services/gitlab-workhorse: 0s
ok: down: /Volumes/g/gdk/services/jaeger: 0s
ok: down: /Volumes/g/gdk/services/rails-background-jobs: 0s
ok: down: /Volumes/g/gdk/services/rails-web: 0s
ok: down: /Volumes/g/gdk/services/sshd: 0s
ok: down: /Volumes/g/gdk/services/vite: 0s
ok: down: /Volumes/g/gdk/services/praefect: 0s
ok: down: /Volumes/g/gdk/services/praefect-gitaly-0: 0s
ok: down: /Volumes/g/gdk/services/redis: 0s
ok: down: /Volumes/g/gdk/services/postgresql: 1s
The files belonging to this database system will be owned by user "kivikakk".
This user must also own the server process.

The database cluster will be initialized with locale "C".
The default text search configuration will be set to "english".

Data page checksums are disabled.

creating directory /Volumes/g/gdk/postgresql/data ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... Australia/Melbourne
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... initdb: warning: enabling "trust" authentication for local connections
initdb: hint: You can change this by editing pg_hba.conf or using the option -A, or --auth-local and --auth-host, the next time you run initdb.
ok


Success. You can now start the database server using:

    /Users/kivikakk/.local/share/mise/installs/postgres/16.10/bin/pg_ctl -D /Volumes/g/gdk/postgresql/data -l logfile start


--------------------------------------------------------------------------------
Ensuring necessary data services are running
--------------------------------------------------------------------------------
ok: run: /Volumes/g/gdk/services/praefect-gitaly-0: (pid 66349) 0s, normally down
ok: run: /Volumes/g/gdk/services/redis: (pid 66350) 0s, normally down
ok: run: /Volumes/g/gdk/services/praefect: (pid 66451) 0s, normally down
timeout: down: /Volumes/g/gdk/services/postgresql: 0s, want up
make: *** [ensure-databases-running] Error 1
[sentry] `config.logger` is deprecated. Please use `config.sdk_logger` instead.
❌️ ERROR: Failed to reset data.
-------------------------------------------------------
You can try the following that may be of assistance:

- Run 'gdk doctor'.

- Visit the troubleshooting documentation:
  https://gitlab.com/gitlab-org/gitlab-development-kit/-/blob/main/doc/troubleshooting/index.md.
- Visit https://gitlab.com/gitlab-org/gitlab-development-kit/-/issues to
  see if there are known issues.

- Run 'gdk reset-data' if appropriate.
- Run 'gdk pristine' to reinstall dependencies, remove temporary files, and clear caches.
-------------------------------------------------------
kivikakk@rebane /V/g/gdk ((no description set) rowyq 12e6c (empty)) [1]>
kivikakk@rebane /V/g/gdk ((no description set) rowyq 12e6c (empty))> gdk status
+--------+---------------------+--------------------------+
| PID    | STATUS              | SERVICE                  |
+--------+---------------------+--------------------------+
|        | down 77s            | gitlab-http-router       |
|        | down 79s            | gitlab-topology-service  |
|        | down 78s            | gitlab-workhorse         |
|        | down 77s            | jaeger                   |
|        | down (want up) -2s  | postgresql               |
|        | down (want up) 1s   | praefect                 |
| 66349  | up 77s              | praefect-gitaly-0        |
|        | down 78s            | rails-background-jobs    |
|        | down 80s            | rails-web                |
| 66350  | up 79s              | redis                    |
|        | down 78s            | sshd                     |
|        | down 80s            | vite                     |
+--------+---------------------+--------------------------+

=> GitLab available at http://gdk.test:3000
=>   - Ruby: ruby 3.3.9 (2025-07-24 revision f5c772fc7c) [arm64-darwin24].
=>   - Node.js: v22.19.0.
=> The HTTP Router is available at http://gdk.test:3000
=> The TopologyService is up and running.
kivikakk@rebane /V/g/gdk ((no description set) rowyq 12e6c (empty))> gdk tail postgresql
2025-11-05_03:16:18.51688 postgresql              : runit control/t: sending TERM to -35704
2025-11-05_03:16:18.51746 postgresql              : runit control/t: sending TERM to 35704
2025-11-05_03:16:18.51753 postgresql              : 2025-11-05 14:16:18.516 AEDT [35714] LOG:  received smart shutdown request
2025-11-05_03:16:18.51844 postgresql              : 2025-11-05 14:16:18.517 AEDT [35714] LOG:  received fast shutdown request
2025-11-05_03:16:18.51852 postgresql              : 2025-11-05 14:16:18.518 AEDT [35714] LOG:  background worker "logical replication launcher" (PID 35726) exited with exit code 1
2025-11-05_03:16:18.52092 postgresql              : 2025-11-05 14:16:18.520 AEDT [35721] LOG:  shutting down
2025-11-05_03:16:18.52139 postgresql              : 2025-11-05 14:16:18.520 AEDT [35721] LOG:  checkpoint starting: shutdown immediate
2025-11-05_03:16:18.73984 postgresql              : 2025-11-05 14:16:18.739 AEDT [35721] LOG:  checkpoint complete: wrote 11164 buffers (8.5%); 0 WAL file(s) added, 0 removed, 7 recycled; write=0.216 s, sync=0.001 s, total=0.219 s; sync files=0, longest=0.000 s, average=0.000 s; distance=119000 kB, estimate=119000 kB; lsn=1/24F0AAA0, redo lsn=1/24F0AAA0
2025-11-05_03:16:18.75887 postgresql              : 2025-11-05 14:16:18.758 AEDT [35714] LOG:  database system is shut down
2025-11-05_03:16:18.76018 postgresql              : Sending INT to 35714
^C⏎
kivikakk@rebane /V/g/gdk ((no description set) rowyq 12e6c (empty))> ps aux|grep [p]ostgre
kivikakk         35672   1.1  0.0 435298944   1136   ??  Ss   Fri12PM   0:01.07 runsv postgresql
kivikakk         66320   0.0  0.0 435299584   1456   ??  S     2:16PM   0:00.04 runsvdir -P /Volumes/g/gdk/services log: /lock: temporary failure\012runsv postgresql: fatal: unable to lock supervise/lock: temporary failure\012runsv postgresql: fatal: unable to lock supervise/lock: temporary failure\012runsv postgresql: fatal: unable to lock supervise/lock: temporary failure\012runsv postgresql: fatal: unable to lock supervise/lock: temporary failure\012runsv postgresql: fatal: unable to lock supervise/lock: temporary failure\012
kivikakk@rebane /V/g/gdk ((no description set) rowyq 12e6c (empty))> 

This is a similar failure mode to what I observed last time; I have to manually kill the runsvdir process (66320) to get the system back into a state where gdk commands can control the Postgres process (via runit).

After killing that process and stopping everything again with gdk stop, I ran gdk reset-data --fast once more, and it looked like everything worked.

In short, the runsvdir process gets stuck during the reset attempt, and ps aux reveals the following:

kivikakk 66320 0.0 0.0 435299584 1456 ?? S 2:16PM 0:00.04 runsvdir -P /Volumes/g/gdk/services log: /lock: temporary failure\012runsv postgresql: fatal: unable to lock supervise/lock: temporary failure\012runsv postgresql: fatal: unable to lock supervise/lock: temporary failure\012runsv postgresql: fatal: unable to lock supervise/lock: temporary failure\012runsv postgresql: fatal: unable to lock supervise/lock: temporary failure\012runsv postgresql: fatal: unable to lock supervise/lock: temporary failure\012

gdk stop and similar commands do not get runsvdir out of this state, and Postgres is non-functional while it persists. The only way I have found to restore GDK's ability to control Postgres (and therefore to do much of anything) is to manually kill the runsvdir process.
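
For anyone else who hits this, the detection half of that manual recovery can be sketched in shell. This is only a rough sketch, not something GDK ships: stuck_runsvdir_pids is a hypothetical helper name, and it assumes ps aux-style output in which the PID is the second column and the runsvdir status text contains "unable to lock supervise/lock", as in the listing above.

```shell
# Hypothetical helper (not part of GDK): reads ps output on stdin and prints
# the PID of every runsvdir line that reports the supervise/lock failure.
# Assumes the `ps aux` layout, where the PID is the second field.
stuck_runsvdir_pids() {
  awk '/runsvdir/ && /unable to lock supervise\/lock/ { print $2 }'
}

# Possible usage, left commented out because it signals real processes:
#   ps auxww | stuck_runsvdir_pids    # lists candidate runsvdir PIDs
#   kill <pid>                        # substitute a PID printed above
#   gdk stop
#   gdk reset-data --fast
```

The helper only reports PIDs. Killing is left as a separate manual step so that the match can be checked by eye first.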

/cc @gl-dx/development-tooling @splattael

Edited by 🤖 GitLab Bot 🤖