Error connecting to the database after Failover
We connect to our PostgreSQL cluster through a VIP (virtual IP). During a failover, the replica takes over as the new master and the VIP is redirected to it. After such a failover, gitlab-exporter logs the following errors:
E, [2024-04-30T12:45:13.726973 #1] ERROR -- : Error connecting to the database: PQsocket() can't get socket descriptor
E, [2024-04-30T12:45:13.727130 #1] ERROR -- : Error connecting to the database: PQsocket() can't get socket descriptor
E, [2024-04-30T12:45:13.727250 #1] ERROR -- : Error connecting to the database: PQsocket() can't get socket descriptor
11.32.199.13 - - [30/Apr/2024:12:45:03 UTC] "GET /metrics HTTP/1.1" 200 159126
- -> /metrics
E, [2024-04-30T12:45:25.390779 #1] ERROR -- : Error connecting to the database: PQsocket() can't get socket descriptor
E, [2024-04-30T12:45:25.391144 #1] ERROR -- : Error connecting to the database: PQsocket() can't get socket descriptor
E, [2024-04-30T12:45:25.391244 #1] ERROR -- : Error connecting to the database: PQsocket() can't get socket descriptor
11.32.221.10 - - [30/Apr/2024:12:45:14 UTC] "GET /metrics HTTP/1.1" 200 159107
- -> /metrics
E, [2024-04-30T12:45:36.120069 #1] ERROR -- : Error connecting to the database: PQsocket() can't get socket descriptor
E, [2024-04-30T12:45:36.120515 #1] ERROR -- : Error connecting to the database: PQsocket() can't get socket descriptor
E, [2024-04-30T12:45:36.120642 #1] ERROR -- : Error connecting to the database: PQsocket() can't get socket descriptor
11.32.221.8 - - [30/Apr/2024:12:45:24 UTC] "GET /metrics HTTP/1.1" 200 159110
- -> /metrics
After we consulted GitLab support, they gave the following explanation of the likely cause:
I think I understand what's happening.
1. When the Database::Base class is initialized, it creates a connection pool
2. The main PostgreSQL node "goes away", secondary takes over
3. The secondary is unaware of the pooled connection
4. The exception on L#66 is thrown
5. conn.reset is not called 🙃
This might be a scenario where conn.reset could save the day.
Using PgBouncer would avoid this scenario, as it keeps track of existing connections.
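To illustrate the reset-and-retry idea from the explanation above, here is a minimal Ruby sketch. It is not the gitlab-exporter code: FakeConnection and with_reset are hypothetical names made up for this example, and FakeConnection only imitates a stale pooled connection so the snippet runs without a database. With the pg gem, the real counterpart of the reset call would be PG::Connection#reset; which exception class to rescue there is an assumption that would need checking against the actual error raised.

```ruby
# Hypothetical stand-in for a pooled PG::Connection, so the sketch runs
# without a database. Only `exec` and `reset` are modeled.
class FakeConnection
  class ConnectionBad < StandardError; end

  attr_reader :resets

  def initialize
    @stale = true # pretend the primary went away after the pool was built
    @resets = 0
  end

  def exec(sql)
    raise ConnectionBad, "PQsocket() can't get socket descriptor" if @stale

    "ok: #{sql}"
  end

  # Plays the role of PG::Connection#reset: re-establish the connection.
  def reset
    @resets += 1
    @stale = false
  end
end

# Run the block; on a connection error, reset the connection and retry,
# at most `retries` times. After that the error is re-raised.
def with_reset(conn, error_class, retries: 1)
  attempts = 0
  begin
    yield conn
  rescue error_class
    raise if attempts >= retries

    attempts += 1
    conn.reset
    retry
  end
end

conn = FakeConnection.new
result = with_reset(conn, FakeConnection::ConnectionBad) { |c| c.exec("SELECT 1") }
puts result
```

The retry count is capped so that a database that is really down still surfaces an error instead of looping forever.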
Other users may also be experiencing the same issue.