Risk Analysis for Teleport/Console access directly into Gitlab's Main/CI patroni clusters
Risk Analysis for Teleport/Console access directly into Gitlab's Main/CI patroni clusters
Executive Summary
As action taken at https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15828#note_994928649, Teleport console will be configured to connect directly into Patroni's "backup" hosts, eg. patroni-06-db-gstg.c.gitlab-staging-1.internal
for GSTG main DB and patroni-ci-03-db-gstg.c.gitlab-staging-1.internal
for GSTG CI DB, in GPRD the deployment still is pending.
As discussed at https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15828#note_993371459, it is not a good practice to allow internal users (like SDEs, SREs, etc) to connect directly into a Main/CI cluster replicas, because a manually execution of a slow SQL query (intentionally or by accident) can cause excessive bloat and degrade application queries performance, because all Gitlab's patroni replicas have hot_standby_feedback
enabled.
Technical details
When replicas have hot_standby_feedback=on
they send a feedback of their oldest running queries (aka xmin
) into the cluster Writer/Leader node, which will use this information to delay the cleaning of dead tuples that are still being read and can't be vacuumed. This behaviour is designed in order to avoid query conflicts and the undesired query canceling in the Replicas (as per reference [1]).
Therefore, if a user that is testing or validating a badly written query, or if he needs to run a query that is know to be long/slow, this query can cause a massive bloat on heavily updated objects (tables, indexes, etc), and massive object bloats are a well known cause of performance degradation in the PostgreSQL engine.
_Note: Currently, all Patroni Replicas, including postgres-dr-archive
, have hot_standby_feedback=on
.
Reference [1] - PostgreSQL - Handling Query Conflicts - https://www.postgresql.org/docs/12/hot-standby.html#HOT-STANDBY-CONFLICT
Proposed Alternatives
Create a new cluster for Teleport/Console access, with hot_standby_feedback=off
(disabled) and with very large values for max_standby_archive_delay
and max_standby_streaming_delay
(these 2 parameters define the maximum time a query can run before being killed/cancelled when vacuum conflicts arise if the standby feedback is disabled, as per reference [1]).