Short/mid-term solutions for {archive,delayed} replica servers lagging replication

Introduction

This issue is both an analysis of recent issues with replica servers and an open discussion with proposals on how to fix them.

There have been several cases, but the most recent ones are:

and affect the same servers: postgres-dr-archive-01-db-gprd.c.gitlab-production.internal and postgres-dr-delayed-01-db-gprd.c.gitlab-production.internal. Both are hot standbys replicating in archive mode: changes made on the production node reach them not via streaming replication, but by consuming WAL files from the archive (a GCS bucket).

In both cases, replication lag grew significantly, at one point reaching 4TB, at the time equivalent to more than two days of lag. These servers serve some kind of reporting workload (I don't have more details; if anyone wants to fill them in, feel free to edit here. From @tayloramurphy: "We use the archive replica as the source database for copying data to the data warehouse, which feeds a lot of downstream reporting. For our use case we can handle a several (2-4) hour lag, as there is some redundancy in the syncing queries we use."), and can tolerate some data staleness, but definitely not the lag experienced here.

Brief summary of how Postgres (physical) replication works

Postgres physical replication works the same way regardless of whether you replicate via streaming or WAL archives: each time a "change" (a WAL record) is generated on the master node, it eventually ends up on the replica, and once it arrives there, Postgres applies it.
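
For reference, lag in time terms can be checked directly on such a replica; a minimal example (database name as used elsewhere in this issue):

psql -d gitlabhq_production -c "SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;"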

However, WAL records cannot be applied blindly: they may conflict with read-only queries running on the replica at that moment. On the master this is not an issue, since locking orders potentially conflicting operations, conflicts are resolved there, and WAL is generated accordingly. On the replica, however, read-only queries take their own (lighter) locks, and those are not necessarily compatible with the "write" operations arriving via the WAL.

To understand these conflicts, a couple of examples:

  • A trivial example is a read-only query on a table, and an incoming WAL record with the command to DROP that table.
  • A more frequent case is a WAL record marking a row version as reusable because it was vacuumed on the master, while that row version is still visible to a snapshot held by a running query on the replica.
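
These conflicts are counted per database on the standby itself in pg_stat_database_conflicts. A quick check of whether queries are actually being cancelled because of them (run on the replica) could look like this:

psql -d gitlabhq_production -c "
  SELECT datname, confl_snapshot, confl_lock, confl_bufferpin, confl_deadlock
    FROM pg_stat_database_conflicts
   WHERE datname = current_database();"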

This situation forces a compromise, and ultimately a business decision:

  • Whether to use hot_standby_feedback. If enabled, the replica sends the master information about the read-only queries it is currently executing, and the master holds back operations (such as vacuum) that would conflict with them. This behaves much as if the replica's queries were running on the master, and the side effect is added pressure there, since vacuum and similar operations are delayed. For busy masters the general recommendation is to leave this setting off to avoid putting an extra burden on them; I'd say hot_standby_feedback=on is probably not a good option for GitLab.

  • The values of max_standby_archive_delay (or max_standby_streaming_delay for streaming replicas). These parameters control how long a query is allowed to run on a replica before being forcibly cancelled when an incoming WAL record is waiting for it to finish before it can be applied. The compromise here is replication lag (caused by WAL apply being held back) versus user experience (queries may get cancelled). An illustrative configuration is sketched right after this list.
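
For illustration, the two knobs look like this in postgresql.conf on the replicas (the values are examples for discussion, not a recommendation):

# postgresql.conf on the archive/delayed replicas -- example values only
hot_standby_feedback = off          # keep vacuum/bloat pressure off the busy master
max_standby_archive_delay = 30s     # how long a conflicting query may hold back WAL apply (archive recovery)
#max_standby_streaming_delay = 30s  # equivalent knob for streaming replicas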

Root cause analysis

Based on the observed issues, there were apparently two effects contributing to the replica lag: a large value of max_standby_archive_delay, and WAL-e fetching WAL files more slowly than required.

max_standby_archive_delay

This value was initially set to 10min. That means that if a query running on the replica conflicts with an incoming WAL record, the WAL is held back until the query finishes; if the query does not finish within 10 minutes, it is cancelled and the WAL file (and any files queued behind it) is applied. It is tempting to conclude that this setting cannot lead to lag much larger than 10 minutes. That is not the case.

An individual query can only delay replication by this amount. But if another query with the same conflict is running in parallel and started 5 minutes later, the WAL is held back for 15 minutes. It is easy to see how, in the presence of frequent or continuous queries that run into WAL recovery conflicts, lag can grow without bound.

While resolving these incidents, the OnGres team very frequently found WAL apply paused. In a process listing it looks like this:

gitlab-+  8926  8921  9 Mar05 ?        01:51:10 postgres: startup process   waiting for 0000001F0001DA8C0000006B
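
A quick way to spot this state (the exact process title varies a bit between Postgres versions, so treat the pattern as illustrative):

# show the recovery (startup) process and whether it is currently waiting on a conflict
ps -eo pid,etime,args | grep '[s]tartup process'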

Why are such conflicts so frequent? As an example, the query spotted by @gerardo.herzig on production#1703 (comment 300511503) is a long-running query (more than 10 minutes!). Its execution plan on the DR node is:

gitlabhq_production=# explain select date_trunc('year', created_at) as day, count(*) from ci_builds where project_id=278964 group by 1 order by 1 desc;
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=26668255.76..27066303.56 rows=19746298 width=16)
   Group Key: (date_trunc('year'::text, created_at))
   ->  Sort  (cost=26668255.76..26718662.12 rows=20162543 width=8)
         Sort Key: (date_trunc('year'::text, created_at)) DESC
         ->  Index Scan using index_ci_builds_on_project_id_and_id on ci_builds  (cost=0.57..23690387.66 rows=20162543 width=8)
               Index Cond: (project_id = 278964)
(6 rows)

The query scans 20M rows! If any of those rows is vacuumed on the master node, the WAL record reflecting that vacuum cannot be applied on the replica until this query finishes.
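
Long-running queries of this kind, which are the likely blockers, can be spotted on the replica with a simple check such as:

psql -d gitlabhq_production -c "
  SELECT pid, now() - query_start AS runtime, left(query, 80) AS query
    FROM pg_stat_activity
   WHERE state = 'active'
     AND now() - query_start > interval '1 minute'
   ORDER BY query_start;"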

So we can conclude that WAL recovery conflicts are potentially frequent, and that they delay WAL apply and contribute to replica lag. Further evidence in this direction is the following graph:

image https://prometheus-db.gprd.gitlab.net/graph?g0.range_input=5m&g0.end_input=2020-03-07%2009%3A59&g0.expr=(pg_replication_lag)%20and%20on(instance)%20(pg_replication_is_replica%7Btype%3D%22postgres-archive%22%7D%20%3D%3D%201)&g0.tab=0

This was captured at a time when many WALs had been prefetched manually to local disk and Postgres was asked to apply them at full speed, with no network involved (see production#1703 (comment 300841554)). CPU was far from saturated, even though apply is single-threaded, so the graph would be expected to show a steady drop in lag. Instead it is ladder-shaped, with periods of a little over 10 seconds where lag apparently does not decrease, followed by sudden drops. Tellingly, max_standby_archive_delay was set to 10s at the time.

In other words: there appear to be frequent conflicts in the WAL recovery process due to concurrent queries on the replicas. The larger max_standby_archive_delay is, the more lag the replica may accumulate. But even reducing it significantly is no guarantee that lag will not grow very large.

WAL-e fetching speed

As I commented on production#1703 (comment 296194554), the WAL producer (the Postgres master) and the consumer (the replica replicating from the GCS archive) have different peak speeds. Roughly speaking, the replica consumes at most 2 WAL segments per second (numbers obtained by parsing WAL-e logs and extracting the timestamps of the fetched WAL files), while the master at times produces 3 or more WAL segments per second.

[Off-topic, but potentially worth considering: the master generates changes at a rate that at times approaches 4TB/day, i.e. roughly 48MB/s, or about three 16MB WAL segments per second. This is a significant volume, and a separate issue could be opened to understand this high rate of change and whether it can be optimized.]

WAL-e fetches files from GCS on demand: when Postgres has processed the last WAL file it has, it calls wal-e to fetch the next one, if it exists. With a twist: WAL-e supports prefetching. Fetching a single WAL file from the archive is fairly slow (around 400-600ms), but prefetching should make that a non-issue: when WAL file X is requested, several subsequent files are downloaded in parallel so that they are already present before they are needed. Even though WAL-e is still invoked by Postgres for every file, it serves prefetched files from a local filesystem cache (the .wal-e directory). However, we have not seen the prefetch mechanism work well, or reach the speed that would be expected.
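
For context, the restore_command on these replicas is essentially a synchronous call into WAL-e, along these lines (the paths, envdir setup and prefetch value are illustrative assumptions, not the actual configured command):

# recovery configuration on the replica -- illustrative sketch only
restore_command = '/usr/bin/envdir /etc/wal-e.d/env wal-e wal-fetch --prefetch 32 "%f" "%p"'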

At first, the OnGres team proposed increasing the prefetch (concurrency level) to 32; see production#1703 (comment 300020954). It did not do much. Nor did we observe WAL-e launching 33 processes at once; at most, a handful. Maybe its logic is smarter than we are, but the fact, backed by the download rate observed in the logs, is that regardless of the prefetch value, WAL-e does not deliver more than 2-3 WAL segments per second. And at times the production rate on the master is higher than that.

Extra item: WAL applying and WAL fetching are in lock-step

Based on the above, and focusing only on the second issue (WAL-e speed), we might be tempted to think that a faster fetcher would do the job. Faster fetching is demonstrably possible: a prefetch script that I wrote (production#1703 (comment 300658210)) was able to pull WAL segments at a rate of exactly pi segments per second with four threads, and it scales almost linearly. Yet a tool like WAL-G, which is being discussed elsewhere and is probably a good idea anyway, may still not provide a significant benefit here.

The reason is that this fetch, even if parallel, faster and/or prefetching, is still synchronous with the call from Postgres. That is, at the moment Postgres calls the fetcher, it is sitting idle with no WAL to apply; it has to wait until the fetching mechanism brings back some files, and that takes time regardless. In the meantime, a conflicting query may have started and been let in, which means Postgres will then wait another 10 seconds or more before it can apply the fetched WALs. And so on.

In other words: we could see some improvement with other WAL fetchers, like WAL-G or custom scripts, but because they are called synchronously, while WAL production on the master is asynchronous with respect to this process, files are not downloaded as they are produced. This forces the replica into periods of stalled WAL processing that contribute to replica lag, especially when the master is under heavy, non-stop activity.

Potential solutions

There are several possible solutions. They require further discussion, since some involve considerably more implementation (and even development) effort, so this is not only a technical decision. Because some of them will take time, temporary measures (the short-term solutions below) are also proposed.

If no solution is implemented, it is highly likely that these servers will lag again sooner rather than later, possibly within a matter of days (depending, obviously, on database activity).

Short-term solutions

IMHO, none of these guarantees lag-free replicas, and manual intervention may still be required to fix excessive lag. Some of them can be combined.

  • Decrease max_standby_archive_delay to the lowest value the reporting workload can tolerate. In any case, do not return it to the previous 10min.
  • Switch to WAL-G for faster, parallel WAL retrieval.
  • Write a permanently-running prefetcher script. Periodically, it wakes up and fetches a whole "logical" WAL file's worth of segments (256 x 16MB files). It could be based on the existing script I wrote (production#1703 (comment 300658210)), and further improved to call the GCP CLI directly, rather than wal-e, to fetch the individual files. Then adjust restore_command so that it first looks for the file in the prefetch cache and, if it is not there, calls WAL-e as it does now (a sketch of such a wrapper follows this list).
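
As a rough illustration of the last item, restore_command could point at a small wrapper that prefers a locally pre-filled cache and only falls back to WAL-e on a miss. All names and paths below are hypothetical; this is a sketch, not a tested implementation:

#!/bin/bash
# hypothetical wrapper, e.g. restore_command = '/usr/local/bin/wal-restore-cached.sh "%f" "%p"'
set -euo pipefail

WAL_NAME="$1"                                  # %f: WAL segment Postgres is asking for
DEST_PATH="$2"                                 # %p: path where Postgres expects it
CACHE_DIR="/var/lib/postgresql/wal-prefetch"   # filled by the permanently-running prefetcher

if [ -f "${CACHE_DIR}/${WAL_NAME}" ]; then
    # cache hit: serve the segment the background prefetcher already downloaded
    mv "${CACHE_DIR}/${WAL_NAME}" "${DEST_PATH}"
else
    # cache miss: fall back to the current WAL-e based fetch (paths are assumptions)
    /usr/bin/envdir /etc/wal-e.d/env wal-e wal-fetch "${WAL_NAME}" "${DEST_PATH}"
fi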

Medium-term solutions

These solutions are based on the idea that, given the high rate of WAL production, WAL download should not happen in lockstep with WAL apply. Instead, the goal is to couple WAL downloads to WAL production (i.e. as soon as a WAL segment is produced, download it and cache it locally on the replicas). For this we need some kind of "synchronization" between the master and the replicating nodes.

Possible solutions, subject to further discussion:

  • Use Cloud Pub/Sub (or any queue system, for that matter) to publish an event whenever a WAL file is produced. The master would publish it as part of archive_command (or, better, from an asynchronous command fired by archive_command, whose result never affects the outcome of the archive command itself). A consumer of this pubsub on the replicas would watch for new items and, when one is published, fetch the file from GCS and store it in the local cache. restore_command then checks the cache and, if the file is missing (it shouldn't be), fetches it via wal-e or other means. Note that only the file name / path needs to be carried in the Pub/Sub payload. A sketch of the publishing side follows this list.

  • Use Kafka to store and retrieve WAL segments. In this case the payload is the WAL segment itself (16MB, typically 4-5MB when compressed; Kafka supports up to 512MB payloads). The master's archive_command would store the WAL file in Kafka, and restore_command would work much like in the previous item, but fetching the WALs from Kafka. This simplifies the architecture a bit, at the expense of (the complexity of) a separate Kafka cluster. However, if this approach were extended to also cover base backups, it could be a great foundation for completely offloading replica provisioning, restoring and cloning onto the Kafka cluster. This is a solution I already hinted at before (https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7293#note_258902647) and I'd like to insist on it again ;)

  • Use logical replication for the reporting nodes. Logical replication completely avoids the "incoming changes held back by conflicting read-only operations" problem: incoming changes are treated like normal write operations, ordered by Postgres's regular locking, and they do not queue up the way WAL does. However, this is challenging: for one, Postgres logical replication does not replicate events such as DDL changes or sequence increments, so it would require significant work. But it would avoid most if not all of the problems described, and may be the only solution that can tolerate an arbitrarily high query timeout (for the reporting queries running on these servers). Deep discussion would be needed here.
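
To make the first option a bit more concrete, a minimal sketch of the publishing side could be a wrapper like the one below (the topic name, paths and wal-e invocation are assumptions). The consumer on the replicas would simply pull these messages, copy the named segment from GCS into the local cache, and let a restore_command wrapper like the one sketched under the short-term solutions pick it up:

#!/bin/bash
# hypothetical archive_command wrapper, e.g. archive_command = '/usr/local/bin/wal-archive-notify.sh %p %f'
WAL_PATH="$1"   # %p: path of the segment to archive
WAL_NAME="$2"   # %f: its file name; enough as the Pub/Sub payload

# real archiving first; only its exit code is reported back to Postgres
/usr/bin/envdir /etc/wal-e.d/env wal-e wal-push "${WAL_PATH}"
RC=$?

if [ "${RC}" -eq 0 ]; then
    # best-effort, asynchronous notification: must never change the archive_command outcome
    nohup gcloud pubsub topics publish wal-archived --message="${WAL_NAME}" >/dev/null 2>&1 &
fi

exit "${RC}"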

/cc @Finotto

[edit: added the logical replication option, instead of proposing it on a separate issue]
