Indicator: Number of WAL segments pending archival
Source: Query primary (see below)
Action: Pause migration if the pending WAL queue length is greater than a configurable threshold
Parameter: threshold
Needs Prometheus: No
WAL archival is crucial to meeting database recovery objectives and, depending on the implementation, potentially also to availability in failover scenarios.
Once a certain threshold (configurable, defaults to what's necessary for GitLab.com) is crossed, we stop data migrations to let the system recover and work on the backlog of WAL segments before the migration resumes.
The number of WAL segments pending archival can be retrieved with the query below or through Prometheus.
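For illustration, a minimal sketch of one way to get that number from the primary. This is not necessarily the exact query behind the Prometheus metric; the constant name and the use of `ApplicationRecord` are assumptions for this example. It counts the `*.ready` marker files in `pg_wal/archive_status`, i.e. WAL segments the archiver has not processed yet (`pg_ls_archive_statusdir()` requires PostgreSQL 12+ and the `pg_monitor` role or superuser):

```ruby
# Sketch only: count WAL segments still waiting for the archiver.
PENDING_WAL_COUNT_SQL = <<~SQL
  SELECT COUNT(*) AS pending_wal_count
  FROM pg_ls_archive_statusdir()
  WHERE name LIKE '%.ready'
SQL

pending_wal_count = ApplicationRecord.connection.select_value(PENDING_WAL_COUNT_SQL).to_i
```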
@krasio Umh, so I "stole" this from here, which is what we're currently using to report the "number of WAL files pending archival" metric to Prometheus.
Since this worked (and you are right about the PG10 rename), I was curious and found that we still seem to have some sort of backwards compatibility in place:
```
gitlabhq_production=> \df pg_current_xlog_location
                                List of functions
 Schema |           Name           | Result data type | Argument data types | Type
--------+--------------------------+------------------+---------------------+------
 public | pg_current_xlog_location | pg_lsn           |                     | func
(1 row)

CREATE OR REPLACE FUNCTION public.pg_current_xlog_location()
 RETURNS pg_lsn
 LANGUAGE sql
 STABLE
AS $function$
 SELECT pg_current_wal_lsn();
$function$
```
I'm not aware that this came through gitlab-rails, so I suspect it is .com-specific. I'm thinking it may be time to clean this up, although we would first have to adapt the monitoring queries accordingly.
@alexander-sosna FYI - do you happen to know more context around this?
Note: We should check if any of those functions are not supported by Amazon Aurora. I know that at least pg_is_in_recovery is not supported, so this could be a blocker.
We should not release this until #342093 (closed) is completed and we add a release post announcement with the consensus from gitlab-org/gitlab#342542. I expect this to be done mid next milestone, so we will be able to add this starting in %15.0.
Edit: maybe %15.1 if we want to allow those instances to upgrade to %15.0 before switching databases. We'll have to think about that.
@iroussos Thanks, will keep this in mind. I think we may even be able to skip the pg_is_in_recovery part - if we're trying to execute a background migration and the primary database is in recovery (not sure if both of these can happen at the same time), we probably have bigger problems.
Also we (at least I) do not know the exact implementation we're going to have, as @abrandl mentioned in one of the related issues that if such signals fail we'll just ignore them and move on:
> In any case, we'll implement this signal checking as optional - in case of a lack of permissions, we simply won't have this signal to work with (and don't error out).
Right @krasio - let's implement this fail-safe and carry on in case we are unable to get an indicator.
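A minimal sketch of that fail-safe, with illustrative names only (`Signals::*`, `pending_wal_count`, `threshold` are not the real API): any error while fetching the metric, including missing permissions or unsupported functions, is downgraded to an "unknown" signal instead of raising.

```ruby
# Sketch only; class and method names are illustrative.
def evaluate
  pending = pending_wal_count

  if pending > threshold
    Signals::Stop.new(self.class, reason: "#{pending} WAL segments pending archival")
  else
    Signals::Normal.new(self.class, reason: 'WAL archival queue is below the threshold')
  end
rescue StandardError => e
  # Missing permissions, unsupported functions (e.g. on Aurora), connection
  # issues: report "unknown" and let the migration carry on.
  Signals::Unknown.new(self.class, reason: e.message)
end
```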
@iroussos Aren't we breaking stuff for Aurora today anyways? Sorry I haven't been following, but isn't our load balancing code already not compatible with Aurora?
> [..] let's implement this fail-safe and carry on in case we are unable to get an indicator.
I would say that if pg_is_in_recovery is important, we should hold off for a milestone and properly release the feature. I am personally not sure if other functions like last_archived_wal are also unsupported, but @mattkasa could check the query while working on #342093 (closed) if we really want to get this out.
Or we should catch and recover from unknown functions, but this may be tricky and add too much work.
```
ERROR: Function pg_last_xact_replay_timestamp() is currently not supported for Aurora
```
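If we did want to catch and recover from unsupported functions explicitly, rather than relying only on the generic fail-safe above, one option is to probe the indicator's query once and disable it when the database rejects it. A rough sketch with made-up names (`indicator_available?`, `indicator_query` are not real methods):

```ruby
# Sketch: run the indicator's query once and remember whether the target
# database (e.g. Aurora) accepts it.
def indicator_available?
  return @indicator_available if defined?(@indicator_available)

  @indicator_available =
    begin
      ApplicationRecord.connection.select_value(indicator_query)
      true
    rescue ActiveRecord::StatementInvalid
      # e.g. "Function ... is currently not supported for Aurora"
      false
    end
end
```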
> Aren't we breaking stuff for Aurora today anyways? Sorry I haven't been following, but isn't our load balancing code already not compatible with Aurora?
Only load balancing is not working right now and there are multiple instances running GitLab with Aurora (even above 2k users), just with load balancing disabled. We don't want to break those instances by shipping a feature that will have to run on all instances when a background migration is executed.
If one is on Aurora, their migration doesn't slow down based on those signals today and won't slow down when we release it, because their database is unsupported. I don't see harm in that, do you?
Even if we don't want to release the feature right away, I don't see why we would need to hold off from implementing this. It's useful for .com and can be behind a feature flag easily (and it can be made optional, see above).
@abrandl We are covered as long as this is not causing errors for instances running with Aurora. That was the purpose of my original note: we should either be careful not to run this query on Aurora or recover from errors; otherwise we should wait.
As long as we do not break those instances, we can address this in any way we deem fit: we can feature flag it, recover from errors or approach it in a different way :-)
Nice, thanks @krasio! I'll push a few pieces for working with signals next, and then we can plug this in.
We discussed the configuration aspect on a call today - you called out that we need a reasonable place to store those settings. I wonder if, for starters, we could have them as a JSON object on individual migration records? This way, we can change them if really needed (and on a per-migration basis). Maybe that's overengineered, but it would give us an easy start as we would just set up some defaults and use those for .com.
Actually the same is true for just hard-coding those thresholds for now and tuning them for .com (as long as stuff is behind a feature flag). And before we release this, we add a UI and more flexible storage for the settings.
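A rough sketch of the per-record settings idea, assuming a hypothetical jsonb settings column on the migration records and a hard-coded default (the column name `health_settings` and the default value are made up for illustration):

```ruby
# Sketch: read the threshold from a per-migration jsonb settings column when
# present, otherwise fall back to a hard-coded default tuned for GitLab.com.
DEFAULT_PENDING_WAL_THRESHOLD = 42 # illustrative default, see the threshold discussion below

def pending_wal_threshold(migration)
  (migration.health_settings || {}).fetch('pending_wal_threshold', DEFAULT_PENDING_WAL_THRESHOLD)
end
```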
@iroussos Yes, let's aim for %15.0. We need the rest of what Andreas is working on to plug this in, and we also need to make this a bit more robust so it does not fail when some of the functions used are not supported (e.g. when Aurora is used).
Hi @krasio I don't have a strong opinion on this, but here are my two cents.
When I look at the pending WAL files during the last 12 weeks / last year, it looks like the normal noise floor is below 25.
Sometimes it peaks, but on most days it does not reach 50.
Somewhere between 25 and 50, I would assume we have more pending segments than usual.
A value in between could be a good starting point, let's say 42.
42 * 16 MB = 672 MB. On an average day we produce between 30 MB/s and 65 MB/s of WAL, so this threshold equals approximately 10 s to 22 s of lagging behind with archiving. But after decomposition has finished we are down to between 10 MB/s and 40 MB/s, so we might want to revisit some historical thresholds anyway.
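For reference, the arithmetic behind those numbers (16 MB is the default PostgreSQL WAL segment size):

```ruby
threshold_segments = 42
segment_size_mb    = 16
threshold_mb       = threshold_segments * segment_size_mb # => 672 MB

threshold_mb / 65.0 # => ~10.3 s of archiving lag at 65 MB/s of WAL
threshold_mb / 30.0 # => ~22.4 s of archiving lag at 30 MB/s of WAL
```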
@alexives To make the NEXT review smooth, I'm temporarily moving this issue back to %15.2. I'll let you know when it's fine to assign %15.3. See Slack message for context. Thank you!
I think yes, and we can close. I was wanting to enable the feature flag by default, but for now we can leave it as is (disabled by default, of type ops) and revisit later if we want to enable it by default for self-managed, which will mostly require working on some docs.