Indicator: Number of WAL segments pending archival
Source: Query primary (see below)
Action: Pause migration if the pending WAL queue length is greater than a configurable threshold
Parameter: threshold
Needs Prometheus: No
WAL archival is crucial to meeting database recovery objectives and, depending on the implementation, potentially also to availability in failover scenarios.
Once a certain threshold (configurable, defaults to what's necessary for GitLab.com) is crossed, we stop data migrations to let the system recover and work on the backlog of WAL segments before the migration resumes.
The number of WAL segments pending archival can be retrieved with the query below or through Prometheus.
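For illustration, a minimal sketch of one way to get that number from the primary. This is not necessarily the exact query behind the Prometheus metric; the constant name and the use of `ApplicationRecord` are assumptions for this example. It counts the `*.ready` marker files in `pg_wal/archive_status`, i.e. WAL segments the archiver has not processed yet (`pg_ls_archive_statusdir()` requires PostgreSQL 12+ and the `pg_monitor` role or superuser):

```ruby
# Sketch only: count WAL segments still waiting for the archiver.
PENDING_WAL_COUNT_SQL = <<~SQL
  SELECT COUNT(*) AS pending_wal_count
  FROM pg_ls_archive_statusdir()
  WHERE name LIKE '%.ready'
SQL

pending_wal_count = ApplicationRecord.connection.select_value(PENDING_WAL_COUNT_SQL).to_i
```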
@krasio Umh, so I "stole" this from here, which is what we're currently using to report the "number of WAL files pending archival" metric to Prometheus.
Since this worked (and you are right about the PG10 rename), I was curious and found that we still seem to have some sort of backwards compatibility in place:
```
gitlabhq_production=> \df pg_current_xlog_location
                                List of functions
 Schema |           Name           | Result data type | Argument data types | Type
--------+--------------------------+------------------+---------------------+------
 public | pg_current_xlog_location | pg_lsn           |                     | func
(1 row)

CREATE OR REPLACE FUNCTION public.pg_current_xlog_location()
 RETURNS pg_lsn
 LANGUAGE sql
 STABLE
AS $function$
 SELECT pg_current_wal_lsn();
$function$
```
I'm not aware that this came through gitlab-rails, so I suspect it is .com-specific. I'm thinking it may be time to clean this up, although we would first have to adapt the monitoring queries accordingly.
@alexander-sosna FYI - do you happen to know more context around this?
Note: We should check if any of those functions are not supported by Amazon Aurora. I know that at least pg_is_in_recovery is not supported, so this could be a blocker.
We should not release this until #342093 (closed) is completed and we add a release post announcement with the consensus from gitlab-org/gitlab#342542. I expect this to be done mid next milestone, so we will be able to add this starting in %15.0.
Edit: maybe %15.1 if we want to allow those instances to upgrade to %15.0 before switching databases. We'll have to think about that.
@iroussos Thanks, will keep this in mind. I think we may even be able to skip the pg_is_in_recovery part - if we're trying to execute a background migration and the primary database is in recovery (not sure if both of these can happen at the same time), we probably have bigger problems.
Also we (at least I) do not know the exact implementation we're going to have, as @abrandl mentioned in one of the related issues that if such signals fail we'll just ignore them and move on:
> In any case, we'll implement this signal checking as optional - in case of a lack of permissions, we simply won't have this signal to work with (and don't error out).
Right @krasio - let's implement this fail-safe and carry on in case we are unable to get an indicator.
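A minimal sketch of that fail-safe, with illustrative names only (`Signals::*`, `pending_wal_count`, `threshold` are not the real API): any error while fetching the metric, including missing permissions or unsupported functions, is downgraded to an "unknown" signal instead of raising.

```ruby
# Sketch only; class and method names are illustrative.
def evaluate
  pending = pending_wal_count

  if pending > threshold
    Signals::Stop.new(self.class, reason: "#{pending} WAL segments pending archival")
  else
    Signals::Normal.new(self.class, reason: 'WAL archival queue is below the threshold')
  end
rescue StandardError => e
  # Missing permissions, unsupported functions (e.g. on Aurora), connection
  # issues: report "unknown" and let the migration carry on.
  Signals::Unknown.new(self.class, reason: e.message)
end
```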
@iroussos Aren't we breaking stuff for Aurora today anyways? Sorry I haven't been following, but isn't our load balancing code already not compatible with Aurora?
> [..] let's implement this fail-safe and carry on in case we are unable to get an indicator.
I would say that if pg_is_in_recovery is important, we should hold off for a milestone and properly release the feature. I am personally not sure if other functions like last_archived_wal are also unsupported, but @mattkasa could check the query while working on #342093 (closed) if we really want to get this out.
Or we should catch and recover from unknown functions, but this may be tricky and add too much work.
```
ERROR: Function pg_last_xact_replay_timestamp() is currently not supported for Aurora
```
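If we did want to catch and recover from unsupported functions explicitly, rather than relying only on the generic fail-safe above, one option is to probe the indicator's query once and disable it when the database rejects it. A rough sketch with made-up names (`indicator_available?`, `indicator_query` are not real methods):

```ruby
# Sketch: run the indicator's query once and remember whether the target
# database (e.g. Aurora) accepts it.
def indicator_available?
  return @indicator_available if defined?(@indicator_available)

  @indicator_available =
    begin
      ApplicationRecord.connection.select_value(indicator_query)
      true
    rescue ActiveRecord::StatementInvalid
      # e.g. "Function ... is currently not supported for Aurora"
      false
    end
end
```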
> Aren't we breaking stuff for Aurora today anyways? Sorry I haven't been following, but isn't our load balancing code already not compatible with Aurora?
Only load balancing is not working right now and there are multiple instances running GitLab with Aurora (even above 2k users), just with load balancing disabled. We don't want to break those instances by shipping a feature that will have to run on all instances when a background migration is executed.
If one is on Aurora, their migration doesn't slow down based on those signals today and won't slow down when we release it, because their database is unsupported. I don't see harm in that, do you?
Even if we don't want to release the feature right away, I don't see why we would need to hold off from implementing this. It's useful for .com and can be behind a feature flag easily (and it can be made optional, see above).
@abrandl We are covered as long as this is not causing errors for instances running with Aurora. That was the purpose of my original note: we should either be careful not to run this query on Aurora or recover from errors; otherwise we should wait.
As long as we do not break those instances, we can address this in any way we deem fit: we can feature flag it, recover from errors or approach it in a different way :-)
Nice, thanks @krasio! I'll push a few pieces for working with signals next, and then we can plug this in.
We discussed the configuration aspect on a call today - you called out that we need a reasonable place to store those settings. I wonder if, for starters, we could have them as a JSON object on individual migration records? This way, we can change them if really needed (and on a per-migration basis). Maybe that's overengineered, but it would give us an easy start as we would just set up some defaults and use those for .com.
Actually the same is true for just hard-coding those thresholds for now and tuning them for .com (as long as stuff is behind a feature flag). And before we release this, we add a UI and more flexible storage for the settings.
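A rough sketch of the per-record settings idea, assuming a hypothetical jsonb settings column on the migration records and a hard-coded default (the column name `health_settings` and the default value are made up for illustration):

```ruby
# Sketch: read the threshold from a per-migration jsonb settings column when
# present, otherwise fall back to a hard-coded default tuned for GitLab.com.
DEFAULT_PENDING_WAL_THRESHOLD = 42 # illustrative default, see the threshold discussion below

def pending_wal_threshold(migration)
  (migration.health_settings || {}).fetch('pending_wal_threshold', DEFAULT_PENDING_WAL_THRESHOLD)
end
```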
@iroussos Yes, let's aim for %15.0. We need the rest of what Andreas is working on to plug this in, and we also need to make this a bit more robust so it does not fail when some of the functions used are not supported (e.g. when Aurora is used).
Hi @krasio I don't have a strong opinion on this, but here are my two cents.
When I look at the pending WAL files during the last 12 weeks / last year, it looks like the normal noise floor is below 25.
Sometimes it peaks, but on most days it does not reach 50.
Somewhere between 25 and 50, I would assume we have more pending segments than usual.
A value in between could be a good starting point, let's say 42.
42 * 16 MB = 672 MB. On an average day we produce between 30 MB/s and 65 MB/s of WAL, so this threshold equals approximately 10 s to 22 s of lagging behind with archiving. But after decomposition has finished we are down to between 10 MB/s and 40 MB/s, so we might want to revisit some historical thresholds anyway.
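For reference, the arithmetic behind those numbers (16 MB is the default PostgreSQL WAL segment size):

```ruby
threshold_segments = 42
segment_size_mb    = 16
threshold_mb       = threshold_segments * segment_size_mb # => 672 MB

threshold_mb / 65.0 # => ~10.3 s of archiving lag at 65 MB/s of WAL
threshold_mb / 30.0 # => ~22.4 s of archiving lag at 30 MB/s of WAL
```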
@alexives To make the NEXT review smooth, I'm temporarily moving this issue back to %15.2. I'll let you know when it's fine to assign %15.3. See Slack message for context. Thank you!
I think yes, and we can close. I was wanting to enable the feature flag by default, but for now we can leave it as is (disabled by default, of type ops) and revisit later if we want to enable it by default for self-managed, which will mostly require working on some docs.