Corrective action: use WAL-G --turbo on restore commands so replicas do not fall further behind
Summary
During the incident, a replica fell behind the leader. To let it catch up, the WAL-G --turbo flag was required to speed up the restore of WAL (transaction log) files from the WAL-G GCS archive location. This was not common knowledge for the EOC, so ideally we should add the --turbo flag to any documentation or script that performs a wal-g wal-fetch or wal-g backup-fetch.
It is common practice to throttle database backup operations to avoid performance impact on database nodes during backup tasks. Database restore operations, however, should run as fast as possible to avoid node unavailability or project delays, so it should be safe to explicitly remove throttling from wal-g wal-fetch and wal-g backup-fetch operations.
WAL-G's --turbo flag is available to "Ignore all kinds of throttling defined in config".
Code references:
- https://github.com/wal-g/wal-g/blob/master/cmd/pg/pg.go#L53
- https://github.com/wal-g/wal-g/blob/master/internal/config.go#L637
- https://github.com/wal-g/wal-g/blob/master/internal/configure.go#L111
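A minimal sketch of how a script could make --turbo the default for restore-side commands. The function name walg_restore and the WALG_BIN and DRY_RUN variables are hypothetical and not part of WAL-G; only the wal-fetch and backup-fetch subcommands and the --turbo flag come from WAL-G itself. It assumes WAL-G is already configured through the environment.

```shell
#!/bin/sh
# Hypothetical wrapper: always pass --turbo on restore-side WAL-G commands.
# WALG_BIN and DRY_RUN are illustrative names, not WAL-G settings.
WALG_BIN="${WALG_BIN:-wal-g}"

walg_restore() {
    # $1 is the WAL-G subcommand; the remaining arguments pass through unchanged.
    subcommand="$1"
    shift
    case "$subcommand" in
        wal-fetch|backup-fetch) ;;
        *)
            # Backup-side commands keep their configured throttling.
            echo "walg_restore: unsupported subcommand: $subcommand" >&2
            return 2
            ;;
    esac
    if [ -n "${DRY_RUN:-}" ]; then
        # Print the command instead of running it.
        printf '%s\n' "$WALG_BIN $subcommand --turbo $*"
    else
        "$WALG_BIN" "$subcommand" --turbo "$@"
    fi
}
```

With DRY_RUN set, walg_restore wal-fetch followed by a WAL segment name and a destination path prints the command it would run, which makes it easy to check a runbook step before executing it. The same flag can be added directly to a PostgreSQL restore_command that calls wal-g wal-fetch with "%f" and "%p".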
Related Incident(s)
Originating issue(s): production#7250 (closed)
Desired Outcome/Acceptance Criteria
At a minimum, the --turbo flag is documented in the runbook as part of the process for helping a replica catch up with WAL-G. In addition, any scripts we use for this that could take the flag should be updated to pass it.
Associated Services
Corrective Action Issue Checklist
- Link the incident(s) this corrective action arose out of
- Give context for what problem this corrective action is trying to prevent from re-occurring
- Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
- Assign a priority (this will default to 'priority::4')