Include DB backup job monitoring, alerting, remediation, and automated restoration testing procedures on the Database handbook page
The purpose of this issue is to document suggested additions to https://about.gitlab.com/handbook/engineering/infrastructure/database/
Backup Job Logging & Monitoring -
- How are we capturing backup job status (success/failure)?
- How do we evidence completeness of the daily full snapshot upon success?*
- Upon failure, is the team notified? How? Is this automated?
- What criteria are used to address a failure, i.e. in what circumstances is the job re-run or not re-run?
Rationale for inclusion on database page: Having the above procedures documented and followed will help us evidence compliance with standard security framework control activities related to Computer Operations/Operations Security. Additionally, it will make evidence gathering easier in the event of a future audit.
*Auditors will ask this question, so it is advantageous to include information on how we evidence completeness of the data set, beyond simply relying on the success message.
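To make the questions above concrete, here is a minimal sketch of how a job outcome could be classified, with completeness checked separately from the exit status. Everything in it is a hypothetical illustration rather than a description of the current tooling: the `BackupResult` fields, the `expected_min_bytes` threshold (assumed to be derived from, say, the previous snapshot size), and the `max_attempts` re-run rule are all assumptions. A real completeness check might compare row counts or checksums instead of size.

```python
from dataclasses import dataclass


@dataclass
class BackupResult:
    # All fields are hypothetical; a real job would report its own metadata.
    job_id: str
    exit_code: int           # 0 means the backup tool reported success
    bytes_written: int       # size of the snapshot that was produced
    expected_min_bytes: int  # assumed threshold, e.g. based on the last snapshot
    attempt: int             # 1 for the first run, 2 for the first re-run, ...


def evaluate_backup(result: BackupResult, max_attempts: int = 2) -> dict:
    """Classify a backup run and decide whether to alert and re-run."""
    succeeded = result.exit_code == 0
    # Completeness is checked on top of the success message, not instead of it.
    complete = succeeded and result.bytes_written >= result.expected_min_bytes

    if complete:
        status = "success"
    elif succeeded:
        status = "incomplete"  # tool said OK, but the snapshot looks too small
    else:
        status = "failed"

    alert = status != "success"
    # Assumed remediation rule: re-run until max_attempts is reached.
    rerun = alert and result.attempt < max_attempts
    return {"job_id": result.job_id, "status": status, "alert": alert, "rerun": rerun}
```

Returning a plain record like this keeps the job status, the completeness evidence, and the re-run decision in one place that can be logged and later shown to an auditor.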
Automated Restoration Testing -
- How often is a test performed? Include the frequency.
- How is the test logged?
- How is success/failure evidenced?
- Who reviews completed tests, and how often?
Procedures should cover recording each of these automated tests along with evidence of its success or failure. I would suggest capturing this in a log that Infrastructure Management could review on a recurring basis.
Rationale for inclusion on database page: Documentation and supporting evidence that you are carrying out these automated validations ("restorations are possible and are working") supports GitLab's Disaster Recovery/Business Continuity posture and can serve as evidence of compliance with routine Computer Operations control activities.
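As one possible shape for the restoration test log suggested above, the sketch below writes one JSON line per test and produces a short summary a reviewer could check on a recurring basis. The field names, the `checks_passed`/`checks_total` counters, and both function names are hypothetical, chosen only for illustration.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def make_log_entry(test_id: str, restored_ok: bool, checks_passed: int,
                   checks_total: int, when: Optional[datetime] = None) -> str:
    """Build one JSON log line for a single automated restoration test."""
    when = when or datetime.now(timezone.utc)
    # A test counts as a success only if the restore finished AND every
    # post-restore validation check passed.
    success = restored_ok and checks_passed == checks_total
    entry = {
        "test_id": test_id,
        "timestamp": when.isoformat(),
        "restored_ok": restored_ok,
        "checks_passed": checks_passed,
        "checks_total": checks_total,
        "result": "success" if success else "failure",
    }
    return json.dumps(entry, sort_keys=True)


def summarize(lines: list) -> dict:
    """Summarize log lines for a recurring management review."""
    entries = [json.loads(line) for line in lines]
    failed = [e["test_id"] for e in entries if e["result"] != "success"]
    return {"total": len(entries), "failed": failed}
```

An append-only log of such lines would answer the frequency, logging, and evidence questions in one artifact, and the summary gives the reviewer a fixed thing to sign off on.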