Our current alerting for backups (the ones that go to S3, I believe) involves sending emails. This isn't working because DMARC is not set up. Furthermore, these emails may go to an inbox nobody actively checks.
Backup alerts should instead use Prometheus alerts and be sent to Slack. These alerts should be triggered by a process that periodically (e.g. every hour) checks that we have all the right backups in place. Alerts should be sent when:

- The most recent backup is smaller than 300 GB (our current DB size)
- The last backup is more than N hours old (this depends a bit on how frequently we're running backups)
- The backup process failed for whatever reason. The alert should include the full output, or alternatively store it somewhere and include a link in the output
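As a rough sketch, the three conditions above could map to Prometheus 1.x-style alerting rules like the following. The metric names (`gitlab_com:last_backup_size_bytes`, `gitlab_com:last_backup_age_in_seconds`, `gitlab_com:last_backup_failed`) and thresholds are hypothetical; whatever check process we build would need to export them:

```
# Hypothetical metrics; the periodic backup-check process would export these.
ALERT BackupTooSmall
  IF gitlab_com:last_backup_size_bytes < 300 * 1024 * 1024 * 1024
  FOR 15m
  ANNOTATIONS { summary = "Most recent backup is smaller than 300 GB" }

ALERT BackupTooOld
  IF gitlab_com:last_backup_age_in_seconds > 24 * 60 * 60
  FOR 15m
  ANNOTATIONS { summary = "Last backup is more than 24 hours old" }

ALERT BackupFailed
  IF gitlab_com:last_backup_failed == 1
  FOR 5m
  ANNOTATIONS { summary = "Backup process failed; see stored output" }
```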
Backup monitoring should also produce graphs in Grafana, showing data such as the number of backups, size of the last backup, etc.
This should also include monitoring for LVM snapshots, and Azure disk snapshots.
@maratkalibek I have randomly assigned this to you using a Ruby script. Feel free to re-assign this to somebody else if you're unable to take care of it.
If I might also suggest: in addition to alerts being sent to your Slack channel, it should probably also send SMS/texts to DB admin principals. This could maybe be done using something like IFTTT. Just a suggestion. Thank you for the openness, transparency, and getting things back up; it sounds pretty hairy overall. Glad you're through it.
@felipecalmeida Thanks for the suggestion. Since we already have Prometheus alerting and PagerDuty (which can send SMS) in place, we'll probably go with those two :)
@yorickpeterse @felipecalmeida we just use PagerDuty; that takes care of alerting however it is necessary. No need to reinvent the wheel, we already have one.
I generally say Slack because that's the first place where alerts go, but we have channels for PagerDuty, so we would be alerting to both targets.
For #2 (closed) "The last backup is more than N hours old", would it be better to change it to "The last SUCCESSFUL backup is more than N hours old"?
Backup status should be recorded. However, backup tasks may also exit or be killed without notice.
One step in the backup chain is the application-provided `gitlab-rake backup` task. https://gitlab.com/gitlab-org/gitlab-ce/issues/27434 mentions the use of Prometheus metrics as one mechanism to provide visibility and monitoring of this particular step.
@yorickpeterse @ahanselka @maratkalibek the more I read about streaming physical backups, the more I think we will need to work on monitoring this streaming, and on how much data we are storing in the storage endpoint.
Let me define what a solid disaster recovery looks like, and then we can talk about this.
@elygre we can't use that rake task because of the size of GitLab.com; we need to move to a streaming backup module to get to zero loss in case of a disaster recovery, so we are not going to consider improving that task for the time being.
Maybe over time we can export whatever we do in infrastructure to the application package, but this is not something that we worry about right now.
@pcarranza Yes, that makes perfect sense, and good to hear. I guess I'm just trying to get as much attention as possible on the other issue: for an on-premise installation, the rake task is what we've got, and we, too, need working backups. (That issue was created when I discovered that our operations desk had failed to notice that the rake task had not worked for a month and a half; we really need better visibility into these problems.)
@maratkalibek
This rule only fires if the age of the backup is more than x seconds. The rule assumes that the data source for the backup metric exists (which it won't if the backup stops running... or gets renamed). In other words, the current alert logic is backwards: you want to know that you *have* a recent backup, but right now you are only checking that you don't have any old backups.
```
IF gitlab_com:last_backup_age_in_seconds >= 36 * 60 * 60
```

should be something like

```
IF absent(gitlab_com:last_backup_age_in_seconds) or (gitlab_com:last_backup_age_in_seconds >= 36 * 60 * 60)
```
According to your rule, I have a successful backup of GitLab.com locally on my laptop.
Just to clarify something: by default the Prometheus query will only look at the last hour (`[1h]`) of data. Meaning that if there is no data available for the last hour (because the backup is failing), the query won't return anything.
We are currently changing our database backups to WAL-E.
This means that our backup will be transformed into a near-zero-loss continuous stream to an S3 bucket and an Azure blob. If you want to track progress you can do it in #1152 (closed).
But #1152 (closed) was about setting up WAL-E, which is now complete. Looks like we need to revisit this issue to get some monitoring in place. @maratkalibek is this you, or @ahanselka does this fit with the work you've been doing on WAL-E?
@ernstvn I believe @maratkalibek was working on backup monitoring, but I am unsure of its status. I am not entirely certain, but I suspect it got sidetracked due to our limited resources at the moment.
It is still relevant even after our streaming backups. We will just be checking for something different.
There are a couple of ways to do this, probably more roundabout than you'd like but also compatible with more archivers.
One is to monitor the postgres archive_status directory, e.g. with inotify. It has a simple naming scheme around "NAME.ready" and "NAME.done" segments.
That's a bit roundabout, but would it work?
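To make the archive_status idea concrete, here is a minimal sketch of the counting logic, assuming a plain polling loop rather than inotify (directory path and function name are hypothetical):

```python
import os

def archive_status_counts(directory):
    """Count WAL segments awaiting archival (.ready) and already
    archived (.done) in postgres' archive_status directory."""
    ready = done = 0
    for name in os.listdir(directory):
        if name.endswith(".ready"):
            ready += 1
        elif name.endswith(".done"):
            done += 1
    return {"ready": ready, "done": done}
```

A steadily growing `.ready` count would indicate the archiver (WAL-E, in our case) is falling behind or has stopped working.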
@bjk-gitlab how do you feel about monitoring done segments? That would just be a count (or age) of files that are marked as "done". Would it make sense to pull something like this into the node exporter?
There's another possibility, which is using the logs themselves; that would mean using mtail. I'll pull up the line of code that we could be trapping.
Yes, typically we would expose something like `last_done_backup_file_age_seconds` as a Unix timestamp in seconds. Then we can do `time() - last_done_backup_file_age_seconds` to determine how old the last backup file is.
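A minimal sketch of how such a metric could reach Prometheus via the node exporter's textfile collector, assuming a cron job writes a `.prom` file (the paths and function name are hypothetical; the metric name follows the comment above and, despite the `age` in its name, holds the raw timestamp so that `time() - ...` can compute the age server-side):

```python
import glob
import os

def write_backup_timestamp_metric(archive_status_dir, prom_path):
    """Write the mtime of the newest *.done file as a Prometheus gauge
    in the node exporter's textfile collector format."""
    done_files = glob.glob(os.path.join(archive_status_dir, "*.done"))
    if not done_files:
        # Writing no metric at all lets absent(...) catch the failure,
        # as discussed earlier in this thread.
        return
    newest = max(os.path.getmtime(f) for f in done_files)
    with open(prom_path, "w") as f:
        f.write("# TYPE last_done_backup_file_age_seconds gauge\n")
        f.write("last_done_backup_file_age_seconds %d\n" % int(newest))
```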
So, the cheapest implementation is to parse the logs with mtail, looking for lines like

```
STRUCTURED: time=2017-06-21T16:06:15.082310-00 pid=11041 action=push-wal key=s3://bla/db1/wal_005/00000002000017CC00000026.lzo prefix=db1/ rate=16062.4 seg=00000002000017CC00000026 state=complete
```

and using a counter.
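mtail would be the actual tool here, but the parsing logic it would implement can be sketched in Python against the sample log line above (function names and the field regex are illustrative, not part of any existing tooling):

```python
import re

# key=value fields in a WAL-E "STRUCTURED:" log line
FIELD_RE = re.compile(r"(\w+)=(\S+)")

def parse_structured_line(line):
    """Return the key=value fields of a WAL-E STRUCTURED log line,
    or None if the line is not one."""
    if "STRUCTURED:" not in line:
        return None
    return dict(FIELD_RE.findall(line.split("STRUCTURED:", 1)[1]))

def count_completed_wal_pushes(lines):
    """The counter mtail would maintain: completed push-wal events."""
    count = 0
    for line in lines:
        fields = parse_structured_line(line)
        if fields and fields.get("action") == "push-wal" \
                and fields.get("state") == "complete":
            count += 1
    return count
```

A counter like this, combined with the `absent()` guard discussed above, would cover both "backups are failing" and "backups have silently stopped".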
During our fun B.A.D. times, @ilyaf discovered that Azure snapshots were broken, and had been since June 17th. After some digging, it turned out that the snapshot script didn't have exception handling for a server that didn't have managed disk snapshots. We've fixed that in gitlab-cookbooks/gitlab-backup!13 (merged), but we need to be sure that we monitor for this in case something similar happens. Since the resource group was still being created, and we count the number of resource groups that exist in order to clean up old ones, we stopped having any valid backups after June 24th.
The workaround for this would be setting the environment variable WALE_LOG_DESTINATION to point to a different log file and directory, so we could actually separate this.
The problem is that, as far as I've seen, WAL-E does not support logging to a file: