Our current alerting for backups (the ones that go to S3, I believe) involves sending emails. This isn't working because DMARC is not set up. Furthermore, these emails may go to an inbox nobody actively checks.
Backup alerts should instead use Prometheus alerts and be sent to Slack. These alerts should be triggered by a process that periodically (e.g. every hour) checks that we have all the right backups in place. Alerts should be sent when:

- The most recent backup is smaller than 300 GB (our current DB size)
- The last backup is more than N hours old (this depends a bit on how frequently we're running backups)
- The backup process failed for whatever reason. The alert should include the full output, or alternatively store it somewhere and include a link in the output
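As a rough sketch, the three conditions above could map to Prometheus 1.x-style alerting rules like the following. The metric names (`gitlab_com:last_backup_size_bytes`, `gitlab_com:last_backup_age_in_seconds`, `gitlab_com:last_backup_failed`) and thresholds are hypothetical; whatever check process we build would need to export them:

```
# Hypothetical metrics; the periodic backup-check process would export these.
ALERT BackupTooSmall
  IF gitlab_com:last_backup_size_bytes < 300 * 1024 * 1024 * 1024
  FOR 15m
  ANNOTATIONS { summary = "Most recent backup is smaller than 300 GB" }

ALERT BackupTooOld
  IF gitlab_com:last_backup_age_in_seconds > 24 * 60 * 60
  FOR 15m
  ANNOTATIONS { summary = "Last backup is more than 24 hours old" }

ALERT BackupFailed
  IF gitlab_com:last_backup_failed == 1
  FOR 5m
  ANNOTATIONS { summary = "Backup process failed; see stored output" }
```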
Backup monitoring should also produce graphs in Grafana, showing data such as the number of backups, size of the last backup, etc.
This should also include monitoring for LVM snapshots, and Azure disk snapshots.
@maratkalibek I have randomly assigned this to you using a Ruby script. Feel free to re-assign this to somebody else if you're unable to take care of it.
If I might also suggest: in addition to alerts being sent to your Slack channel, it should probably also send SMS/texts to DB admin principals. This could maybe be done using something like IFTTT. Just a suggestion. Thank you for the openness, transparency, and getting things back up; it sounds pretty hairy overall. Glad you're through it.
@felipecalmeida Thanks for the suggestion. Since we already have Prometheus alerting and PagerDuty (which can send SMS) in place, we'll probably go with those two :)
@yorickpeterse @felipecalmeida we just use PagerDuty; that takes care of alerting however it is necessary. No need to reinvent the wheel, we already have one.
I generally say Slack because that's the first place where alerts go, but we have channels for PagerDuty, so we would be alerting to both targets.
For #2 (closed) "The last backup is more than N hours old", would it be better to change it to "The last SUCCESSFUL backup is more than N hours old"?
Backup status should be recorded. However, backup tasks may also exit or be killed without notice.
One step in the backup chain is the application-provided `gitlab-rake backup` task. https://gitlab.com/gitlab-org/gitlab-ce/issues/27434 mentions the use of Prometheus metrics as one mechanism to provide visibility and monitoring of this particular step.
@yorickpeterse @ahanselka @maratkalibek the more I read about streaming physical backups, the more I think we will need to work on monitoring this streaming, and on how much data we are storing in the storage endpoint.
Let me define what a solid disaster recovery looks like, and then we can talk about this.
@elygre we can't use that rake task because of the size of GitLab.com; we need to move to a streaming backup module to get to zero loss in case of a disaster recovery, so we are not going to consider improving that task for the time being.
Maybe over time we can export whatever we do in infrastructure to the application package, but this is not something that we worry about right now.
@pcarranza Yes, that makes perfect sense, and good to hear. I guess I'm just trying to get as much attention as possible on the other issue: for an on-premise installation, the rake task is what we've got, and we, too, need working backups. (That issue was created when I discovered that our operations desk had failed to notice that the rake task had not worked for a month and a half; we really need better visibility into these problems.)
@maratkalibek
This rule only fires if the age of the backup is more than x seconds. The rule assumes that the data source for the backup metric exists (which it won't if the backup stops running... or gets renamed). In other words, the current alert logic is backwards: you want to know that you *have* a recent backup, but right now you are only checking that you don't have any old backups.
```
IF gitlab_com:last_backup_age_in_seconds >= 36 * 60 * 60
```

should be something like

```
IF absent(gitlab_com:last_backup_age_in_seconds) or (gitlab_com:last_backup_age_in_seconds >= 36 * 60 * 60)
```
According to your rule, I have a successful backup of GitLab.com locally on my laptop.
Just to clarify something: by default the Prometheus query will only look at the last hour (`[1h]`) of data. Meaning that if there is no data available for the last hour (because the backup is failing), the query won't return anything.
We are currently changing our database backups to WAL-E.
This means that our backup will be transformed into a near-zero-loss continuous stream to an S3 bucket and an Azure blob. If you want to track progress you can do it in #1152 (closed).
But #1152 (closed) was about setting up WAL-E, which is now complete. Looks like we need to revisit this issue to get some monitoring in place. @maratkalibek is this you, or @ahanselka does this fit with the work you've been doing on WAL-E?
@ernstvn I believe @maratkalibek was working on backup monitoring, but I am unsure of its status. I am not entirely certain, but I suspect it got sidetracked due to our limited resources at the moment.
It is still relevant even after our streaming backups. We will just be checking for something different.
There are a couple of ways to do this, probably more roundabout than you'd like but also compatible with more archivers.
One is to monitor the postgres archive_status directory, e.g. with inotify. It has a simple naming scheme around "NAME.ready" and "NAME.done" segments.
That's a bit roundabout, but would it work?
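To make the archive_status idea concrete, here is a minimal sketch of the counting logic, assuming a plain polling loop rather than inotify (directory path and function name are hypothetical):

```python
import os

def archive_status_counts(directory):
    """Count WAL segments awaiting archival (.ready) and already
    archived (.done) in postgres' archive_status directory."""
    ready = done = 0
    for name in os.listdir(directory):
        if name.endswith(".ready"):
            ready += 1
        elif name.endswith(".done"):
            done += 1
    return {"ready": ready, "done": done}
```

A steadily growing `.ready` count would indicate the archiver (WAL-E, in our case) is falling behind or has stopped working.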
@bjk-gitlab how do you feel about monitoring done segments? That would just be a count (or age) of files that are marked as "done". Would it make sense to pull something like this into the node exporter?
There's another possibility, which is using the logs themselves; that would mean using mtail. I'll pull up the line of code that we could be trapping.
Yes, typically we would expose something like `last_done_backup_file_age_seconds` as a Unix timestamp in seconds. Then we can do `time() - last_done_backup_file_age_seconds` to determine how old the last backup file is.
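A minimal sketch of how such a metric could reach Prometheus via the node exporter's textfile collector, assuming a cron job writes a `.prom` file (the paths and function name are hypothetical; the metric name follows the comment above and, despite the `age` in its name, holds the raw timestamp so that `time() - ...` can compute the age server-side):

```python
import glob
import os

def write_backup_timestamp_metric(archive_status_dir, prom_path):
    """Write the mtime of the newest *.done file as a Prometheus gauge
    in the node exporter's textfile collector format."""
    done_files = glob.glob(os.path.join(archive_status_dir, "*.done"))
    if not done_files:
        # Writing no metric at all lets absent(...) catch the failure,
        # as discussed earlier in this thread.
        return
    newest = max(os.path.getmtime(f) for f in done_files)
    with open(prom_path, "w") as f:
        f.write("# TYPE last_done_backup_file_age_seconds gauge\n")
        f.write("last_done_backup_file_age_seconds %d\n" % int(newest))
```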
So, the cheapest implementation is to parse the logs with mtail, looking for lines like

```
STRUCTURED: time=2017-06-21T16:06:15.082310-00 pid=11041 action=push-wal key=s3://bla/db1/wal_005/00000002000017CC00000026.lzo prefix=db1/ rate=16062.4 seg=00000002000017CC00000026 state=complete
```

and using a counter.
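mtail would be the actual tool here, but the parsing logic it would implement can be sketched in Python against the sample log line above (function names and the field regex are illustrative, not part of any existing tooling):

```python
import re

# key=value fields in a WAL-E "STRUCTURED:" log line
FIELD_RE = re.compile(r"(\w+)=(\S+)")

def parse_structured_line(line):
    """Return the key=value fields of a WAL-E STRUCTURED log line,
    or None if the line is not one."""
    if "STRUCTURED:" not in line:
        return None
    return dict(FIELD_RE.findall(line.split("STRUCTURED:", 1)[1]))

def count_completed_wal_pushes(lines):
    """The counter mtail would maintain: completed push-wal events."""
    count = 0
    for line in lines:
        fields = parse_structured_line(line)
        if fields and fields.get("action") == "push-wal" \
                and fields.get("state") == "complete":
            count += 1
    return count
```

A counter like this, combined with the `absent()` guard discussed above, would cover both "backups are failing" and "backups have silently stopped".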
During our fun B.A.D. times, @ilyaf discovered that Azure snapshots were broken, and had been since June 17th. After some digging, it turned out that the snapshot script didn't have exception handling for a server that didn't have managed disk snapshots. We've fixed that in gitlab-cookbooks/gitlab-backup!13 (merged), but we need to be sure that we monitor for this in case something similar happens. Since the resource group was still being created, and we count the number of resource groups that exist in order to clean up old ones, we stopped having any valid backups after June 24th.
The workaround for this would be setting the environment variable WALE_LOG_DESTINATION to point to a different log file and directory, so we could actually separate this.
The problem is that, as far as I've seen, WAL-E does not support logging to a file: