[Meta] Disaster recovery for everything that is not the database
@sytses @briann even though VPN is boring, it only solves one problem, and we are trying to tackle a number of other issues as well, for example:
Adding to this, there are some other ideas that bring value: having all traffic come through the HA Teleport proxies means easier monitoring of inbound traffic for suspicious logins, and it will allow us to separate git sessions from admin sessions at the network level.
So, a VPN is nice, but it's just one side of the problem; there are many more things that we need to consider here.
I agree that a VPN is the boring solution; however, it doesn't solve the problems @pcarranza mentioned, and it also requires standing up internal DNS and CA services. I was really hoping we could get Teleport deployed, as it would be much simpler and accomplish many more of my goals.
@briann I will need your help with the security side of things to have a plan for what the fleet should look like and for how we get ready for and handle security incidents.
Multiple forms of backups for the database, including disk snapshotting for quick recovery.
This is being dealt with, at minimum, by WAL-E, which will allow us to revert to any minute within the past 5 days. This can and will be applied to any Postgres server via Chef, not just the main cluster database.
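To make that recovery path concrete, here is a minimal point-in-time-restore sketch using WAL-E's documented commands. It assumes wal-e is already installed and configured through environment variables (WALE_S3_PREFIX and credentials, the kind of thing Chef would lay down); the data directory path and target timestamp below are placeholders, not values from this issue.

```python
"""Minimal WAL-E point-in-time-recovery sketch (recovery.conf style, i.e. pre-Postgres-12)."""
import os
import subprocess

PGDATA = "/var/opt/gitlab/postgresql/data"        # placeholder data directory
RECOVERY_TARGET_TIME = "2016-09-01 12:34:00 UTC"  # any minute inside the WAL retention window

# 1. Pull the latest base backup into an empty data directory.
subprocess.check_call(["wal-e", "backup-fetch", PGDATA, "LATEST"])

# 2. Write recovery.conf so Postgres replays WAL segments fetched by wal-e
#    until it reaches the requested timestamp.
recovery_conf = """\
restore_command = 'wal-e wal-fetch "%f" "%p"'
recovery_target_time = '{time}'
""".format(time=RECOVERY_TARGET_TIME)

with open(os.path.join(PGDATA, "recovery.conf"), "w") as f:
    f.write(recovery_conf)

# 3. Starting Postgres now triggers WAL replay up to recovery_target_time.
print("recovery.conf written; start Postgres to begin point-in-time recovery")
```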
@pcarranza I've added the issues next to their relevant tasks.
Can you please clarify number 5 in the list (pushing a package that causes loss)? What kind of loss are we talking about here? Production data loss or just files from the packages bucket?
@pcarranza The package server will have checksums for all packages stored there. On the user side, the package manager will compare the checksum of the downloaded package and make a decision. When users add our repository for the first time they also fetch the GPG key for this repository. From then on, if a package has the correct checksum, the package manager will trust that it has the correct file. Lose the keys, get a problem.
A step further would be individual package signatures, but that is a different topic.
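Roughly, the client-side check described above boils down to comparing the checksum the package manager already knows from the GPG-signed repository metadata against the hash of the file it just downloaded. A small sketch of that comparison (the package file name and expected digest are placeholders, not real values):

```python
"""Illustration of the package-manager-side checksum check described above."""
import hashlib

PACKAGE_PATH = "gitlab-ce_8.12.0-ce.0_amd64.deb"  # hypothetical downloaded package
EXPECTED_SHA256 = "0123abcd..."                    # value taken from the signed repo metadata


def sha256sum(path, chunk_size=1024 * 1024):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if sha256sum(PACKAGE_PATH) == EXPECTED_SHA256:
    print("checksum matches the signed metadata; package is trusted")
else:
    print("checksum mismatch; refuse to install the package")
```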
What I hear is that we don't have to worry about the package getting corrupted in transport in any way.
I wouldn't go that far. I would just state that we are reasonably OK as long as we don't lose the keys, and that in general there is a way to further improve this.
I would just state that we are running a Rails app (a codebase we do not own) on an Ubuntu 14.04 server, so there is a more immediate worry than whether the package gets corrupted in transfer (the transfer also goes through HTTPS from S3).
Do we have a way of removing/banning a faulty package from the package server?
Removing a package from the package server is very simple; it can be done through the interface or from the CLI with the appropriate gem.
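For illustration only: if the package server is packagecloud (an assumption on my part, the comment above only says "the appropriate gem"), the CLI removal is a single yank against the repository. The repo, distro, and package names below are made up.

```python
"""Hypothetical CLI removal of a faulty package via the package_cloud gem."""
import subprocess

subprocess.check_call([
    "package_cloud", "yank",
    "example-org/example-repo/ubuntu/trusty",  # hypothetical repo/distro path
    "gitlab-ce_8.12.0-ce.0_amd64.deb",         # hypothetical package file to remove
])
```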
@sytses that's not a disaster recovery concern but an availability one. Still, I opened an issue about it so we can discuss it, as we currently do not have any line of defense for that.
marked the checklist item 12. Assign a data durability owner that will be responsible for making sure that things work. (gitlab-com/www-gitlab-com!4950 (merged)) as completed
Pablo Carranza [GitLab] changed title from Disaster recovery for everything that is not the database to [Meta] Disaster recovery for everything that is not the database
Git files gone: having snapshots is one thing; having a backup that we can recover from is another.
We currently have disk snapshots stored in a different region, and they have been used to recover data a couple of times already. It's not the greatest, but it works.
I would like to take a second look at this now that we have @ilyaf around. Could you agree with @briann on what the right steps are here? I'm not sure we have any means for identifying this situation, and that would be a good solid first step.
Remove production access from release managers.
I would like to review with @jameslopez and @omame how we can get there. For now we don't have a way of getting there anytime soon.
Also covered by point 1: we are storing the snapshots in a different region, and the backups in Azure and in a different provider (AWS), so we can recover. It will not be fast, but we can recover.
@pcarranza actually, the snapshots are stored in the same region. I'm unsure whether we can even store them in a different region, and if so, we would need to investigate how much time that would add to a restore, if any.
@rspeicher do you happen to remember offhand whether there was an option for which region to create the snapshot in?
@pcarranza @rspeicher: @ilyaf and I just tried to create a disk from a snapshot in a different region and it does not work. Additionally, I don't think it is possible to create a snapshot in a different region either. In order to recover from an Azure region-wide disaster, we are going to need to come up with a different strategy, as the disk snapshots won't work.
Even if we do set it, we still may not be able to create cross-region snapshots, and we definitely cannot create a disk from a snapshot in another region.
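One candidate "different strategy", purely as a sketch and not something this thread has settled on: copy the disk's VHD blob server-side into a storage account that lives in another region, and create disks from that copy when needed. The storage accounts, container, and blob names below are placeholders, and the azure-storage-blob (v12) Python SDK is my assumption, not the tooling in use here.

```python
"""Sketch: server-side copy of a disk VHD blob into a storage account in another region."""
from azure.storage.blob import BlobServiceClient

SOURCE_CONN_STR = "<connection string for the storage account in region A>"
DEST_CONN_STR = "<connection string for the storage account in region B>"

source = BlobServiceClient.from_connection_string(SOURCE_CONN_STR)
dest = BlobServiceClient.from_connection_string(DEST_CONN_STR)

src_blob = source.get_blob_client(container="vhds", blob="db1-data-disk.vhd")
dst_blob = dest.get_blob_client(container="vhds", blob="db1-data-disk.vhd")

# Asynchronous, server-side copy; for a private source blob the URL generally
# needs a SAS token appended so the destination account can read it.
dst_blob.start_copy_from_url(src_blob.url)

# Poll the copy status on the destination blob.
props = dst_blob.get_blob_properties()
print("copy status:", props.copy.status)
```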