[Meta] Disaster recovery for everything that is not the database
@sytses @briann even though VPN is boring, it only solves one problem, and we are trying to tackle a number of other issues as well, for example:
Adding to this, there are some other ideas that bring value: having all traffic come through the HA Teleport proxies means easier monitoring of inbound traffic for suspicious logins, and it will allow us to separate git sessions from admin sessions at the network level.
So, a VPN is nice, but it's just one side of the problem; there are many more things that we need to consider here.
I agree that a VPN is the boring solution; however, it doesn't solve the problems @pcarranza mentioned, and it also requires standing up internal DNS and CA services. I was really hoping we could get Teleport deployed, as it would be much simpler and accomplish many more of my goals.
@briann I will need your help with the security side of things to have a plan for what the fleet should look like and for how we get ready for and handle security incidents.
Multiple forms of backups for the database, including disk snapshotting for quick recovery.
This is being dealt with, at minimum, by WAL-E, which will allow us to revert to any minute within the past 5 days. This can and will be applied to any Postgres server via Chef, not just the main cluster database.
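To make that recovery path concrete, here is a minimal point-in-time-restore sketch using WAL-E's documented commands. It assumes wal-e is already installed and configured through environment variables (WALE_S3_PREFIX and credentials, the kind of thing Chef would lay down); the data directory path and target timestamp below are placeholders, not values from this issue.

```python
"""Minimal WAL-E point-in-time-recovery sketch (recovery.conf style, i.e. pre-Postgres-12)."""
import os
import subprocess

PGDATA = "/var/opt/gitlab/postgresql/data"        # placeholder data directory
RECOVERY_TARGET_TIME = "2016-09-01 12:34:00 UTC"  # any minute inside the WAL retention window

# 1. Pull the latest base backup into an empty data directory.
subprocess.check_call(["wal-e", "backup-fetch", PGDATA, "LATEST"])

# 2. Write recovery.conf so Postgres replays WAL segments fetched by wal-e
#    until it reaches the requested timestamp.
recovery_conf = """\
restore_command = 'wal-e wal-fetch "%f" "%p"'
recovery_target_time = '{time}'
""".format(time=RECOVERY_TARGET_TIME)

with open(os.path.join(PGDATA, "recovery.conf"), "w") as f:
    f.write(recovery_conf)

# 3. Starting Postgres now triggers WAL replay up to recovery_target_time.
print("recovery.conf written; start Postgres to begin point-in-time recovery")
```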
@pcarranza I've added the issues next to their relevant tasks.
Can you please clarify number 5 in the list (pushing a package that causes loss)? What kind of loss are we talking about here? Production data loss or just files from the packages bucket?
@pcarranza The package server will have checksums for all packages stored there. On the user side, the package manager will compare the checksum of the downloaded package and make a decision. When users add our repository for the first time they also fetch the GPG key for this repository. From then on, if a package has the correct checksum, the package manager will trust that it has the correct file. Lose the keys, get a problem.
A step further would be individual package signatures, but that is a different topic.
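Roughly, the client-side check described above boils down to comparing the checksum the package manager already knows from the GPG-signed repository metadata against the hash of the file it just downloaded. A small sketch of that comparison (the package file name and expected digest are placeholders, not real values):

```python
"""Illustration of the package-manager-side checksum check described above."""
import hashlib

PACKAGE_PATH = "gitlab-ce_8.12.0-ce.0_amd64.deb"  # hypothetical downloaded package
EXPECTED_SHA256 = "0123abcd..."                    # value taken from the signed repo metadata


def sha256sum(path, chunk_size=1024 * 1024):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if sha256sum(PACKAGE_PATH) == EXPECTED_SHA256:
    print("checksum matches the signed metadata; package is trusted")
else:
    print("checksum mismatch; refuse to install the package")
```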
What I hear is that we don't have to worry about the package getting corrupted in transport in any way.
I wouldn't go that far. I would just state that we are reasonably OK as long as we don't lose the keys, and that in general there is a way to further improve this.
I would just state that we are running a Rails app (a codebase we do not own) on an Ubuntu 14.04 server, so there is a more immediate worry than whether the package gets corrupted in transfer (the transfer also goes through HTTPS from S3).
Do we have a way of removing/banning a faulty package from the package server?
Removing a package from the package server is very simple; it can be done through the interface or from the CLI with the appropriate gem.
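For illustration only: if the package server is packagecloud (an assumption on my part, the comment above only says "the appropriate gem"), the CLI removal is a single yank against the repository. The repo, distro, and package names below are made up.

```python
"""Hypothetical CLI removal of a faulty package via the package_cloud gem."""
import subprocess

subprocess.check_call([
    "package_cloud", "yank",
    "example-org/example-repo/ubuntu/trusty",  # hypothetical repo/distro path
    "gitlab-ce_8.12.0-ce.0_amd64.deb",         # hypothetical package file to remove
])
```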
@sytses that's not a disaster recovery concern but an availability one. Still, I opened an issue about it so we can discuss it, as we currently do not have any line of defense for that.
marked the checklist item 12. Assign a data durability owner that will be responsible for making sure that things work. (gitlab-com/www-gitlab-com!4950 (merged)) as completed
Pablo Carranza [GitLab] changed title from Disaster recovery for everything that is not the database to [Meta] Disaster recovery for everything that is not the database
Git files gone: having snapshots is one thing; having a backup that we can recover from is another.
We currently have disk snapshots stored in a different region, and they have been used to recover data a couple of times already. It's not the greatest, but it works.
I would like to take a second look at this now that we have @ilyaf around. Could you agree with @briann on what the right steps are here? I'm not sure we have any means for identifying this situation, and that would be a good solid first step.
Remove production access from release managers.
I would like to review with @jameslopez and @omame how we can get there. For now we don't have a way of getting there anytime soon.
Also covered by point 1: we are storing the snapshots in a different region, and the backups in Azure and in a different provider (AWS), so we can recover. It will not be fast, but we can recover.
@pcarranza actually, the snapshots are stored in the same region. I'm unsure whether we can even store them in a different region, and if so, we would need to investigate how much time that would add to a restore, if any.
@rspeicher do you happen to remember offhand whether there was an option for which region to create the snapshot in?
@pcarranza @rspeicher: @ilyaf and I just tried to create a disk from a snapshot in a different region and it does not work. Additionally, I don't think it is possible to create a snapshot in a different region either. In order to recover from an Azure region-wide disaster, we are going to need to come up with a different strategy, as the disk snapshots won't work.
Even if we do set it, we still may not be able to create cross-region snapshots, and we definitely cannot create a disk from a snapshot in another region.
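One candidate "different strategy", purely as a sketch and not something this thread has settled on: copy the disk's VHD blob server-side into a storage account that lives in another region, and create disks from that copy when needed. The storage accounts, container, and blob names below are placeholders, and the azure-storage-blob (v12) Python SDK is my assumption, not the tooling in use here.

```python
"""Sketch: server-side copy of a disk VHD blob into a storage account in another region."""
from azure.storage.blob import BlobServiceClient

SOURCE_CONN_STR = "<connection string for the storage account in region A>"
DEST_CONN_STR = "<connection string for the storage account in region B>"

source = BlobServiceClient.from_connection_string(SOURCE_CONN_STR)
dest = BlobServiceClient.from_connection_string(DEST_CONN_STR)

src_blob = source.get_blob_client(container="vhds", blob="db1-data-disk.vhd")
dst_blob = dest.get_blob_client(container="vhds", blob="db1-data-disk.vhd")

# Asynchronous, server-side copy; for a private source blob the URL generally
# needs a SAS token appended so the destination account can read it.
dst_blob.start_copy_from_url(src_blob.url)

# Poll the copy status on the destination blob.
props = dst_blob.get_blob_properties()
print("copy status:", props.copy.status)
```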