We need to write a post-mortem about what happened with the database, how we recovered it, and what we're going to do to prevent this from ever happening again. This blog post should mention the following issues (all related):
- infrastructure#1094 (closed)
- infrastructure#1095 (closed)
- infrastructure#1096 (closed)
- infrastructure#1097 (closed)
- infrastructure#1098 (closed)
- infrastructure#1099 (closed)
- infrastructure#1100 (closed)
- infrastructure#1101 (closed)
- infrastructure#1102 (closed)
- infrastructure#1103 (closed)
- infrastructure#1105 (closed)
I'll start writing a draft tomorrow.
Topics to cover (that I can think of):
- Events leading up to the problem (in case one hasn't read the previous post)
- Events that led to the problem, and how we responded to it at that time
- The recovery procedure
- What kind of data was lost, how much, etc
- The plans for the future (see list of issues above)
Data loss impact from the Google document:
- ±6 hours of data loss
- 4613 regular projects, 74 forks, and 350 imports are lost (roughly); 5037 projects in total. Since Git repositories are NOT lost, we can recreate all of the projects whose user/group existed before the data loss, but we cannot restore any of these projects’ issues, etc.
- ±4979 (so ±5000) comments lost
- 707 users lost potentially, hard to tell for certain from the Kibana logs
- Webhooks created before Jan 31st 17:20 were restored; those created after this time are lost
Is there an estimate of how many webhooks were ultimately lost?
Thank you for the constant updates and transparency during this whole situation. Really shows integrity and responsibility. Thank you to the team members who no doubt stayed late working on getting things resolved. If nothing else, I'm sure this has been a great learning experience! Keep up the great work.
Compliments for the way you guys communicated about the issue. Hope your investors also agree and understand these things can happen.
Thanks a lot. Even though I lost one of my projects (I really did), I'm not worried about it, because I feel you're in charge of everything.
Your transparency is an example of how a truly open source project should work.
This will be a great resource once completed. Can you comment on any RPO/RTO you had expectations of before and also after the event if those have changed?
In my case one commit is lost. I'm on the computer from which I had committed these changes, so the current hash code listed in `git log` is not listed on gitlab. How should I proceed? Thanks.
This draft will also be included: https://docs.google.com/document/u/1/d/1hOzpk3dOhzrlvZH-TrjCGZHC2p_a5Rb6C8mFDcZU0W8/pub
Great work guys. Human errors can happen sometimes. But, I am thrilled to see your transparency while resolving it.
Great work guys, and the transparency is so impressive!
And, I'd like to add my 2 cents here, hope you can consider it.
From an architecture design perspective, a "backup mechanism" is NOT a good way to make sure no data is lost. Backups are periodic, so there will always be a gap of inconsistent data between the last backup and the fault. Master-slave or master-master replication is eventually consistent, so data can still be lost.
Here is a slide from Google I/O in 2009: http://snarfed.org/transactions_across_datacenters_io.html
Data loss is not only caused by human error; data can also be lost due to a power outage, disk damage, a virus, or various other causes. And the best practice for recovering the data is not to copy the backup back, but to make the staging node go live automatically.
So I think the correct way to go is a database HA design. If we need > 99.9% HA, we have to design a strong consistency mechanism with multiple live nodes. This probably needs 2PC, semi-sync, or Paxos for data replication.
@haoel Hi, isn't an effective and efficient backup/restore mechanism the most needed thing?
Ordinary database HA could easily have been broken in an incident like this one: trying to set up pgpool, having to shut down many things, and finally removing some data.
I think gitlab.com urgently needs some reliably implemented/configured/tested tools.
My wishlist for the postmortem:
- Escalation process to 'all-hands' outage.
- Involve marketing in all-hands outage.
- Use Google Doc for all-hands outage.
- Publish Google Doc during all-hands outage (format of tweet message)
- Process for publishing live stream (using Zoom so everyone can join)
- Process for redacting the chat
- Use employee numbers in the Google Doc instead of full names or initials.
- Have a premade outage page.
- Answer the 5 whys https://en.wikipedia.org/wiki/5_Whys
- Also make backups that we can't delete easily https://cloud.google.com/storage-nearline/
- Every backup we make should be checked
- Make sure we have complete backups (database + Git repos)
- Make sure that the complete backup has multiple versions (1 hour, 6 hours, 24 hours, etc.)
- Automatically restore a complete backup at least every week
- Be able to restore the complete backup to a different cloud region
- Manually test restoring a complete backup at least every month (vary the time of the backup used and the cloud region)
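To make the "multiple versions" and "every backup should be checked" items a bit more concrete, here is a minimal sketch of a freshness check. It is only an illustration under stated assumptions: the name `missing_tiers`, the reading of the tiers as age windows (0 to 1 hour, 1 to 6 hours, 6 to 24 hours), and the idea that backup timestamps are available as a plain list are all hypothetical and not part of any existing GitLab tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumed tiers, taken from the wishlist example (1 hour, 6 hours, 24 hours).
TIERS = [timedelta(hours=1), timedelta(hours=6), timedelta(hours=24)]


def missing_tiers(backup_times, now, tiers=TIERS):
    """Return the tiers whose age window contains no backup.

    Each tier is read as a window: the first covers ages from 0 up to the
    tier, each later one covers ages from the previous tier up to its own.
    """
    missing = []
    lower = timedelta(0)
    for upper in sorted(tiers):
        # A backup satisfies the tier if its age falls inside the window.
        if not any(lower <= now - t <= upper for t in backup_times):
            missing.append(upper)
        lower = upper
    return missing


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Hypothetical backup timestamps: one 30 minutes old, one 5 hours old.
    backups = [now - timedelta(minutes=30), now - timedelta(hours=5)]
    for tier in missing_tiers(backups, now):
        print(f"no backup found in the window ending at {tier}")
```

A check like this only shows that backups of the expected ages exist; it says nothing about whether they can be restored, which is what the automatic and manual restore items above are meant to cover.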
@casdr Thanks! Now I'm not an overly experienced git user and don't want to make things more complicated, so pardon my potentially trivial question, please:
`git push origin master` gives me `Everything up-to-date` of course, for my git is unaware of the lost commit at gitlab. How can I make that lost commit be pushed to gitlab then?
@theZcuber No, not in terms of counts. But any web hooks created after Jan 31st 17:20 UTC and before the outage are lost.
To the others mentioning HA and what not, this is being worked on:
- infrastructure#259 (closed)
- gitlab-org/omnibus-gitlab!1251 (closed)
- infrastructure#1105 (closed)
- Using wal-e was also suggested, which we'll be looking into as well
Please keep discussions related to these topics in the above issues (and the ones mentioned in the issue body); that makes it easier to keep track of things.