We used to do this on a daily basis (see the thread in chat), but it was done with a one-off script. @northrup points out that "when we refactored the fleet we didn't restore that function to a worker as we split and separated them."
As far as I understand, the reason to do this is not that we need the data per se, but that various tests and deploys with reverts on staging can lead to hysteresis: the database ends up in a state that is no longer representative of the state from which the next deploy will be expected to start.
@ahanselka in your role as data durability enforcer, and @yorickpeterse in your role as database specialist, can either of you volunteer for this / take it on?
I'd like to hear what @northrup and @omame think about this. It seems like a good idea to combine two things into one. However, I think we would still want a special test host as well.
That's really important for me as a developer. Staging is the place where I can test the performance of SQL queries on a dataset that is at least similar to production. If this dataset is far behind the production database then the results are no longer representative. /cc @smcgivern
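To make that concrete, here is a minimal sketch of the kind of comparison I mean; the host names and the query are placeholders, not our actual setup:

```shell
# Hypothetical sketch: compare a query plan on staging vs. a production replica.
QUERY='EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM issues WHERE project_id = 13083 ORDER BY updated_at DESC LIMIT 20;'

psql -h staging-db.example.com   -U gitlab -d gitlabhq_production -c "$QUERY" > plan_staging.txt
psql -h prod-replica.example.com -U gitlab -d gitlabhq_production -c "$QUERY" > plan_production.txt

# With a stale staging dataset the row estimates and the chosen plan can differ
# wildly from production, which is exactly what makes the comparison useless.
diff plan_staging.txt plan_production.txt
```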
Some features assume recency, so they don't really make sense in staging as it is now. Anything to do with user activities is a good example here.
When a feature adds something to the DB, but will only do so when used, we won't get representative data in staging. We will get some data, but not all. An example here would be https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/12047 - once that is deployed, the table will be there, but staging will have an empty or nearly-empty table, while in production this table can get into millions of rows very quickly.
It seems like this is currently blocked on some other things that need to happen, so I'll just leave this at the statement of the main issues I have.
My concern about this is that, as has been mentioned numerous times before, staging has transactions and database additions that production doesn't have, because we deploy to staging first and then move to production. Refreshing staging from production on a regular schedule would put staging in an inconsistent state: newer code running against an older database.
The only time we can be certain that an automatic refresh of staging is safe is at 0:00 on the 23rd of the month.
@northrup isn't that only the case between a deploy to staging and the subsequent deploy to production? I don't think staging is ever ahead of production immediately following a production deploy.
My point being, though, that deploys are random and sporadic, and the only sure time I know to automate was the time given. I'm also going to posit that a monthly refresh of data to staging is sufficient to get a feel for transactional speed as well as data freshness. Any objections to refreshing once a month?
What we need to do is run the restore test we already perform once a month against the staging database. For that we just need to provision a new database server with Terraform and execute the restore (manually for now); a rough sketch of the flow follows the list.
This will kill two birds with one stone:
1. We will keep testing the restores as we do today, but those restores will also become productive.
2. We will get closer to automating the restore, because we will have a Chef role/recipe applied by Terraform in the initial bootstrap, which can easily evolve into restoring automagically.
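As a rough sketch of that flow, under the assumption of a dedicated Terraform module and Chef role for the restore target (all names here, `staging-db-restore`, `gitlab-staging-db`, and the restore script path, are hypothetical placeholders):

```shell
# 1. Provision the new database server. Terraform runs the Chef bootstrap in
#    its provisioning step, applying the hypothetical restore role.
cd environments/staging
terraform plan  -target=module.staging-db-restore
terraform apply -target=module.staging-db-restore

# 2. On the new host, run the restore manually for now; this is the part that
#    could later be folded into the Chef role itself.
ssh gitlab-staging-db 'sudo /usr/local/bin/restore-from-backup.sh'
```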
cc/ @ahanselka @ilyaf could you please review what we need to do to replace the current staging with a new one that has a fresh backup restored using WAL-E?
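For reference, the WAL-E side of such a restore is roughly the following; the envdir location, the data directory path, and the recovery target time are assumptions and would need to match our actual setup:

```shell
sudo systemctl stop postgresql

# Fetch the latest base backup into the (empty) data directory.
sudo -u postgres envdir /etc/wal-e.d/env \
  wal-e backup-fetch /var/lib/postgresql/9.6/main LATEST

# Point recovery at WAL-E for WAL replay, optionally up to a fixed target time.
sudo -u postgres tee /var/lib/postgresql/9.6/main/recovery.conf <<'EOF'
restore_command = 'envdir /etc/wal-e.d/env wal-e wal-fetch "%f" "%p"'
recovery_target_time = '2017-07-20 06:00:00'
EOF

sudo systemctl start postgresql   # replays WAL until the target time
```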
@omame I'd like to use this opportunity as a crash course in terraform-foo. I know the "restore" side of things from May B.A.D., and I'd like to automate that.
Woo-hoo, I won again! So much winning! /ducks and hides in the bushes
But seriously, yeah, the result from the last B.A.D. was just an update to the documentation, but it's still toil, and the next person who performs this will still spend 4-6 hours. I want to leave it in a much better state than that, while I still remember how I did it.
Plus, I need to get my Azure access restored; it got blocked or something :)
> @omame I'd like to use this opportunity as a crash course in terraform-foo. I know the "restore" side of things from May B.A.D., and I'd like to automate that.
I'd love to! It's also a good opportunity for me to learn how to restore the db.
I need some automatic criteria for what counts as a "restored" database. For now I just make sure Postgres starts after the WAL-E restore, and (given that I usually restore to 6:00:00) I run a few selects like `select last_edited_at from issues order by last_edited_at desc limit 1` and make sure that timestamp is something like 5:59:59.123. Similar checks for the secondary databases.
But the point is: aside from copying secrets and fiddling with GPG, everything is already in the form of crude shell scripts, which means we can have a fully automated check right after Vault. I can automatically restore the database to the desired time; then we have to figure out how to actually verify it.
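For example, a crude verification script along those lines could look like this; the target time, database name, and the five-minute threshold are all placeholders:

```shell
#!/usr/bin/env bash
# Hypothetical check for "is the restore good?": Postgres is up, and the
# newest rows are close to the recovery target time.
set -euo pipefail

TARGET='2017-07-20 06:00:00'
DB='gitlabhq_production'

# 1. The server accepts connections at all.
pg_isready -d "$DB"

# 2. The most recent activity in a busy table sits just before the target time.
LAST=$(psql -d "$DB" -At -c "SELECT max(last_edited_at) FROM issues;")
echo "recovery target:   $TARGET"
echo "latest issue edit: $LAST"

# Fail if the newest row is more than, say, 5 minutes older than the target.
psql -d "$DB" -At -c \
  "SELECT CASE WHEN timestamp '$TARGET' - max(last_edited_at) < interval '5 minutes'
               THEN 0 ELSE 1 END FROM issues;" | grep -qx 0 \
  || { echo 'restore looks stale'; exit 1; }
```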
@ernstvn I think it's fine, it's just blocked. We do need Vault (https://gitlab.com/gitlab-com/infrastructure/issues/1864), and after that we will need some automated criteria to check against, otherwise the restoration will still need a human to poke around in the DB. But if I'm missing some context here and we only need to restore the DB to staging without checking, then I'd say we only need Vault.
I'll mark it as blocked now, to make it explicit.
@djensen asked me today about refreshing staging, since it's now lagging GitLab.com by two years, or alternatively about getting access to run analytics on a current database (i.e. what the data team has). @dawsmith do you know which team might own refreshing staging?
@dawsmith yep, it seems like #6572 (closed) is the long-term solution. But there hasn't been much activity on it, so GitLab is still performing QA in an environment that is insufficient in many ways. For example, I'm with ~"group::analytics". We are consistently unable to QA things properly in staging, and we're definitely introducing more bugs than we should.
As a more immediate improvement, what about automating the refresh of only gitlab-org to staging? That wouldn't require the same scale of sanitization or processing as what #6572 (closed) proposes. And it would make the staging environment much more powerful for QA purposes. Would that be viable?
We have been talking about this on and off in Datastores for quite a while already. The main reason is that the work to achieve this ("Automate refresh of database to staging") seems pretty significant, and the reliability teams have been overloaded with other priorities lately.
To me, it's unlikely that this will be prioritized over the many other things waiting in our queue. However, I'll try to get more input from Infrastructure on this and update the issue.