Refresh staging DB regularly

We haven't refreshed staging in a really long time. This is making it increasingly hard to test query plans due to discrepancies in the data distribution on staging. For instance there are no CI artifacts newer than months so any plans involving non-expired artifacts find 0 records and run quickly. This may have been a contributing factor in failing to detect a bad SQL which caused a recent outage.

"If something hurts do it more often" -- let's plan to refresh staging once around kickoff each month. That way any proposed changes can be tested on a database that has an accurate db schema and recent data from production.

We could any of:

  • Create a new instance each month, do a full databases restore on it from S3/GCS and then activate that as the new staging db and throw away the old one.
  • Shut down the existing staging postgres instance, delete the database and do a restore, then bring it back up
  • Create an lvm snapshot of a clean restore before promoting the database and when doing a refresh throw away the snapshot, apply a month's worth of xlog from S3, then create a new snapshot and promote it.

The former is the most work but I think it's the best option. It would entirely replace the Restore Appreciation Day work and allow us to avoid doing that in addition to this. And it has the advantage that we can do this in parallel with using the old staging and have a rapid switchover. We could even run multiple staging environments without much additional effort -- possibly even one for each developer.

The main reason it's the most work is because it would require some Production team participation to create the new machine properly and switch over the role of the machine when it's ready.

Sanitizing the data in staging will be an issue. The existing staging has minimal santization -- I think just web-hooks and user emails are removed so we don't accidentally email users from staging. I hope we can proceed with improving the current refresh situation without letting this sanitization issue block us but we should plan down the road on how to sanitize this data better so we can allow developers more access to staging without giving access to private data.

Assignee Loading
Time tracking Loading