Postmortem of BigQuery data loss

This is a postmortem of the data loss we experienced on the 19th of August.

Cause of the issue: During an earlier refactoring, we overlooked a small adjustment to the config file that changed where `bq_write_disposition` should be defined. Because this field was no longer read correctly, all scheduled pipelines defaulted to `WRITE_TRUNCATE`, replacing the existing tables with the new data instead of appending to them.
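For context, a minimal sketch of the kind of config change involved. This is illustrative only: the key names and layout are assumptions, not our actual config, and it assumes the refactor moved the field from a top-level key into a nested section while some pipelines kept reading the old location.

```yaml
# Hypothetical config fragment (names are placeholders).
# Old location - no longer read after the refactor:
bq_write_disposition: WRITE_APPEND

# New location expected after the refactor:
pipeline:
  bq_write_disposition: WRITE_APPEND

# Pipelines that missed the field fell back to the library default
# behaviour, effectively running with WRITE_TRUNCATE and
# overwriting the target tables.
```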

Resolution

After discovering the issue, we:

  • Opened a coordination issue
  • Restored the tables to an earlier state using the `bq cp` command in Cloud Shell and BigQuery "time travel"
  • Updated the config files with the correct write disposition
  • Informed the other teams about the data loss
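The restore step above relies on BigQuery time travel, which lets `bq cp` copy a table as it existed at an earlier point via a snapshot decorator. A sketch of the commands (dataset and table names are placeholders, not our actual tables):

```shell
# Copy the table as it existed 4 hours ago into a new table.
# The decorator is a relative offset in milliseconds.
bq cp my_dataset.evaluation_results@-14400000 my_dataset.evaluation_results_restored

# Alternatively, use an absolute timestamp (milliseconds since epoch)
# and overwrite the damaged table in place with -f:
bq cp -f my_dataset.evaluation_results@1629300000000 my_dataset.evaluation_results
```

Note that time travel only covers a limited retention window (seven days by default), so a restore like this has to happen soon after the incident.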

Rough timeline of the events

  • 13/08: We merged the BQ write change
  • 19/08 10:17 CEST: A new tag was created that included the merged code
  • 19/08 14:03 CEST: The daily run pipelines started kicking off, but these all failed due to another change (not relevant to this issue)
  • 19/08 17:00 CEST: After fixing the issue above, new pipelines were started
  • 19/08 20:26 CEST: While verifying that the pipelines had finished running after the fix, we manually checked the tables in BigQuery and first discovered the issue. One pipeline was cancelled in time, before it could write to BigQuery.
  • 19/08 21:00 CEST: The BigQuery tables were restored to an earlier state, losing the evaluation data for the 19th.
  • 20/08 03:57 CEST: A code change to update the config files was merged.

@tle_gitlab @HongtaoYang @srayner @m_gill Please add your thoughts to the comments below as well as fix/update anything in the description you see fit.

Edited by Andras Herczeg