Removing sensitive information from Elasticsearch
Sensitive information is sometimes accidentally pushed to Git repositories. Although there are various safeguards against this, none of them can succeed every time. Sensitive information is a broad category: it could include trade secrets, such as lab results from testing an experimental drug, or the personal information of a real person that was used to reproduce and fix a bug. Unlike passwords, this information can't be rotated, so it needs to be permanently removed from the repository.
Sensitive information can be removed from the Git repository using the approach described in https://gitlab.com/gitlab-org/gitlab-ce/issues/19376
Further details
In addition to the repository, there are other places where this data may also reside:
- cached diffs (https://gitlab.com/gitlab-org/gitlab-ce/issues/30093)
- Elasticsearch (this issue)
- CI pipelines (supported via API since GitLab 11.6, docs)
Proposal
Extending the approach taken in https://gitlab.com/gitlab-org/gitlab-ce/issues/19376, we need to make sure that other data derived from contaminated commits is also removed from Elasticsearch.
When an instance administrator uploads a list of bad SHAs to be removed, the sensitive data derived from those commits will also be removed from Elasticsearch.
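As a rough illustration of the proposal, the uploaded SHA list could be validated and turned into the body of an Elasticsearch delete-by-query request. This is only a sketch: the helper names and the `commit_sha` field are hypothetical, and the real index mapping may store the commit SHA under a different field or may not index it per document at all.

```python
import re

# Full-length commit SHAs only: 40 hex chars (SHA-1) or 64 (SHA-256).
SHA_PATTERN = re.compile(r"^(?:[0-9a-f]{40}|[0-9a-f]{64})$")


def normalize_shas(raw_lines):
    """Validate and de-duplicate an uploaded list of commit SHAs.

    Blank lines are skipped; anything that is not a full SHA raises
    ValueError so that a typo never silently skips a commit.
    """
    shas = []
    for line in raw_lines:
        sha = line.strip().lower()
        if not sha:
            continue
        if not SHA_PATTERN.match(sha):
            raise ValueError(f"not a full commit SHA: {line!r}")
        if sha not in shas:
            shas.append(sha)
    return shas


def build_delete_by_query(shas, sha_field="commit_sha"):
    """Build a request body for Elasticsearch's _delete_by_query endpoint.

    `sha_field` is an assumed field name, not taken from the actual
    index mapping.
    """
    return {"query": {"terms": {sha_field: list(shas)}}}


if __name__ == "__main__":
    uploaded = ["A" * 40 + "\n", "a" * 40, "", "b" * 40]
    print(build_delete_by_query(normalize_shas(uploaded)))
```

The resulting body would be sent as a POST to the `_delete_by_query` endpoint of the relevant index. Rejecting abbreviated SHAs is a deliberate choice in this sketch, since a short prefix could match more commits than the administrator intended.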