Disable Integrated Error Tracking by Default
# Problem

[Integrated Error Tracking](https://docs.gitlab.com/ee/operations/error_tracking.html#integrated-error-tracking) was introduced to GitLab in %"14.4" as a lightweight alternative to the Sentry backend. As of a month ago, there were already 24+ million entries in the `error_tracking_error_events` table for SaaS. As we've seen from our own Sentry instance, it's very easy to generate lots of events in a short time, especially if something is wrong.

Right now, if you exclude CI-related tables, the [metrics show](https://thanos-query.ops.gitlab.net/graph?g0.expr=sum%20by%20(relname)%20(rate(pg_stat_user_tables_n_tup_ins%7Benv%3D%22gprd%22%2C%20relname%3D%22error_tracking_error_events%22%7D%5B1m%5D))&g0.tab=0&g0.stacked=0&g0.range_input=1w&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D&g1.expr=topk(20%2C%20sum%20by%20(relname)%20(increase(pg_stat_user_tables_n_tup_ins%7Benv%3D%22gprd%22%2C%20relname%20!\~%22ci_.\*%22%7D%5B24h%5D)))&g1.tab=1&g1.stacked=0&g1.range_input=1d&g1.max_source_resolution=0s&g1.deduplicate=1&g1.partial_response=0&g1.store_matches=%5B%5D) that `error_tracking_error_events` ranks 16th among the top 20 tables by number of inserts over a 24-hour period.

Our integrated error tracking solution currently uses the Postgres database. This has already caused incidents ([example](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6295)) and puts GitLab.com at risk.

# Proposal

`Add a feature flag for integrated error tracking until error tracking is prioritized by product and is production ready.`

**Proposal [Approved](https://gitlab.com/groups/gitlab-org/-/epics/7580#note_853107973) on 2/23**

# Why isn't Error Tracking a Product Priority?

An Opportunity Canvas was completed in late December 2021 for [Error Tracking](https://gitlab.com/gitlab-org/gitlab/-/issues/345058). It was determined that we should revisit Error Tracking in mid FY23.
Given the timeline of the integration milestones for OpsTrace and GitLab, and given that Reliability is one of the Top 11 projects for FY23, it is recommended that the Monitor:Respond engineering team continue to work on Incident Management at a minimum through Q2FY23. The strategy for Incident Management is to mature the related categories and enable dogfooding by our SRE team. GitLab Incident Management will allow our internal teams to consolidate tools, move away from homegrown tools, and better scale their incident management process.

# Alternative Options We've Considered

### Alt Option 1

`Implement safety guards around Integrated Error Tracking.`

This would minimize the risk of large amounts of data coming in. It could include the following issues:

- https://gitlab.com/gitlab-org/gitlab/-/issues/352616+
- https://gitlab.com/gitlab-org/gitlab/-/issues/345255+
- https://gitlab.com/gitlab-org/gitlab/-/issues/350399+

**Why we aren't choosing this option:** Rate limiting or minimizing payload size wouldn't give users the functionality they need for error tracking. To identify and remediate errors, users need a complete picture (all data) to understand what is going on.

### Alt Option 2

`Migrate Integrated Error Tracking from the postgres database to a Clickhouse database.`

**Why we aren't choosing this option:** This can be a natural follow-up after we disable the integrated error tracking. We need to prioritize the removal because of the risk Integrated Error Tracking currently poses to GitLab.com. Running a Clickhouse instance in production will require a considerable investment in learning and scaling. We are also already working to implement [Clickhouse with Opstrace](https://gitlab.com/gitlab-org/gitlab/-/issues/350399#note_840490840), and we don't want to start another similar stream of work in parallel.
### Alt Option 3

`Remove Integrated Error Tracking code`

**Why we aren't choosing this option:** Once we have a better grasp on how to best implement an alternative database, Integrated Error Tracking becomes feasible again. This is something that is [actively being worked on](https://gitlab.com/groups/gitlab-org/opstrace/-/epics/10) by the ~"group::observability" and we don't want to duplicate efforts.

# Internal Communication

Discuss and receive approval from:

* [ ] CEO @sytses
* [x] VP of Product @david
* [x] The Product Director relevant to the affected Section(s) @kencjohnston
* [x] The Engineering Director relevant to the affected Section(s) @sgoldstein
* [x] Director of Product Design @vkarnes
* [x] Group Product Manager @kbychu
* [x] Engineering Manager @crystalpoole
* [x] Product Manager @abellucci

The following people need to be informed:

* [x] Vice President of Development @clefelhocz1
* [x] Vice President of Infrastructure @sloyd
* [x] Vice President of Quality @meks
* [x] Vice President of User Experience @clenneville
* [x] The Product Marketing Manager relevant to the stage group(s) @supadhyaya
* [x] Senior Manager, Technical Writing @susantacker

### After Approvals

* [ ] Mention the [product group Technical Writer](https://about.gitlab.com/handbook/engineering/ux/technical-writing/#designated-technical-writers) to update the [documentation metadata](https://docs.gitlab.com/ee/development/documentation/#stage-and-group-metadata)
* [ ] Share MR in #product, #development, and relevant #s_, #g_, and #f_ Slack channels

# External Communication

**Considerations:** As of mid-February 2022, there are [~313 projects](https://gitlab.slack.com/archives/C0341JVB0H0/p1645479871774859) that have submitted errors to our integrated tracking backend. While this isn't huge, it isn't insignificant. We'll need to let users know that we are turning this feature off.
It's important to be transparent, but we shouldn't announce the vulnerabilities this feature currently has to GitLab.

## Customer Lists

In order to communicate directly with impacted customers, we will need:

- Namespaces that have submitted errors to the integrated error tracking backend on GitLab.com (by Tier)
  - Including Namespace, Project, Project Owner Email, Namespace Owner Email, Namespace Tier
  - [By Tier List](https://docs.google.com/spreadsheets/d/1KX_mue1-FlDnOmFsw-bLxLswo8WJqsoVtnMSr_8NLao/edit#gid=2134253232)
- Instances that have projects submitting errors to the integrated error tracking backend (by Tier)
  - Including Group, Project, Project Owner Email, Instance Admin Email, Instance Tier
  - [Current customer list](https://docs.google.com/spreadsheets/d/13SOMekG63sIR3Vgn9tmv9DREAL5rsSwt2oNm8hjQPZo/edit#gid=1280016283)

## Communication To Rate-Limit Impacted Users

Proposed communication for users impacted by the recent [change to severely rate-limit](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6399) the feature.

> From: `<incident-response@gitlab.com>`
> Subject: Important Update: Information for customers using Integrated Error Tracking Feature on GitLab.com
>
> Hello GitLab User,
> We've identified your project(s) as utilizing GitLab's recently released integrated error tracking backend that replaces Sentry. This feature is causing database performance issues. To address these issues, we significantly rate-limited this feature to 1 request per hour per IP. With that rate limit in place, we recommend against using it.
>
> While we explore future development of this feature, you can consider switching to the Sentry backend by changing your error tracking to Sentry in your project settings.
>
> We're working to reduce the conditions that can cause these issues and will provide updates in this epic.
> If you are using integrated Error Tracking and require additional guidance, please ask questions in the above epic or reach out to support if you have a paid subscription.
>
> Affected Projects: <project Name>
>
> Kind regards,
> {Sender}

## Additional Comms as we Disable

- [x] Create an `extra` for the Release Post (extras appear beneath the list of Deprecations and Removals to over-communicate on the change), https://gitlab.com/gitlab-com/www-gitlab-com/-/merge_requests/100258+
- [x] A `removal` entry for 14.9, https://gitlab.com/gitlab-org/gitlab/-/merge_requests/82621+
- [x] Create an issue for customers to provide feedback that links to the `removal` entry and the `extra`, https://gitlab.com/gitlab-org/gitlab/-/issues/355493+

# Next Steps

1. Alignment with internal teams and leadership
1. Create a technical breakdown issue of what's involved to remove this issue