The Namespaces::InProductMarketingEmailsWorker is consistently crashing
**Update** Issue closed in favour of the epic: https://gitlab.com/groups/gitlab-org/-/epics/5594 From analysis in Sentry it appears the `Namespaces::InProductMarketingEmailsWorker` has been crashing almost every day since February 9th. The worker has been enabled since February 2nd. The only days the worker has not emitted an error is from 2-8 February, February 13th and March 3rd. A crashing worker doesn't necessarily mean no emails have been sent that day. It seems the worker is crashing on 4 different queries (see also the [Kibana logs](https://log.gprd.gitlab.net/app/discover#/?_g=h@bb5fdf8&_a=h@41f1a43) for the last 7 days here): https://sentry.gitlab.net/gitlab/gitlabcom/issues/2482770/events/?query=is%3Aunresolved%20InProductMarketingEmailsWorker (13 events) <details> <summary>SQL</summary> ```sql WITH RECURSIVE "base_and_ancestors" AS ( ( SELECT "namespaces".* FROM "namespaces" INNER JOIN "onboarding_progresses" ON "onboarding_progresses"."namespace_id" = "namespaces"."id" WHERE "namespaces"."type" = $ 1 AND "onboarding_progresses"."git_write_at" BETWEEN $ 2 AND $ 3 AND "onboarding_progresses"."pipeline_created_at" IS NULL AND "namespaces"."id" = $ 4 ) UNION ( SELECT "namespaces".* FROM "namespaces", "base_and_ancestors" INNER JOIN "onboarding_progresses" ON "onboarding_progresses"."namespace_id" = "namespaces"."id" WHERE "namespaces"."type" = $ 5 AND "onboarding_progresses"."git_write_at" BETWEEN $ 6 AND $ 7 AND "onboarding_progresses"."pipeline_created_at" IS NULL AND "namespaces"."id" = "base_and_ancestors"."parent_id" ) ) SELECT "namespaces".* FROM "base_and_ancestors" AS "namespaces" INNER JOIN "onboarding_progresses" ON "onboarding_progresses"."namespace_id" = "namespaces"."id" WHERE "namespaces"."parent_id" IS NULL LIMIT $ 8 ``` </details> https://sentry.gitlab.net/gitlab/gitlabcom/issues/2488025/events/?query=is%3Aunresolved%20InProductMarketingEmailsWorker (12 events) <details> <summary>SQL</summary> ```sql WITH RECURSIVE "base_and_ancestors" AS ( ( SELECT "namespaces".* FROM "namespaces" INNER JOIN "onboarding_progresses" ON "onboarding_progresses"."namespace_id" = "namespaces"."id" WHERE "namespaces"."type" = $ 1 AND "onboarding_progresses"."created_at" BETWEEN $ 2 AND $ 3 AND "onboarding_progresses"."git_write_at" IS NULL AND "namespaces"."id" = $ 4 ) UNION ( SELECT "namespaces".* FROM "namespaces", "base_and_ancestors" INNER JOIN "onboarding_progresses" ON "onboarding_progresses"."namespace_id" = "namespaces"."id" WHERE "namespaces"."type" = $ 5 AND "onboarding_progresses"."created_at" BETWEEN $ 6 AND $ 7 AND "onboarding_progresses"."git_write_at" IS NULL AND "namespaces"."id" = "base_and_ancestors"."parent_id" ) ) SELECT "namespaces".* FROM "base_and_ancestors" AS "namespaces" INNER JOIN "onboarding_progresses" ON "onboarding_progresses"."namespace_id" = "namespaces"."id" WHERE "namespaces"."parent_id" IS NULL LIMIT $ 8 ``` </details> https://sentry.gitlab.net/gitlab/gitlabcom/issues/2489971/events/?query=is%3Aunresolved%20InProductMarketingEmailsWorker (1 event) <details> <summary>SQL</summary> ```sql WITH RECURSIVE "base_and_ancestors" AS ( ( SELECT "namespaces".* FROM "namespaces" INNER JOIN "onboarding_progresses" ON "onboarding_progresses"."namespace_id" = "namespaces"."id" WHERE "namespaces"."type" = $ 1 AND "onboarding_progresses"."git_write_at" IS NOT NULL AND "onboarding_progresses"."pipeline_created_at" IS NOT NULL AND GREATEST( "onboarding_progresses"."git_write_at", "onboarding_progresses"."pipeline_created_at" ) BETWEEN $ 2 AND $ 3 AND "onboarding_progresses"."trial_started_at" IS NULL AND "namespaces"."id" = $ 4 ) UNION ( SELECT "namespaces".* FROM "namespaces", "base_and_ancestors" INNER JOIN "onboarding_progresses" ON "onboarding_progresses"."namespace_id" = "namespaces"."id" WHERE "namespaces"."type" = $ 5 AND "onboarding_progresses"."git_write_at" IS NOT NULL AND "onboarding_progresses"."pipeline_created_at" IS NOT NULL AND GREATEST( "onboarding_progresses"."git_write_at", "onboarding_progresses"."pipeline_created_at" ) BETWEEN $ 6 AND $ 7 AND "onboarding_progresses"."trial_started_at" IS NULL AND "namespaces"."id" = "base_and_ancestors"."parent_id" ) ) SELECT "namespaces".* FROM "base_and_ancestors" AS "namespaces" INNER JOIN "onboarding_progresses" ON "onboarding_progresses"."namespace_id" = "namespaces"."id" WHERE "namespaces"."parent_id" IS NULL LIMIT $ 8 ``` </details> https://sentry.gitlab.net/gitlab/gitlabcom/issues/2497046/events/?query=is%3Aunresolved%20InProductMarketingEmailsWorker (1 event) <details> <summary>SQL</summary> ```sql INSERT INTO "experiment_subjects" ( "experiment_id", "group_id", "created_at", "updated_at" ) VALUES ($ 1, $ 2, $ 3, $ 4) RETURNING "id" ``` </details> The last event could have been a race condition where a namespace was deleted before it could be recorded in `experiment_subjects`. We need to guard against such a failure. Currently it crashes the whole worker. In the other events, somehow an incorrect query is created. I could not find out how this is possible, since when running the worker locally, those queries are not executed. It seems somewhere the `root_namespace` is being queried in the `can_perform_action?` method before any of the groups have been materialized (so the ActiveRecord relation is expanded to include root_namespaces?) Besides fixing these wrong queries, in order to better guard against these kind of failures, it might be necessary to fire a separate worker for each namespace, so when a failure occurs in one namespace, at least emails will still be send for other namespaces. the only problem is currently we keep an in memory list of user_ids for which we have sent an email already, so as not to send the same user 2 emails on the same day. We need to find another solution for that problem.
issue