The Namespaces::InProductMarketingEmailsWorker is consistently crashing
Update Issue closed in favour of the epic: &5594
From analysis in Sentry it appears the Namespaces::InProductMarketingEmailsWorker
has been crashing almost every day since February 9th. The worker has been enabled since February 2nd. The only days the worker has not emitted an error is from 2-8 February, February 13th and March 3rd. A crashing worker doesn't necessarily mean no emails have been sent that day.
It seems the worker is crashing on 4 different queries (see also the Kibana logs for the last 7 days here):
SQL
WITH RECURSIVE "base_and_ancestors" AS (
(
SELECT
"namespaces".*
FROM
"namespaces"
INNER JOIN "onboarding_progresses" ON "onboarding_progresses"."namespace_id" = "namespaces"."id"
WHERE
"namespaces"."type" = $ 1
AND "onboarding_progresses"."git_write_at" BETWEEN $ 2
AND $ 3
AND "onboarding_progresses"."pipeline_created_at" IS NULL
AND "namespaces"."id" = $ 4
)
UNION
(
SELECT
"namespaces".*
FROM
"namespaces",
"base_and_ancestors"
INNER JOIN "onboarding_progresses" ON "onboarding_progresses"."namespace_id" = "namespaces"."id"
WHERE
"namespaces"."type" = $ 5
AND "onboarding_progresses"."git_write_at" BETWEEN $ 6
AND $ 7
AND "onboarding_progresses"."pipeline_created_at" IS NULL
AND "namespaces"."id" = "base_and_ancestors"."parent_id"
)
)
SELECT
"namespaces".*
FROM
"base_and_ancestors" AS "namespaces"
INNER JOIN "onboarding_progresses" ON "onboarding_progresses"."namespace_id" = "namespaces"."id"
WHERE
"namespaces"."parent_id" IS NULL
LIMIT
$ 8
SQL
WITH RECURSIVE "base_and_ancestors" AS (
(
SELECT
"namespaces".*
FROM
"namespaces"
INNER JOIN "onboarding_progresses" ON "onboarding_progresses"."namespace_id" = "namespaces"."id"
WHERE
"namespaces"."type" = $ 1
AND "onboarding_progresses"."created_at" BETWEEN $ 2
AND $ 3
AND "onboarding_progresses"."git_write_at" IS NULL
AND "namespaces"."id" = $ 4
)
UNION
(
SELECT
"namespaces".*
FROM
"namespaces",
"base_and_ancestors"
INNER JOIN "onboarding_progresses" ON "onboarding_progresses"."namespace_id" = "namespaces"."id"
WHERE
"namespaces"."type" = $ 5
AND "onboarding_progresses"."created_at" BETWEEN $ 6
AND $ 7
AND "onboarding_progresses"."git_write_at" IS NULL
AND "namespaces"."id" = "base_and_ancestors"."parent_id"
)
)
SELECT
"namespaces".*
FROM
"base_and_ancestors" AS "namespaces"
INNER JOIN "onboarding_progresses" ON "onboarding_progresses"."namespace_id" = "namespaces"."id"
WHERE
"namespaces"."parent_id" IS NULL
LIMIT
$ 8
SQL
WITH RECURSIVE "base_and_ancestors" AS (
(
SELECT
"namespaces".*
FROM
"namespaces"
INNER JOIN "onboarding_progresses" ON "onboarding_progresses"."namespace_id" = "namespaces"."id"
WHERE
"namespaces"."type" = $ 1
AND "onboarding_progresses"."git_write_at" IS NOT NULL
AND "onboarding_progresses"."pipeline_created_at" IS NOT NULL
AND GREATEST(
"onboarding_progresses"."git_write_at",
"onboarding_progresses"."pipeline_created_at"
) BETWEEN $ 2
AND $ 3
AND "onboarding_progresses"."trial_started_at" IS NULL
AND "namespaces"."id" = $ 4
)
UNION
(
SELECT
"namespaces".*
FROM
"namespaces",
"base_and_ancestors"
INNER JOIN "onboarding_progresses" ON "onboarding_progresses"."namespace_id" = "namespaces"."id"
WHERE
"namespaces"."type" = $ 5
AND "onboarding_progresses"."git_write_at" IS NOT NULL
AND "onboarding_progresses"."pipeline_created_at" IS NOT NULL
AND GREATEST(
"onboarding_progresses"."git_write_at",
"onboarding_progresses"."pipeline_created_at"
) BETWEEN $ 6
AND $ 7
AND "onboarding_progresses"."trial_started_at" IS NULL
AND "namespaces"."id" = "base_and_ancestors"."parent_id"
)
)
SELECT
"namespaces".*
FROM
"base_and_ancestors" AS "namespaces"
INNER JOIN "onboarding_progresses" ON "onboarding_progresses"."namespace_id" = "namespaces"."id"
WHERE
"namespaces"."parent_id" IS NULL
LIMIT
$ 8
SQL
INSERT INTO
"experiment_subjects" (
"experiment_id",
"group_id",
"created_at",
"updated_at"
)
VALUES
($ 1, $ 2, $ 3, $ 4) RETURNING "id"
The last event could have been a race condition where a namespace was deleted before it could be recorded in experiment_subjects
. We need to guard against such a failure. Currently it crashes the whole worker.
In the other events, somehow an incorrect query is created. I could not find out how this is possible, since when running the worker locally, those queries are not executed. It seems somewhere the root_namespace
is being queried in the can_perform_action?
method before any of the groups have been materialized (so the ActiveRecord relation is expanded to include root_namespaces?)
Besides fixing these wrong queries, in order to better guard against these kind of failures, it might be necessary to fire a separate worker for each namespace, so when a failure occurs in one namespace, at least emails will still be send for other namespaces. the only problem is currently we keep an in memory list of user_ids for which we have sent an email already, so as not to send the same user 2 emails on the same day. We need to find another solution for that problem.