Snowplow endpoint certificate expired for a day
Problem
On 2023/07/04 at 00:00 UTC the certificate for http://snowplow.trx.gitlab.net/ expired. As a result, many clients stopped sending events to the collector and threw an error instead.
Detection
The incident was discovered by @vitallium, who saw the error in his devtools while working on the CustomerDot. An incident was opened at 20:14 UTC. By then the error had also started to be logged in Sentry.
Impact
Most clients sent no Snowplow events while the certificate was invalid. We are missing most events for the period from 2023/07/04 00:00 UTC to 2023/07/04 22:00 UTC. See screenshot from the CloudWatch dashboard:
Checklist
- Assigned severity tags based on this guidance
- Assigned to PM and EM of groupanalytics instrumentation
- Posted link to incident in g_analyze_analytics_instrumentation and tagged both PM and EM of the group
<---- TO BE FILLED BY ASSIGNEE / RESOLUTION DRI---->
Summary
An expired SSL certificate led to clients no longer sending events to Snowplow.
Root Cause
Expired SSL certificate. The certificate was potentially left to expire because Snowplow was marked as a dead service in https://gitlab.com/gitlab-com/runbooks/-/tree/master/docs/certificates#defunct-certificates-dead-hosts-no-longer-used-etc
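An expiry like this can in principle be caught ahead of time by a periodic check of the certificate's notAfter date. The following is a minimal sketch of such a check, not the monitoring actually used for this endpoint: the function names and the 30-day warning threshold are hypothetical, and it relies only on the Python standard library. Note that `fetch_not_after` uses a verifying TLS context, so it will raise an error once the certificate has already expired; it is meant for warning before that point.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the number of days between `now` and a certificate's notAfter.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    for example "Jul  4 00:00:00 2023 GMT". `now` must be timezone-aware.
    """
    expires_at = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires_at - now).total_seconds() / 86400


def needs_renewal(not_after: str, now: datetime, warn_days: float = 30) -> bool:
    """True when fewer than `warn_days` remain (hypothetical threshold)."""
    return days_until_expiry(not_after, now) < warn_days


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Fetch the notAfter field of the certificate served by host:port.

    Uses a verifying context, so an already expired or otherwise invalid
    certificate raises ssl.SSLCertVerificationError instead of returning.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `needs_renewal(fetch_not_after("snowplow.trx.gitlab.net"), datetime.now(timezone.utc))` and raise an alert when it returns True, treating a verification error as an alert as well.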
Resolution
The SSL certificate was promptly renewed by @stejacks-gitlab and @nduff as part of the incident handling in gitlab-com/gl-infra/production#15978 (closed), and Snowplow collection is back to normal.