Snowplow endpoint certificate expired for a day

Problem

On 2023/07/04 at 00:00 UTC the certificate for http://snowplow.trx.gitlab.net/ became invalid. As a result, many clients stopped sending events to the collector and threw an error instead.

Detection

The incident was discovered by @vitallium, who noticed the error in his devtools while working on the CustomerDot. An incident was opened at 20:14 UTC. The error had also started to be logged in Sentry.

Impact

Most clients sent no Snowplow events while the certificate was invalid. We are missing most events for the period from 2023/07/04 00:00 UTC to 2023/07/04 22:00 UTC. See the screenshot from the CloudWatch dashboard:

Screenshot_2023-07-05_at_09.29.32

Checklist

  • Assigned severity tags based on this guidance
  • Assigned to PM and EM of the Analytics Instrumentation group
  • Posted link to incident in g_analyze_analytics_instrumentation and tagged both PM and EM of the group

<---- TO BE FILLED BY ASSIGNEE / RESOLUTION DRI---->

Summary

An expired SSL certificate led to clients no longer sending events to Snowplow.

Root Cause

Expired SSL certificate. It was potentially allowed to expire because Snowplow was marked as a dead service in https://gitlab.com/gitlab-com/runbooks/-/tree/master/docs/certificates#defunct-certificates-dead-hosts-no-longer-used-etc
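
To illustrate how an approaching expiry could be surfaced before it causes an outage, here is a minimal sketch of a certificate expiry check using only the Python standard library. It is not the monitoring GitLab actually runs. The function names, the 30-day warning threshold, and the port are assumptions made for illustration.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days from `now` until the certificate's notAfter time.

    `not_after` uses the format returned by getpeercert(), for example
    "Jul  4 00:00:00 2023 GMT". `now` must be timezone-aware.
    A negative result means the certificate has already expired.
    """
    expiry_epoch = ssl.cert_time_to_seconds(not_after)
    return (expiry_epoch - now.timestamp()) / 86400


def needs_renewal(not_after: str, now: datetime, warn_days: int = 30) -> bool:
    """True when fewer than `warn_days` days remain (hypothetical threshold)."""
    return days_until_expiry(not_after, now) < warn_days


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Read the notAfter field of the certificate served by host:port.

    The default context verifies the certificate, so the handshake raises
    ssl.SSLCertVerificationError if the certificate is already expired.
    A caller can treat that exception as an alert in its own right.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `needs_renewal(fetch_not_after("snowplow.trx.gitlab.net"), datetime.now(timezone.utc))` and raise an alert when it returns `True`. A check like this depends only on the host still serving traffic, not on how the host is classified in a certificate inventory.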

Resolution

The SSL certificate was promptly renewed by @stejacks-gitlab and @nduff as part of the incident handling in gitlab-com/gl-infra/production#15978 (closed), and Snowplow collection is back to normal.

Edited by Tanuja Jayarama Raju