Documentation: Delay in SnowPlow Enrichers

Summary

This is a documentation issue to add an FAQ section or page to the PI handbook that answers the question: what should we do when a "Delay in SnowPlow Enrichers" alert fires? Please see this thread for more context.

Historical Context

The information below is carried over for reference purposes.

We got multiple alerts for "SnowPlow Raw Good Stream Backing Up". This happened because our Snowplow Enrichers didn't scale well enough for the volume of Snowplow events. We didn't lose data, but we would if the delay exceeded 48 hours. For more information on what happened, please see:

Discussion

In this issue I'd like to discuss:

1. What we should do in case of an alert

We don't have any documentation on what to do when this alert fires. My suggestion:

The alert should have been firing for at least X hours before we react. (Please suggest a sensible X.)
The delay should be longer than Y minutes. (Please suggest a sensible Y.) Note: We should also state that the alert reports the delay in ms, as 300000 ms (5 minutes) is an irritating number to read, at least to me.
If the alert has been firing for X hours and the delay is longer than Y minutes, ping the SRE on-call with the following template:
```
Hey Team, .... 
```
Add the steps described above to the Snowplow Troubleshooting page that Alina recently created.
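
To make the proposed rule concrete, here is a minimal sketch of the check in Python. The function and parameter names are made up for illustration, and X and Y are deliberately left as inputs because they have not been decided yet. It only assumes that the alert reports the delay in milliseconds, as noted above.

```python
MS_PER_MINUTE = 60_000


def delay_minutes(delay_ms: float) -> float:
    """Convert the alert's millisecond delay into minutes for readability."""
    return delay_ms / MS_PER_MINUTE


def should_ping_sre(alert_hours: float, delay_ms: float,
                    x_hours: float, y_minutes: float) -> bool:
    """Return True when both proposed conditions hold.

    alert_hours: how long the alert has been firing.
    x_hours, y_minutes: the thresholds X and Y, still to be agreed on.
    """
    firing_long_enough = alert_hours >= x_hours          # "at least X hours"
    delay_too_long = delay_minutes(delay_ms) > y_minutes  # "longer than Y minutes"
    return firing_long_enough and delay_too_long
```

For example, `delay_minutes(300_000)` returns `5.0`, which is much easier to reason about than the raw millisecond value.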

Happy to see other suggestions, or a proposal for the text we should send to the SREs.

2. Adjusting the alert

We got delays of ~5 minutes but don't know the root cause. The speculation from @cmcfarland is:

I'm pretty sure this is a problem where we don't have enough shards relative to collectors. I suspect that if we see a lot of activity from a certain IP range, it clobbers a shard in kinesis since I think we use source IP as a hash for which shards get used.
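
To illustrate the speculation above (this is not a confirmed root cause), here is a rough Python sketch of how a hot shard could arise. Kinesis maps each record's partition key to a shard by taking the MD5 hash of the key as a 128-bit integer and matching it against each shard's hash key range. The sketch assumes the source IP is used as the partition key, which is unverified, and that the shards split the hash space evenly. The helper names are hypothetical.

```python
import hashlib
from collections import Counter

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_for(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    return hash_value * shard_count // HASH_SPACE


def shard_load(partition_keys, shard_count: int) -> Counter:
    """Count how many events land on each shard."""
    return Counter(shard_for(key, shard_count) for key in partition_keys)
```

Because the mapping is deterministic, every event with the same key lands on the same shard. A burst of traffic from one source IP would therefore load a single shard no matter how many shards exist, which matches the behavior described in the quote.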

My question is: Is it worth investigating when we get a 5-minute delay? I know there shouldn't be a delay at all, but currently this is the reality, and maybe we're okay with a delay of, say, 20 minutes (a made-up number)? Especially since our maximum delay before we lose data is 48 hours.
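
To put those numbers in proportion, here is a small Python sketch that expresses a delay as a fraction of the 48-hour window. The function name is hypothetical, and the 20-minute figure remains a made-up number.

```python
MS_PER_HOUR = 3_600_000
RETENTION_HOURS = 48  # maximum delay before data is lost, per this issue


def retention_used(delay_ms: float, retention_hours: float = RETENTION_HOURS) -> float:
    """Return the delay as a fraction of the window before data loss."""
    return delay_ms / (retention_hours * MS_PER_HOUR)
```

A 5-minute delay (300000 ms) uses roughly 0.17% of the window, and a 20-minute delay roughly 0.69%, so both leave a large margin before any data would be lost.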

Edited Mar 18, 2022 by Amanda Rueda