We had a partial outage of Snowplow event collection during the four-week period between September 14 and October 12, 2020. During this time, we saw a 27% decrease in the number of events collected, resulting in a roughly 27% decrease across performance indicators that use Snowplow as a data source.
The root cause of this partial outage was infrastructure-related: the Snowplow pipeline's enricher nodes ran out of disk space and were not set to autoscale alongside the Snowplow collector nodes. The infrastructure team has since resolved this by setting the enricher nodes to autoscale.
Did we change anything in our Snowplow implementation? @mpeychet_ sees a steep decrease in pageviews starting the last week of September.
It defers initDefaultTrackers into deferredInitialisation, which also includes trackPageView:
```javascript
export function initDefaultTrackers() {
  if (!Tracking.enabled()) return;

  window.snowplow('enableActivityTracking', 30, 30);
  window.snowplow('trackPageView'); // must be after enableActivityTracking

  if (window.snowplowOptions.formTracking) window.snowplow('enableFormTracking');
  if (window.snowplowOptions.linkClickTracking) window.snowplow('enableLinkClickTracking');

  Tracking.bindDocument();
  Tracking.trackLoadEvents();
}
```
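For context, the deferral itself looks roughly like the sketch below; the import path and the idle-callback scheduling are assumptions for illustration, not the actual wiring in GitLab's main.js.

```javascript
import { initDefaultTrackers } from '~/tracking'; // assumed import path

function deferredInitialisation() {
  // ... other deferred start-up work ...
  initDefaultTrackers(); // this is where window.snowplow('trackPageView') ends up firing
}

// Defer until the browser is idle, falling back to a timeout (assumed scheduling).
if ('requestIdleCallback' in window) {
  window.requestIdleCallback(deferredInitialisation);
} else {
  setTimeout(deferredInitialisation, 500);
}
```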
Page views are tracked using the trackPageView method. This is generally part of the first Snowplow tag to fire on a particular web page. As a result, the trackPageView method is usually deployed straight after the tag that also invokes the Snowplow JavaScript (sp.js), e.g. the ordering sketched below.
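A minimal sketch of that ordering, assuming the standard queue-based loader has already defined window.snowplow; the tracker namespace, collector endpoint, and appId are placeholders, not GitLab's actual configuration:

```javascript
// These calls sit directly after the sp.js loader tag, so the page view is
// tracked as soon as the page loads rather than being deferred.
window.snowplow('newTracker', 'sp', 'collector.example.com', { appId: 'example-app' });
window.snowplow('enableActivityTracking', 30, 10); // minimum visit length, heartbeat delay
window.snowplow('trackPageView');
```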
I wonder if window.snowplow('trackPageView') is indeed deferrable in general, or only in particular cases.
Another thing to check is whether the deferred Snowplow tracking is actually executed, and not skipped due to other errors in the JS chain or the order of calls relevant to Snowplow.
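One quick way to check that from the browser console (a debugging sketch, assuming the standard queue-based Snowplow loader has already defined window.snowplow):

```javascript
// Wrap window.snowplow so every call is logged; this shows whether
// 'trackPageView' is ever reached and in what order relative to
// 'enableActivityTracking', without changing the calls themselves.
const originalSnowplow = window.snowplow;

window.snowplow = function loggedSnowplow(...args) {
  console.log('[snowplow call]', args[0], args.slice(1));
  return originalSnowplow.apply(this, args);
};
```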
It looks like there was some sort of change in event counts with the processing pipeline in September.
@cmcfarland would you be able to chime in here on what exactly we're looking at? The Stream Records Age piece makes me think there's some sort of backlog?
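If "Stream Records Age" corresponds to the Kinesis iterator-age metric on the raw stream (an assumption here), a backlog would show up as a rising GetRecords.IteratorAgeMilliseconds. A sketch of pulling that metric with the AWS SDK for JavaScript; the region and stream name are placeholders:

```javascript
const AWS = require('aws-sdk');

const cloudwatch = new AWS.CloudWatch({ region: 'us-east-1' }); // placeholder region

// A steadily growing iterator age means records are piling up in the stream
// faster than the enrichers can read them.
cloudwatch
  .getMetricStatistics({
    Namespace: 'AWS/Kinesis',
    MetricName: 'GetRecords.IteratorAgeMilliseconds',
    Dimensions: [{ Name: 'StreamName', Value: 'snowplow-raw-good' }], // placeholder stream name
    StartTime: new Date(Date.now() - 24 * 60 * 60 * 1000), // last 24 hours
    EndTime: new Date(),
    Period: 3600, // one datapoint per hour
    Statistics: ['Maximum'],
  })
  .promise()
  .then(({ Datapoints }) => {
    Datapoints.sort((a, b) => a.Timestamp - b.Timestamp).forEach((dp) => {
      console.log(dp.Timestamp.toISOString(), `${Math.round(dp.Maximum / 1000)}s behind`);
    });
  })
  .catch(console.error);
```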
We're actually coming close to maxing out our Snowplow collectors each day. We might need to right-size this upper limit of 48 collectors.
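Raising that ceiling is a small change to the collector auto-scaling group's maximum size; a sketch with the AWS SDK for JavaScript (the group name, region, and new limit are placeholders, not the real values):

```javascript
const AWS = require('aws-sdk');

const autoscaling = new AWS.AutoScaling({ region: 'us-east-1' }); // placeholder region

// Lift the upper limit so daily peaks no longer pin the fleet at 48 instances.
autoscaling
  .updateAutoScalingGroup({
    AutoScalingGroupName: 'snowplow-collector-asg', // placeholder group name
    MaxSize: 64, // placeholder new ceiling
  })
  .promise()
  .then(() => console.log('Collector ASG max size updated'))
  .catch(console.error);
```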
Incoming raw good counts mean that the collectors are getting good packets and adding them to that queue. And it's keeping pace with the traffic growth.
Outgoing raw good is flat. That's bad. We might need to grow the enricher fleet to keep up with all this traffic. It might even need to become an auto-scaling group.
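Assuming "incoming raw good" and "outgoing raw good" map onto the Kinesis metrics IncomingRecords and GetRecords.Records on the raw stream (an assumption), the gap between what the collectors write and what the enrichers read can be pulled like this (region and stream name are placeholders):

```javascript
const AWS = require('aws-sdk');

const cloudwatch = new AWS.CloudWatch({ region: 'us-east-1' }); // placeholder region

// Sum one Kinesis metric on the raw stream over the last hour.
const hourlySum = (metricName) =>
  cloudwatch
    .getMetricStatistics({
      Namespace: 'AWS/Kinesis',
      MetricName: metricName,
      Dimensions: [{ Name: 'StreamName', Value: 'snowplow-raw-good' }], // placeholder stream name
      StartTime: new Date(Date.now() - 60 * 60 * 1000),
      EndTime: new Date(),
      Period: 3600,
      Statistics: ['Sum'],
    })
    .promise()
    .then(({ Datapoints }) => (Datapoints[0] ? Datapoints[0].Sum : 0));

// IncomingRecords ~ what the collectors write; GetRecords.Records ~ what the
// enrichers manage to read. Writes growing while reads stay flat means backlog.
Promise.all([hourlySum('IncomingRecords'), hourlySum('GetRecords.Records')]).then(
  ([written, read]) => console.log({ written, read, gapLastHour: written - read }),
);
```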
I think I should also just check on these nodes and make sure they are doing ok.
I'm curious if things are better now. I restarted all the enricher nodes. They were out of disk space for some reason. Our collectors get re-built all the time since they scale up and down to meet traffic demands. But the enrichers were just three nodes plugging away at the data in the queue.
I think we need to make an infra issue to make the enricher nodes auto-scale just like the collectors.
There is no real logging structure around the service. This could be fixed. But we could also just do what we do with the collectors and add an auto-scaling policy to scale enrichers in and out. This keeps a healthy turnover of nodes and also makes sure we're running the latest OS AMI and patches.
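A sketch of that kind of policy with the AWS SDK for JavaScript; the group name, target metric, and threshold below are assumptions for illustration, not the configuration that was actually applied:

```javascript
const AWS = require('aws-sdk');

const autoscaling = new AWS.AutoScaling({ region: 'us-east-1' }); // placeholder region

// Target-tracking policy: the group adds or removes enricher nodes to hold the
// chosen metric near the target, which also keeps node turnover healthy.
autoscaling
  .putScalingPolicy({
    AutoScalingGroupName: 'snowplow-enricher-asg', // placeholder group name
    PolicyName: 'enricher-target-tracking',
    PolicyType: 'TargetTrackingScaling',
    TargetTrackingConfiguration: {
      // CPU is used here for simplicity; the spiky network pattern noted below
      // may make a network-based or queue-depth metric a better fit.
      PredefinedMetricSpecification: { PredefinedMetricType: 'ASGAverageCPUUtilization' },
      TargetValue: 60.0, // placeholder target
    },
  })
  .promise()
  .then(() => console.log('Enricher scaling policy applied'))
  .catch(console.error);
```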
I've applied the changes. At a minimum, this will help offset future issues with old nodes and nodes running out of space. I'll watch it over the next few days to a week.
The auto-scaling for enrichers is working. The enricher nodes have a very spiky network pattern, so they don't scale as smoothly as the collectors, but it is working and there is plenty of headroom for growth.