Implement Audit Mechanism and Distributed Tracing for Message Flow in Package Metadata DM
A recent alert during npm testing with deps.dev showed that we need better traceability and monitoring of message flow across services. Between 6:25 PM and 6:30 PM UTC+2, the number of Interfacer containers and queue messages dropped significantly, and logs showed requests being aborted because no instances were available. The incident points to potential processing issues and to gaps in our current monitoring setup.
Problem
- The high volume of messages generated during testing led to processing issues.
- We lack clear traceability and metrics for the flow and processing status of messages.
- Current monitoring does not show whether messages are being dropped.
Proposed Solution
- Audit Mechanism: Each run of the feeder creates a UUID and includes it in every message it emits, so that messages can be tracked across services.
  - Use Prometheus counters to collect message counts from the feeder, interfacer, and processor services.
  - Associate these counters with the run UUID so that per-service counts can be compared and any shortfall detected.
- Distributed Tracing: Integrate a distributed tracing tool such as Jaeger to trace the flow of messages through the microservices.
  - Instrument services with Jaeger client libraries to capture and report traces.
  - Ensure tracing context is propagated across all services.
  - Use Jaeger's UI to visualize and analyze traces and to identify bottlenecks and failures.
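The audit mechanism above can be sketched as follows. This is a minimal, stdlib-only illustration rather than the services' actual code: the function names (`feeder_run`, `interfacer_handle`, `processor_handle`, `audit_report`), the message shape, and the in-memory `Counter` standing in for Prometheus counters are all assumptions. In a real deployment each service would export its own counter and the comparison would happen in a query or dashboard. Using the run UUID as a Prometheus label creates a new time series per run, so that choice may need revisiting if runs are frequent.

```python
import uuid
from collections import Counter

# In-memory stand-in for Prometheus counters, keyed by (service, run_id).
# All names here are hypothetical and chosen for illustration only.
message_counts = Counter()

SERVICES = ("feeder", "interfacer", "processor")


def feeder_run(package_names):
    """Start one feeder run: mint a run UUID and stamp it on every message."""
    run_id = str(uuid.uuid4())
    messages = []
    for name in package_names:
        messages.append({"run_id": run_id, "package": name})
        message_counts[("feeder", run_id)] += 1
    return run_id, messages


def interfacer_handle(message, available=True):
    """Count the message if it was handled; return None if it was dropped."""
    if not available:
        # Simulates a request aborted because no instance was available.
        return None
    message_counts[("interfacer", message["run_id"])] += 1
    return message


def processor_handle(message):
    """Count a message that reached the processor."""
    message_counts[("processor", message["run_id"])] += 1


def audit_report(run_id):
    """Compare per-service counts for one run and report the shortfall."""
    counts = {service: message_counts[(service, run_id)] for service in SERVICES}
    counts["missing"] = counts["feeder"] - counts["processor"]
    return counts


if __name__ == "__main__":
    run_id, messages = feeder_run(["left-pad", "lodash", "express"])
    for index, msg in enumerate(messages):
        # Drop the second message to show how the audit surfaces a loss.
        forwarded = interfacer_handle(msg, available=(index != 1))
        if forwarded is not None:
            processor_handle(forwarded)
    print(audit_report(run_id))
```

A non-zero `missing` value for a run indicates that some messages emitted by the feeder never reached the processor, which is the question current monitoring cannot answer.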
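Context propagation, listed above, amounts to carrying the same trace ID in every message so that each service's span attaches to one trace. The sketch below assumes the W3C Trace Context `traceparent` format (`version-traceid-parentid-flags`) carried in a hypothetical `headers` field on each queue message. The helper names are illustrative, and a real setup would most likely rely on the tracing library's own inject and extract functions instead of hand-rolled ones.

```python
import secrets


def new_traceparent():
    """Start a new trace: 16-byte trace ID, 8-byte span ID, sampled flag set."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def parse_traceparent(value):
    """Split a traceparent value into its four fields, with basic validation."""
    version, trace_id, span_id, flags = value.split("-")
    if len(trace_id) != 32 or len(span_id) != 16:
        raise ValueError(f"malformed traceparent: {value!r}")
    return {"version": version, "trace_id": trace_id, "span_id": span_id, "flags": flags}


def child_traceparent(parent_value):
    """Keep the trace ID and mint a new span ID for the current service's span."""
    parent = parse_traceparent(parent_value)
    return (
        f"{parent['version']}-{parent['trace_id']}-"
        f"{secrets.token_hex(8)}-{parent['flags']}"
    )


def forward(message):
    """Hypothetical hop: copy the message and replace its header with a child context."""
    headers = dict(message["headers"])
    headers["traceparent"] = child_traceparent(headers["traceparent"])
    return {**message, "headers": headers}


if __name__ == "__main__":
    at_feeder = {"package": "left-pad", "headers": {"traceparent": new_traceparent()}}
    at_interfacer = forward(at_feeder)
    at_processor = forward(at_interfacer)
    for stage, msg in (
        ("feeder", at_feeder),
        ("interfacer", at_interfacer),
        ("processor", at_processor),
    ):
        print(stage, msg["headers"]["traceparent"])
```

All three printed values share the same trace ID and differ only in span ID. A hop that fails to copy the header would start a fresh trace and break the end-to-end view in the tracing UI.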
Implementation
TBC
Edited by Philip Cunningham