Improve operational aspects of triage-ops

Usage of triage-serverless has been growing, and some processors are becoming more important, for example those that apply Engineering Allocation labels or trigger new pipelines. Each processor is also becoming more complex, with more external dependencies and more potential points of failure.

There are a few areas we can improve on:

1. Performance measurements

There are currently no measurements of triage-serverless performance, so we cannot identify the impact of adding a processor to the system. For example, how long does a new processor take to complete, and how much does it delay the other processors in the chain?
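As a rough illustration of the kind of measurement meant here, the sketch below times each processor in a chain and records how long it waited behind earlier ones. It is written in Python purely for illustration and is not triage-ops code: `run_chain`, the `(name, callable)` processor shape, and the `waited`/`took` field names are all assumptions.

```python
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical shape: a processor is a (name, callable) pair taking the event.
Processor = Tuple[str, Callable[[dict], None]]


def run_chain(event: dict, processors: List[Processor]) -> List[Dict[str, object]]:
    """Run processors in order and record per-processor timings in seconds."""
    timings: List[Dict[str, object]] = []
    chain_start = time.perf_counter()
    for name, process in processors:
        started = time.perf_counter()
        process(event)
        finished = time.perf_counter()
        timings.append({
            "processor": name,
            # Time spent waiting behind earlier processors in the chain.
            "waited": started - chain_start,
            # Time this processor itself took to complete.
            "took": finished - started,
        })
    return timings
```

Comparing `waited` for existing processors before and after a new processor is added would show how much delay the newcomer introduces, while `took` shows the cost of the new processor itself.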

2. Improving the retry mechanism

Some real-time processors may be more critical than others. Currently, any event that fails to be processed for any reason, such as an external network error, is dropped, so there is no guarantee that an event will ever be processed.
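One possible direction, sketched below in Python for illustration only, is to retry a failed event with exponential backoff and let more critical processors use a higher attempt limit. The function name `process_with_retry`, its parameters, and the default values are assumptions, not an existing triage-ops API.

```python
import time
from typing import Callable


def process_with_retry(
    event: dict,
    process: Callable[[dict], None],
    max_attempts: int = 3,      # assumed default; critical processors could use more
    base_delay: float = 0.5,    # assumed default, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to process an event, backing off exponentially between attempts.

    Returns True on success, or False once all attempts have failed, so the
    caller can record or requeue the event instead of silently dropping it.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            process(event)
            return True
        except Exception:
            # Broad catch for the sketch; real code would likely retry only
            # on transient errors such as network failures.
            if attempt == max_attempts:
                return False
            # Delay doubles on each failed attempt: base, 2*base, 4*base, ...
            sleep(base_delay * 2 ** (attempt - 1))
    return False
```

A `False` result gives the caller a hook to persist the event for later reprocessing, which is what would be needed to move toward a processing guarantee.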

Edited Sep 06, 2022 by Lin Jen-Shin