Improve visibility into EP tooling

Context

A lot of EP toolchains run inside CI jobs and schedules. Some prominent examples are triage, review app deployment, and QA build triggers. Our main channels for feedback have mostly been CI job success, Slack notifications, and dashboards.

Problem

How do we know what has gone wrong if one of the toolchains does not work as expected?

Currently we have the following to alert us (feel free to edit this issue if I missed anything):

Review app deployment:
- GCP dashboard, which gives us low-level infrastructure and resource statistics.
- Periscope dashboard, which is very high level and depends on many factors: GitLab data models, analytics database sync.
Triage ops:
- Slack notifications based on CI pipeline status
QA builds:
- Slack notifications based on CI pipeline status

A common pattern I find myself in is:

Encounter an unexpected result (e.g. a Slack notification on a failed job)
Investigate the job page and logs
Identify what went wrong through the logs

At this point, very often it's hit or miss, depending on how much information is available in the job logs.

I wonder if we are missing a layer of information that could help us better debug whenever any of these toolchains fails. What can we do to improve the toolchains in this aspect?
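
To make the "missing layer" idea more concrete, here is a minimal sketch of one possible direction, not a proposal for the final design: each toolchain step writes one structured JSON record to the job log, with the CI context attached, so a failed job always leaves the same baseline of information behind. The helper names (`build_event`, `logged_step`) and the toolchain and step labels are hypothetical; the only real identifiers are the GitLab CI predefined variables `CI_JOB_NAME`, `CI_JOB_URL`, `CI_PIPELINE_ID` and `CI_PIPELINE_SOURCE`. Python is used only for illustration.

```python
import json
import os
import sys
import traceback
from contextlib import contextmanager
from datetime import datetime, timezone

# GitLab CI predefined variables to attach to every record, when they are set.
CI_CONTEXT_VARS = ("CI_JOB_NAME", "CI_JOB_URL", "CI_PIPELINE_ID", "CI_PIPELINE_SOURCE")


def build_event(toolchain, step, status, env=None, **details):
    """Return one structured log record as a dict (hypothetical helper)."""
    env = os.environ if env is None else env
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "toolchain": toolchain,
        "step": step,
        "status": status,
    }
    # Only include CI variables that are actually present in the environment.
    event.update({name.lower(): env[name] for name in CI_CONTEXT_VARS if name in env})
    event.update(details)
    return event


@contextmanager
def logged_step(toolchain, step, stream=None, env=None):
    """Wrap a toolchain step and write one JSON line describing its outcome."""
    stream = sys.stdout if stream is None else stream
    try:
        yield
    except Exception as exc:
        event = build_event(
            toolchain, step, "failed", env=env,
            error_class=type(exc).__name__,
            error_message=str(exc),
            traceback=traceback.format_exc(),
        )
        stream.write(json.dumps(event) + "\n")
        raise  # Re-raise so the CI job still fails as it does today.
    else:
        stream.write(json.dumps(build_event(toolchain, step, "ok", env=env)) + "\n")


if __name__ == "__main__":
    # Hypothetical usage: the labels below are placeholders, not real job names.
    with logged_step("triage-ops", "apply_policy"):
        pass  # the actual step would run here
```

Because every record is a single JSON line, the same output could later be shipped to a log aggregator or searched directly in the job log without changing the toolchains again.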

I'm curious how everyone else in the team feels about this and where it falls in relative priority. /cc @gl-quality/eng-prod

I'm also aware this may be more of an epic, and we would need to identify smaller steps to iterate on. I'm interested to know what end goal would be both desirable and viable.

Edited Jan 20, 2020 by Albert Salim