Incubation:APM System Design
GitLab has acquired Opstrace: https://venturebeat.com/2021/12/14/gitlab-acquires-open-source-observability-distribution-opstrace/
The APM SEG has been folded into the opstrace project here - https://gitlab.com/gitlab-org/opstrace. We are continuing to build on ClickHouse for traces and logs.
High level design for the Monitor APM product (see https://about.gitlab.com/handbook/engineering/incubation/monitor-apm/ for more details).
The general goal here is to support a single-agent solution (DataDog initially) so users can observe infrastructure and applications in a broad range of environments (not just k8s), with visualization and, at some point, alerting/SLO management.
This is a SaaS-first solution. We'll be leveraging cloud services where appropriate to ship faster. We'll need to reconsider some of these components if the solution needs to be distributed in Omnibus.
Initial proposed architecture:
graph TB
dd-agent[DataDog agents] -- o11y data & auth --> gateway[Gateway Service]
other-agent[Other compatible agents] -- o11y data --> gateway
apiclients[API clients] --> gateway
users[Visualization Users] --> grafana[Grafana]
subgraph Google Cloud
gateway -- o11y data --> pubsub((Google PubSub))
pubsub --> ingest[Ingestion Workers]
pubsub --> enrich[Enrich/<br>Validate/<br>Limit] --> pubsub
ingest --> clickhouse[("ClickHouse Cluster<br>*may need Zookeeper*")]
gateway -- Token auth --> gitlabcom((GitLab.com))
grafana -- queries --> query[Query Service]
grafana -- OAuth --> gitlabcom
gitlabcom -- embedded views --> grafana
grafana -- user/group settings --> postgres[(PostgreSQL)]
query --> clickhouse
end
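To make the data flow in the diagram concrete, here is a minimal in-memory sketch of the gateway -> enrich/validate -> ingest path. Go channels stand in for the PubSub topics, a map stands in for the GitLab.com token lookup, and a slice stands in for ClickHouse. The `Message` type and the `enrich`, `ingest`, and `runPipeline` names are all hypothetical; nothing here is an actual PubSub or ClickHouse client call.

```go
package main

import (
	"fmt"
	"strings"
)

// Message is a hypothetical envelope for one batch of observability data.
type Message struct {
	Token     string // credential presented by the agent
	ProjectID string // filled in by the enrich stage
	Payload   string
}

// enrich drops invalid messages and attaches a project ID to valid ones.
// lookup is a stand-in for resolving a token against GitLab.com.
func enrich(raw <-chan Message, lookup map[string]string) <-chan Message {
	out := make(chan Message)
	go func() {
		defer close(out)
		for m := range raw {
			project, ok := lookup[m.Token]
			if !ok || strings.TrimSpace(m.Payload) == "" {
				continue // validation failed: drop the message
			}
			m.ProjectID = project
			out <- m
		}
	}()
	return out
}

// ingest drains enriched messages into rows. A real ingestion worker
// would batch these into ClickHouse inserts instead.
func ingest(enriched <-chan Message) []string {
	var rows []string
	for m := range enriched {
		rows = append(rows, m.ProjectID+":"+m.Payload)
	}
	return rows
}

// runPipeline plays the gateway role: it publishes msgs to the raw
// "topic" and returns whatever reaches the store.
func runPipeline(msgs []Message, lookup map[string]string) []string {
	raw := make(chan Message)
	go func() {
		defer close(raw)
		for _, m := range msgs {
			raw <- m
		}
	}()
	return ingest(enrich(raw, lookup))
}

func main() {
	rows := runPipeline(
		[]Message{
			{Token: "t1", Payload: "cpu=0.5"},
			{Token: "unknown", Payload: "cpu=0.9"},
		},
		map[string]string{"t1": "project-42"},
	)
	fmt.Println(rows) // only the message with a known token survives
}
```

Keeping each stage behind a queue means the enrich and ingest workers could scale independently of the gateway, which is the main reason for putting PubSub in the middle.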
Notes:
- Services I create will be written in Go. I have some Ruby experience but have not worked with Rails at all. While I would be more than happy to learn it, I don't think it would be as productive. The size of messages (DataDog supports 3.2MB per message) and their frequency may favor Go in the long run anyway.
- DataDog is being evaluated - #3 (closed)
- ClickHouse is being evaluated as a store for all observability data - #4 (closed)
- Google PubSub has been suggested as a solution for handling ingest/enrichment - https://gitlab.com/gitlab-org/gitlab/-/issues/338454#note_652755663
- We may want to tie data to a specific project/environment and validate this through the API.
- We may support basic PromQL (building up as we go) to allow querying from Grafana and long term ease of migration for dogfooding.
- We shouldn't allow Grafana to talk to ClickHouse directly: users shouldn't see the database implementation, and direct access would also be a security concern.
- I'm not thinking too much about alerts etc. at this point.
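As an illustration of the message size point in the notes, the gateway could cap request bodies before anything is published. This is a rough sketch using only the Go standard library; the `maxBodyBytes` value (3.2MB read as 3,200,000 bytes), the `/intake` path, and the `publish` callback are assumptions for the example, not part of the design.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// maxBodyBytes mirrors the 3.2MB figure from the notes.
// The exact byte count is an assumption.
const maxBodyBytes = 3_200_000

// newIntakeHandler rejects bodies larger than limit and hands accepted
// bodies to publish, a stand-in for a PubSub publish call.
func newIntakeHandler(limit int64, publish func([]byte) error) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// MaxBytesReader makes reads fail once the limit is exceeded.
		r.Body = http.MaxBytesReader(w, r.Body, limit)
		body, err := io.ReadAll(r.Body)
		if err != nil {
			var tooLarge *http.MaxBytesError
			if errors.As(err, &tooLarge) {
				http.Error(w, "payload too large", http.StatusRequestEntityTooLarge)
				return
			}
			http.Error(w, "cannot read body", http.StatusBadRequest)
			return
		}
		if err := publish(body); err != nil {
			http.Error(w, "publish failed", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusAccepted)
	})
}

// statusFor posts body to a handler with the given limit and returns
// the resulting HTTP status code.
func statusFor(limit int64, body string) int {
	h := newIntakeHandler(limit, func([]byte) error { return nil })
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/intake", strings.NewReader(body))
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	payload := `{"series":[]}`
	fmt.Println(statusFor(maxBodyBytes, payload)) // small payload is accepted
	fmt.Println(statusFor(8, payload))            // same payload exceeds a tiny limit
}
```

Note that `http.MaxBytesError` requires Go 1.19 or later.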
Assumptions:
- Apart from PubSub we'll target Kubernetes for all other components.
- The DataDog agent uses API key auth, so we can use GitLab access tokens for auth and some data enrichment.
- Grafana can be provisioned for users/projects/environments somehow.
- Grafana GitLab OAuth will be suitable to allow embedding in GitLab.com.
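The API key assumption above could look something like the following gateway middleware. It is only a sketch: it assumes the agent sends its key in the `DD-API-KEY` request header and that the key is really a GitLab access token. The `TokenValidator` interface, the `staticValidator` fake, and the `tryRequest` helper are hypothetical; a real validator would call GitLab.com rather than look up a map.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// TokenValidator resolves a token to the project it belongs to.
// Hypothetical interface: the real lookup would go to GitLab.com.
type TokenValidator interface {
	Validate(token string) (projectID string, ok bool)
}

// staticValidator is an in-memory fake used for the example.
type staticValidator map[string]string

func (s staticValidator) Validate(token string) (string, bool) {
	projectID, ok := s[token]
	return projectID, ok
}

// projectKey is the context key under which the project ID is stored.
type projectKey struct{}

// withTokenAuth rejects requests without a valid key and passes the
// resolved project ID downstream so later stages can enrich the data.
func withTokenAuth(v TokenValidator, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		token := r.Header.Get("DD-API-KEY") // assumed header name
		projectID, ok := v.Validate(token)
		if token == "" || !ok {
			http.Error(w, "invalid API key", http.StatusForbidden)
			return
		}
		ctx := context.WithValue(r.Context(), projectKey{}, projectID)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// tryRequest sends one request through the middleware and reports the
// status code plus the project ID seen by the downstream handler.
func tryRequest(v TokenValidator, apiKey string) (int, string) {
	var seen string
	next := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		seen, _ = r.Context().Value(projectKey{}).(string)
		w.WriteHeader(http.StatusAccepted)
	})
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/intake", nil)
	if apiKey != "" {
		req.Header.Set("DD-API-KEY", apiKey)
	}
	withTokenAuth(v, next).ServeHTTP(rec, req)
	return rec.Code, seen
}

func main() {
	v := staticValidator{"example-token": "group/project"}
	code, project := tryRequest(v, "example-token")
	fmt.Println(code, project)
	code, project = tryRequest(v, "wrong-token")
	fmt.Println(code, project)
}
```

Resolving the token once at the gateway gives the enrichment step a project to attach to every message without touching the payload itself.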
Questions:
- Should this be hosted in the same environment as GitLab.com?
  - Perhaps a specific node pool?
  - Given the service boundaries it could be in a smaller isolated k8s instance.
- Should source code go into GitLab.com or be separate for the time being (a la Gitaly, Runner etc.)?
  - I would lean towards separate to speed up development, but how should we conduct code review?
- Do we need to implement rate limiting from the start (DataDog specific rate limits)?
- We likely need new subdomains, for the APM gateway and user facing Grafana. Is this acceptable?
Any feedback, thoughts, questions, or suggestions on this design are welcome!