Draft: AutoFlow: Challenges, opportunities, and trade-offs
Problems to solve
- GitLab's ability to provide our customers with native, self-service tools to automate more than pipelines is virtually non-existent.
- Automation is fourth on the list of highest-priority investment areas in 2024, according to our 2024 DevSecOps report, with 27% of respondents indicating that increased automation would improve overall developer satisfaction.
- GitLab's ability to build an automation solution that encompasses the entire platform is significantly hampered by the lack of a centralized, platform-level solution for eventing and stateful workflow management.
- Our historical approach to eventing and automation has primarily been local optimization targeting a single use case, which has led to "eventing fragmentation" and to overlapping and sometimes competing solutions (#344136). GitLab serves diverse personas, from developers to non-developers who have no idea what a Merge Request is.
- All of the current workarounds require extensive knowledge of writing code and authoring pipelines/jobs, which is a non-starter for a large subset of our target audience.
- Current technical solutions such as pipelines/jobs do not scale to meet the emerging needs of many product areas.
- We do not have the necessary infrastructure to support a durable, long-lived workflow engine that consumes events for simple and complex automation tasks.
Current workarounds
- Customers invest significant time and money hand-writing automation tools against our APIs.
- Customers leverage open-source solutions such as gitlab-triage, which is what we use internally for many automation use cases for issues, MRs, branches, and epics.
- Customers use a third-party solution such as Zapier or Unito.io.
Downsides to these workarounds:
- Webhooks are generally not resilient.
- Not all events are covered by existing endpoints.
- Expensive and time-consuming for customers.
- Cannot easily support complex, durable workflows (e.g., updating an issue label when an MR is merged).
- Third-party tools and products only cover a subset of our platform's feature set.
- These approaches are also generally a non-starter for our non-developer personas.
Requirements
- Centralized eventing system
- This does not necessarily mean introducing new tech or reinventing the wheel. We can focus on closing the loop on https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/gitlab_events_platform/. TL;DR: We need to make a decision, align around a standard approach, and then implement any missing pieces.
- Durable, long-lived workflow engine that consumes events for simple and complex automation tasks
- We don't have anything suitable in our current infrastructure to support these use cases. CI/Pipelines is not designed for the more complex, longer-running workflows we hope to support (see #471880[AutoFlow_Poster.png]). gitlab-triage works by triggering pipelines that poll API endpoints for changes and then apply changes based on the policies. This approach is not "reactive" (driven by an event system), and it shares the limitations that rule out CI.
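To make the "reactive" distinction concrete, here is a minimal sketch of event-driven dispatch, using the earlier example of updating an issue label when an MR is merged. Everything in it is an assumption made for illustration: the `Event` and `EventBus` classes, the `merge_request.merged` event name, the `closes_issue` payload field, and the label names are hypothetical and do not describe an existing GitLab API or the eventual AutoFlow design.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Event:
    """Hypothetical event envelope; field names are illustrative only."""
    type: str
    payload: dict = field(default_factory=dict)


class EventBus:
    """Toy in-process bus: handlers run when an event is published, with no polling."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[Event], None]]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[Event], None]) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event: Event) -> int:
        """Deliver the event to every subscriber and return how many handlers ran."""
        handlers = self._handlers.get(event.type, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


# In-memory stand-in for issue labels, keyed by issue ID (hypothetical data).
issue_labels: Dict[int, set] = {42: {"workflow::in review"}}


def mark_issue_complete(event: Event) -> None:
    """React to a merged MR by relabeling the issue it closes."""
    labels = issue_labels[event.payload["closes_issue"]]
    labels.discard("workflow::in review")
    labels.add("workflow::complete")


bus = EventBus()
bus.subscribe("merge_request.merged", mark_issue_complete)
bus.publish(Event("merge_request.merged", {"closes_issue": 42}))
```

The handler runs only when the event occurs, whereas a polling approach has to query the API on a schedule and diff the results. A real system would also need durable delivery and persistence, which this sketch deliberately leaves out.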
What capabilities should our (event-based) automation have? These requirements come from the CD and Dev sections:
- Integrate with all the GitLab data primitives like MRs, Environments, etc.
- Support every GitLab deployment type and market segment.
- Can be optimized to run small tasks, such as a single function, instead of a series of steps. This matters for both UX and cost optimization.
- Example: Run a task on every label change or every comment of every issue within an organization.
- Allow modeling complex workflows that cannot be described in a DAG. Preferably, it allows visualizing the workflow with possible branching, merging, and decisions. We want to support Petri nets, not DAGs [1].
- Example: Deploy to staging. Run some tests. If everything is green, do X. If there are warnings, do Y. If there are errors, do Z. Visualize in a single pipeline definition.
- Support idempotency at the core. This allows well-behaved retries and timeout management.
- Example: Network error during a deployment or rollout.
- Support for long-running processes that need to wait for external triggers for any length of time without wasting resources, and the ability to pick up the task and route correctly (in pre-programmed ways) if needed.
- Example: Database migrations and related post-processing
- Support sub-flows with well-defined inputs and outputs to be passed around.
- Example: Coordinate jobs across projects or even across GitLab instances.
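As an illustration of the idempotency requirement above, the following is a minimal sketch of retrying a step safely by recording its result under an idempotency key. The `IdempotentExecutor` class, the `TransientError` exception, and the key format are hypothetical names introduced for this example; a real workflow engine would persist results durably rather than keep them in memory.

```python
from typing import Callable, Dict, Optional


class TransientError(Exception):
    """Stands in for a recoverable failure such as a network error."""


class IdempotentExecutor:
    """Runs a step at most once per idempotency key, retrying transient failures."""

    def __init__(self, max_attempts: int = 3) -> None:
        if max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")
        self.max_attempts = max_attempts
        self._results: Dict[str, str] = {}  # a real engine would persist this

    def run(self, key: str, step: Callable[[], str]) -> str:
        # A completed step is never re-executed; its recorded result is returned.
        if key in self._results:
            return self._results[key]
        last_error: Optional[TransientError] = None
        for _ in range(self.max_attempts):
            try:
                result = step()
            except TransientError as exc:
                last_error = exc  # retry on the next loop iteration
                continue
            self._results[key] = result
            return result
        assert last_error is not None
        raise last_error


calls = {"count": 0}


def flaky_deploy() -> str:
    """Fails on the first call to simulate a network error during a rollout."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise TransientError("connection reset")
    return "deployed"


executor = IdempotentExecutor()
first = executor.run("deploy:staging:abc123", flaky_deploy)
second = executor.run("deploy:staging:abc123", flaky_deploy)  # served from the record
```

Here the first call retries past the simulated failure, and the second call with the same key does not deploy again. Note that this only makes the engine's bookkeeping idempotent; the step itself must still tolerate being re-run if it fails partway through.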
[1]: While I (Viktor N.) am by no means an expert in workflow modeling, here are a few quotes and resources that introduce Petri nets:
- “As DAGs are directed, it is impossible to define bi-directional coupling schemes between software components. DAGs are acyclic, so it is not feasible to explicitly define loops (while . . . do). A DAG only describes the behavior, but not the state of the system. DAGs generally describe only the workflow, and not the dataflow.” (source)
- DAGs can be seen as a subset of what Petri nets can represent.
- I have some experience implementing ERP systems from earlier in my career. They use process modeling that is capable of expressing Petri nets.
- Wikipedia article on Petri nets
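To show how a Petri net differs from a DAG in practice, here is a minimal sketch of a place/transition net with a cycle, loosely modeled on the "deploy to staging, run tests" example. The `PetriNet` and `Transition` classes and all place and transition names are hypothetical, chosen only for illustration; this is not a proposal for how AutoFlow would represent workflows.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Transition:
    name: str
    inputs: Tuple[str, ...]   # places a token is consumed from
    outputs: Tuple[str, ...]  # places a token is produced into


class PetriNet:
    """Tiny place/transition net: the marking (tokens per place) is the state."""

    def __init__(self, transitions: List[Transition], marking: Dict[str, int]) -> None:
        self.transitions = {t.name: t for t in transitions}
        self.marking = Counter(marking)

    def enabled(self) -> List[str]:
        """Names of transitions whose input places all hold enough tokens."""
        return sorted(
            name
            for name, t in self.transitions.items()
            if all(self.marking[p] >= n for p, n in Counter(t.inputs).items())
        )

    def fire(self, name: str) -> None:
        """Consume one token per input place and produce one per output place."""
        if name not in self.enabled():
            raise ValueError(f"transition {name!r} is not enabled")
        t = self.transitions[name]
        self.marking.subtract(t.inputs)
        self.marking.update(t.outputs)


net = PetriNet(
    transitions=[
        Transition("deploy_to_staging", ("ready",), ("staged",)),
        Transition("tests_green", ("staged",), ("done",)),
        # Loops back to an earlier place: a cycle, which a DAG cannot express.
        Transition("tests_failed", ("staged",), ("ready",)),
    ],
    marking={"ready": 1},
)

net.fire("deploy_to_staging")
choices = net.enabled()      # both test outcomes are possible from "staged"
net.fire("tests_failed")     # go around the loop once
net.fire("deploy_to_staging")
net.fire("tests_green")
final = dict(+net.marking)   # keep only places that still hold tokens
```

The marking makes the workflow's state explicit at every point, and the `tests_failed` transition returns to an earlier place. These are the two properties the quote above says DAGs lack.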
Stages with the most need
- Plan
- Continuous Delivery
- Core Platform (personal productivity)
- ...
What is AutoFlow
See #471880[AutoFlow_Poster.png] for a summary of the goals, concepts, technologies, and use cases AutoFlow is intended to solve.
Architectural blueprint: https://gitlab.com/gitlab-org/gitlab/-/tree/master/doc/architecture/blueprints/autoflow
Roadblocks
- Alignment (and implementation) on a standard eventing approach that will scale across the platform. Was there an outcome to https://handbook.gitlab.com/handbook/company/working-groups/event-stream/?
- Decision
- Documentation
- Drive awareness among product teams
- Reconcile overlapping solutions and have a concrete plan with a defined end date to consolidate them.
- We have not aligned on a solution for a scalable workflow engine.
- Decision
- Documentation
- Drive awareness among product teams
- Reconcile overlapping solutions and have a concrete plan with a defined end date to consolidate them.
- If we do end up going with Temporal, it will require a significant investment from:
- Infrastructure, to bundle Temporal with GitLab distributions by default. The team does not have capacity until FY26.
- Resources/capacity to focus on the underlying BE logic for the workflow engine (~2-3 engineers)
- Resources/capacity to focus on integrating Temporal usage with GitLab's consumption pricing model (for SaaS): https://gitlab.com/gitlab-org/gitlab/-/issues/471718+
- Resources/capacity to build the core FE/UX/UI framework for AutoFlow, allowing any other product team to contribute quickly.