Proposal: Decouple Provisioner Architecture to Increase Extensibility
Root Cause for Change
The nightly High Availability provisioner pipeline started failing when the Ansible community deprecated gce.py as part of the Ansible v2.8 upgrade. The work to replace it turned into a larger engineering question, which spawned this proposal.
Current State
Current Architecture
- terraform: node provisioning tool
- Ansible: service deployment and orchestration tool
- gce.py: inventory module
```mermaid
graph LR;
tf -->|provisioning request|cloudapi
inv -->|request host data|cloudapi
subgraph "Orchestration"
orch[ansible] -->|query hosts|inv[gce.py]
end
subgraph "Cloud Provider"
cloudapi[google cloud] -->|request creation|cluster[GCE machines]
end
subgraph "Provisioner"
tf[terraform] -->|transmit state data|tfstate[terraform state data store];
end
```
Overview of Concerns
Stability
- Ansible inventory modules typically receive only best-effort support
- the gce.py to google_compute transition demonstrates the danger of relying on best-effort support in what amounts to production
- terraform, by contrast, formally supports its provider modules
Extensibility
- terraform supports more cloud providers than Ansible has supported inventory modules
- each additional provider doubles the support burden, because it needs both a terraform provider and an Ansible inventory module
Maintainability
- supporting both tools causes elastic maintenance issues: if a provider changes, we must re-evaluate both terraform and the inventory module to restore service
Modularity
- tight coupling between Ansible and terraform, via jq reading the terraform state directly
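To make the coupling concrete, the following Python sketch does roughly what a jq query over the terraform state has to do. It assumes the terraform state file format version 4 layout and the attribute names of the google provider's google_compute_instance resource; the function names are hypothetical and this is not the playbooks' actual query. The resource type and attribute paths it filters on are provider-specific.

```python
import json


def read_tfstate(path: str) -> dict:
    """Load a terraform state file (assumed: JSON, state format version 4)."""
    with open(path) as f:
        return json.load(f)


def gce_hosts_from_tfstate(state: dict) -> dict:
    """Hypothetical helper: map GCE instance name -> internal IP from raw terraform state."""
    if state.get("version") != 4:
        # The raw state layout is versioned and may change between terraform releases.
        raise ValueError(f"unsupported state format version: {state.get('version')!r}")

    hosts = {}
    for resource in state.get("resources", []):
        # Provider-specific filter: only managed google_compute_instance resources are read.
        if resource.get("mode") != "managed" or resource.get("type") != "google_compute_instance":
            continue
        for instance in resource.get("instances", []):
            attrs = instance["attributes"]
            # Provider-specific attribute path: internal address of the first network interface.
            hosts[attrs["name"]] = attrs["network_interface"][0]["network_ip"]
    return hosts
```

Supporting AWS the same way would mean another branch for aws_instance and its private_ip attribute, which is the doubled support burden described under Extensibility and Maintainability.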
Proposal
Top of Mind During Design
- designing for the ability to provision HA and Geo
- supporting Hybrid Cloud
- enabling customers who do not use terraform
- reducing elastic maintenance (caused by tight coupling between two rapidly changing components)
- reducing time-to-test for new providers in GitLab pipelines
Architecture
- terraform: node provisioning tool
- Ansible: service deployment and orchestration tool
- Inventory Parser: reads a JSON blob to create the Ansible inventory
  - GitLab defines the structure of the JSON
```mermaid
graph LR;
tf -->|provisioning request|cloudapi
subgraph "Orchestration"
orch[ansible] -->|query hosts|tfinv[Inventory Parser]
end
subgraph "Cloud Provider"
cloudapi[cloud service provider] -->|request creation|cluster[provisioned nodes]
end
subgraph "Provisioned State"
tfinv[Inventory Parser] -->|read state|provstate
provstate[node state data]
end
subgraph "Provisioner"
tf[terraform] -->|transmit state data|tfstate[terraform state data store]
tf -->|write state|provstate
end
```
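The Inventory Parser in the diagram above can stay small. The proposal does not fix the JSON FileMap structure, so the following Python sketch assumes a hypothetical schema: a version field plus a list of nodes, each with name, address, groups, and vars. It converts that FileMap into the JSON shape Ansible expects from a dynamic inventory script (groups with a hosts list, plus per-host variables under _meta.hostvars) and answers the --list and --host arguments Ansible passes to such scripts. FILEMAP_PATH, filemap.json, and filemap_to_inventory are illustrative names, not an existing GitLab interface.

```python
import argparse
import json
import os
import sys

# Hypothetical FileMap schema versions this parser understands.
SUPPORTED_VERSIONS = {1}


def filemap_to_inventory(filemap: dict) -> dict:
    """Convert a (hypothetical) JSON FileMap into Ansible dynamic inventory JSON."""
    version = filemap.get("version")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported FileMap version: {version!r}")

    inventory = {"_meta": {"hostvars": {}}}
    for node in filemap.get("nodes", []):
        name = node["name"]
        # ansible_host tells Ansible which address to connect to for this inventory host.
        hostvars = {"ansible_host": node["address"]}
        hostvars.update(node.get("vars", {}))
        inventory["_meta"]["hostvars"][name] = hostvars
        for group in node.get("groups", []):
            inventory.setdefault(group, {"hosts": []})["hosts"].append(name)
    return inventory


def main(argv: list) -> int:
    parser = argparse.ArgumentParser(description="FileMap dynamic inventory (sketch)")
    parser.add_argument("--list", action="store_true", help="print the whole inventory")
    parser.add_argument("--host", help="print variables for a single host")
    args = parser.parse_args(argv)
    if not args.list and args.host is None:
        parser.print_usage(sys.stderr)
        return 1

    # Hypothetical: the FileMap location is passed via an environment variable.
    with open(os.environ.get("FILEMAP_PATH", "filemap.json")) as f:
        inventory = filemap_to_inventory(json.load(f))

    if args.list:
        print(json.dumps(inventory))
    else:
        print(json.dumps(inventory["_meta"]["hostvars"].get(args.host, {})))
    return 0


if __name__ == "__main__" and len(sys.argv) > 1:
    # Ansible calls inventory scripts with --list or --host <hostname>.
    sys.exit(main(sys.argv[1:]))
```

Marking the file executable and running `ansible-inventory -i <script> --list` would show the resulting groups. Because the _meta block already carries every host's variables, Ansible does not need to call --host per host. The same conversion could instead live in an Ansible inventory plugin, as the proposal allows.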
Comparison
Components without Change
- ansible: will remain the orchestration tool of choice
  - currently the most popular
  - core is compliant/compatible with secure deployment requirements
  - lightweight and easy to automate/deploy, including within GitLab pipelines
- terraform:
  - internal use is widespread
  - solid tool with good support
  - compatibility guarantees within major versions
  - direct support for many cloud providers
New Components
- Inventory Parser [optionally implemented as an Ansible inventory plugin]
  - drawback: extra code for us to support, mitigated by the JSON FileMap
  - benefit: eliminates the dependency on unsupported Ansible inventory modules
  - benefit: because we create it, once we are certified it is also certified for use in secure installations
  - benefit: immediately compatible with Hybrid Cloud
  - benefit: decouples our playbooks from jq and terraform for some variable definitions, making them usable by more of our customers
- JSON FileMap of Provisioned Nodes
  - drawback: we have to maintain compatibility of the format
  - benefit: customers who don't use terraform can write this format from any provider, including ones terraform doesn't support, and use our playbooks
  - benefit: updating our internal terraform configurations means the playbooks automatically support any new providers we bring online
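For customers who provision without terraform, producing the FileMap means emitting the agreed JSON from whatever inventory source they already have. The Python sketch below assumes a hypothetical schema (a version field plus a list of nodes with name, address, groups, and vars); Node, build_filemap, validate_filemap, write_filemap, and FILEMAP_VERSION are illustrative names. An explicit version number checked by a validator is one way to keep the compatibility promise listed as a drawback above.

```python
import json
from dataclasses import dataclass, field

# Hypothetical schema version; bump only on incompatible changes to the FileMap.
FILEMAP_VERSION = 1


@dataclass
class Node:
    name: str
    address: str
    groups: list = field(default_factory=list)
    variables: dict = field(default_factory=dict)


def build_filemap(nodes) -> dict:
    """Build a FileMap dict from Node records gathered from any provisioner."""
    return {
        "version": FILEMAP_VERSION,
        "nodes": [
            {"name": n.name, "address": n.address, "groups": list(n.groups), "vars": dict(n.variables)}
            for n in nodes
        ],
    }


def validate_filemap(filemap: dict) -> list:
    """Return a list of problems; an empty list means the FileMap matches the schema."""
    problems = []
    if filemap.get("version") != FILEMAP_VERSION:
        problems.append(f"version must be {FILEMAP_VERSION}, got {filemap.get('version')!r}")
    nodes = filemap.get("nodes")
    if not isinstance(nodes, list):
        return problems + ["nodes must be a list"]

    seen = set()
    for i, node in enumerate(nodes):
        if not isinstance(node, dict):
            problems.append(f"nodes[{i}] must be an object")
            continue
        for key in ("name", "address"):
            value = node.get(key)
            if not isinstance(value, str) or not value:
                problems.append(f"nodes[{i}].{key} must be a non-empty string")
        if not isinstance(node.get("groups", []), list):
            problems.append(f"nodes[{i}].groups must be a list")
        name = node.get("name")
        if isinstance(name, str):
            if name in seen:
                problems.append(f"duplicate node name {name!r}")
            seen.add(name)
    return problems


def write_filemap(path: str, nodes) -> None:
    """Validate, then write the FileMap so the playbooks can consume it."""
    filemap = build_filemap(nodes)
    problems = validate_filemap(filemap)
    if problems:
        raise ValueError("; ".join(problems))
    with open(path, "w") as f:
        json.dump(filemap, f, indent=2)
```

A customer's own tooling (an on-premises asset database export, for example) would only need to produce Node records and call write_filemap; the playbooks never see where the data came from.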
Summary
The core argument for this proposal is:
- eliminating our dependency on Ansible inventory modules also decouples GitLab from specific providers, and our customers from specific provisioners
- elastic maintenance costs as much as maintaining an in-house tool, especially when the in-house tool speeds up support for more providers and hybrid configurations
- increasing our adoptability in secure deployments, and in environments where alternative provisioning systems already exist, adds value overall
Edited by Robert Marshall