Definition of a role limited to managing the lifecycle of workload clusters

What does this MR do and why?

The current way to deploy and update workload clusters relies on running the apply-workload-cluster.sh script with the kubeconfig of the management cluster, which grants cluster-admin privileges.

This feature makes it possible to use apply-workload-cluster.sh with a kubeconfig retrieved from Rancher by a user belonging to the Keycloak workload-clusters-managers group.

This MR proposes a way to grant the minimum privileges required to run apply-workload-cluster.sh, in a revocable way: only the users belonging to a defined user group have these privileges, so the permission to create or modify workload clusters can be granted to someone when an operation is necessary and removed when the operation is over. This is an improvement over the current situation, because the management cluster kubeconfig can't be revoked: if someone gains access to it, that person can be considered cluster-admin forever.

However, the role defined in this MR still grants a very high level of privileges, and someone with these permissions can still do pretty much everything on the management and workload clusters (though a bit less easily than with the management cluster kubeconfig). So this evolution must not be considered a bullet-proof protection, but rather a mitigation: it brings the possibility of limiting such high privileges to certain people for a limited duration. A further improvement will come with the GitOps workflow, which will remove the need for a human to interact directly with the management cluster, and which will provide ways of controlling the cluster values to avoid unwanted modifications on the management and workload clusters.

The MR defines roles granting only the permissions necessary to deploy and remove workload clusters:

  • A ClusterRole that allows creating the namespace of a workload cluster (the first thing done when applying the environment's kustomization), reading nodes (used by apply-workload-cluster.sh), and reading CRDs (used by sylvactl)
  • A Role in the sylva-system namespace that allows reading the configmaps and kustomizations in that namespace, as used by apply-workload-cluster.sh
  • These two roles are assigned to the members of the workload-clusters-managers group in Keycloak (a hedged sketch of what these roles and their binding could look like is given after this list)
  • A ClusterRole that allows performing the required operations in the namespaces of the workload clusters, i.e. on the top-level resources that must be created and deleted there
  • A Kyverno ClusterPolicy which creates a RoleBinding to the above ClusterRole in the namespaces created for workload clusters, identified by the presence of the sylva-project.org/workload-cluster label (a sketch of such a policy is also given after this list)
  • Several adaptations to apply-workload-cluster.sh to allow running the command with the kubeconfig of a user belonging to the workload-clusters-managers group in Keycloak
  • One of these adaptations is to create the namespace first and the rest of the cluster resources afterwards, in order to leave time for the RoleBinding to be created in the cluster's namespace
  • The work had to be split into two units: the RoleTemplate is created first, and the rest of the resources afterwards. If everything is applied at once, the kustomization dry-run fails because the ClusterRoleTemplateBinding does not find the RoleTemplate; for some reason, the creation of the ClusterRoleTemplateBinding is always attempted before that of the RoleTemplate. This is not a problem with classical Roles and RoleBindings.
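
To make the list above more concrete, here is a minimal sketch of what the cluster-scoped role, the namespaced role and their group binding could look like. All resource names and the exact rules and group principal format are assumptions, not the manifests actually shipped by this MR (in particular, the MR may rely on Rancher RoleTemplates rather than plain RBAC objects, as noted above).

```yaml
# Hypothetical sketch -- names and exact rules are assumptions, not the MR's actual manifests.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: workload-cluster-manager              # assumed name
rules:
  # create the namespace of a workload cluster
  - apiGroups: [""]
    resources: ["namespaces"]
    verbs: ["create"]
  # read nodes (used by apply-workload-cluster.sh)
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]
  # read CRDs (used by sylvactl)
  - apiGroups: ["apiextensions.k8s.io"]
    resources: ["customresourcedefinitions"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: workload-cluster-manager              # assumed name
  namespace: sylva-system
rules:
  # read the configmaps consumed by apply-workload-cluster.sh
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]
  # read the Flux kustomizations of the sylva-system namespace
  - apiGroups: ["kustomize.toolkit.fluxcd.io"]
    resources: ["kustomizations"]
    verbs: ["get", "list", "watch"]
---
# Binding of the cluster-scoped role to the Keycloak group; the exact group
# principal name as propagated through Rancher is an assumption.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: workload-cluster-manager
subjects:
  - kind: Group
    name: workload-clusters-managers
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: workload-cluster-manager
  apiGroup: rbac.authorization.k8s.io
```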
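The Kyverno ClusterPolicy mentioned above could take the shape of a generate rule like the one sketched below. Again, this is only an illustration under assumptions: the policy, RoleBinding and ClusterRole names and the group subject are placeholders; only the triggering label (sylva-project.org/workload-cluster) comes from this MR's description.

```yaml
# Hypothetical sketch of the Kyverno ClusterPolicy; names are placeholders.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: workload-cluster-rolebinding           # assumed name
spec:
  rules:
    - name: generate-workload-cluster-rolebinding
      match:
        any:
          - resources:
              kinds:
                - Namespace
              # only namespaces carrying the workload-cluster label
              selector:
                matchExpressions:
                  - key: sylva-project.org/workload-cluster
                    operator: Exists
      generate:
        apiVersion: rbac.authorization.k8s.io/v1
        kind: RoleBinding
        name: workload-clusters-managers        # assumed name
        namespace: "{{request.object.metadata.name}}"
        synchronize: true
        data:
          subjects:
            - kind: Group
              name: workload-clusters-managers  # assumed group principal
              apiGroup: rbac.authorization.k8s.io
          roleRef:
            kind: ClusterRole
            name: workload-cluster-operations   # assumed name of the per-namespace role above
            apiGroup: rbac.authorization.k8s.io
```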

It complements !4856 (closed), which deals with the permissions that the helm operator uses when reconciling the sylva-units HelmRelease, and sylva-projects/sylva-elements/sylva-units-operator!337 (merged), which deals with the permissions granted to the sylva-units-operator.

Closes #1803

Test coverage

Manual testing:

  • create a user in Keycloak and assign it to the workload-clusters-managers group
  • this user logs in to Rancher and retrieves a kubeconfig for the "local" cluster
  • set KUBECONFIG to this kubeconfig from Rancher and run apply-workload-cluster.sh with it. This has also been done from a laptop, outside of any bootstrap VM, with no management-cluster-kubeconfig around. The workload cluster is created, up to the "All done!" message.
  • remove the workload cluster by deleting the sylva-units HelmRelease, then deleting the namespace once the removal is over

CI configuration

Below you can choose test deployment variants to run in this MR's CI.

Click to open the CI configuration

Legend:

  • ☁️ Infra Provider: capd, capo, capm3
  • 🚀 Bootstrap Provider: kubeadm (alias kadm), rke2, okd, ck8s
  • 🐧 Node OS: ubuntu, suse, na, leapmicro
  • 🛠️ Deployment Options: light-deploy, dev-sources, ha, misc, maxsurge-0, logging, no-logging
  • 🎬 Pipeline Scenarios: available scenario list and description

  • 🎬 preview ☁️ capd 🚀 kadm 🐧 ubuntu

  • 🎬 preview ☁️ capo 🚀 rke2 🐧 suse

  • 🎬 preview ☁️ capm3 🚀 rke2 🐧 ubuntu

  • ☁️ capd 🚀 kadm 🛠️ light-deploy 🐧 ubuntu

  • ☁️ capd 🚀 rke2 🛠️ light-deploy 🐧 suse

  • ☁️ capo 🚀 rke2 🐧 suse

  • ☁️ capo 🚀 rke2 🐧 leapmicro

  • ☁️ capo 🚀 kadm 🐧 ubuntu

  • ☁️ capo 🚀 rke2 🎬 rolling-update 🛠️ ha 🐧 ubuntu

  • ☁️ capo 🚀 kadm 🎬 wkld-k8s-upgrade 🐧 ubuntu

  • ☁️ capo 🚀 rke2 🎬 rolling-update-no-wkld 🛠️ ha 🐧 suse

  • ☁️ capo 🚀 rke2 🎬 sylva-upgrade-from-1.4.x 🛠️ ha 🐧 ubuntu

  • ☁️ capo 🚀 rke2 🎬 sylva-upgrade-from-1.4.x 🛠️ ha,misc 🐧 ubuntu

  • ☁️ capo 🚀 rke2 🛠️ ha,misc 🐧 ubuntu

  • ☁️ capm3 🚀 rke2 🐧 suse

  • ☁️ capm3 🚀 kadm 🐧 ubuntu

  • ☁️ capm3 🚀 ck8s 🐧 ubuntu

  • ☁️ capm3 🚀 kadm 🎬 rolling-update-no-wkld 🛠️ ha,misc 🐧 ubuntu

  • ☁️ capm3 🚀 rke2 🎬 wkld-k8s-upgrade 🛠️ ha 🐧 suse

  • ☁️ capm3 🚀 kadm 🎬 rolling-update 🛠️ ha 🐧 ubuntu

  • ☁️ capm3 🚀 rke2 🎬 sylva-upgrade-from-1.4.x 🛠️ ha 🐧 suse

  • ☁️ capm3 🚀 rke2 🛠️ misc,ha 🐧 suse

  • ☁️ capm3 🚀 rke2 🎬 sylva-upgrade-from-1.4.x 🛠️ ha,misc 🐧 suse

  • ☁️ capm3 🚀 kadm 🎬 rolling-update 🛠️ ha 🐧 suse

  • ☁️ capm3 🚀 ck8s 🎬 rolling-update 🛠️ ha 🐧 ubuntu

  • ☁️ capm3 🚀 rke2|okd 🎬 no-update 🐧 ubuntu|na

Global config for deployment pipelines

  • autorun pipelines
  • allow failure on pipelines
  • record sylvactl events

Notes:

  • Enabling autorun will make deployment pipelines run automatically without human interaction
  • Disabling allow failure will make deployment pipelines mandatory for pipeline success
  • If both autorun and allow failure are disabled, deployment pipelines will need manual triggering but will block the pipeline

Be aware: after a configuration change, the pipeline is not triggered automatically. Please run it manually (by clicking the run pipeline button in the Pipelines tab) or push new code.
