version.gitlab.com certificate must be renewed before it expires on Sat Mar 5 at 12:00 UTC

Related to #4.

The certificate on version.gitlab.com expires on Sat, 05 Mar 2022 12:00 UTC. If it expires, we will no longer receive service ping data, which would be an S1 incident.
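The deadline can be confirmed, both now and after the renewal, by reading the expiry date from the certificate the host actually serves. The following is a minimal sketch, assuming `openssl` and GNU `date` are available (as on a typical Linux shell); the helper names `cert_not_after` and `days_until` are hypothetical and not part of any existing tooling.

```shell
# Sketch only: helper names are illustrative assumptions.

# Print the notAfter date of the certificate served by host $1 on port 443.
cert_not_after() {
    host="$1"
    openssl s_client -servername "$host" -connect "$host:443" </dev/null 2>/dev/null \
        | openssl x509 -noout -enddate \
        | sed 's/^notAfter=//'
}

# Print the number of whole days (truncated) from the reference time $2
# (default: now) until the date in $1. Requires GNU date for `-d`.
days_until() {
    expiry_epoch=$(date -u -d "$1" +%s) || return 1
    now_epoch=$(date -u -d "${2:-now}" +%s) || return 1
    echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Usage (contacts the host over the network):
#   days_until "$(cert_not_after version.gitlab.com)"
```

A negative result would mean the served certificate has already expired.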

There is currently no clear plan for who has both the knowledge and the access to renew the certificate.

Ref: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/4#note_857860893

Summary

What happened?

The cluster was auto-upgraded to Kubernetes 1.20, which broke the cert-manager installation.

What needs to be done?

Cert-manager needs to be upgraded through a breaking change (from v0.9.x to v1.7.x). This means it needs to be:

  1. backed up
  2. fully removed
  3. re-installed
  4. restored

But since the installation was last touched with Helm v2, and Helm v2 is very old and does not (officially) support K8s 1.20, we will migrate the release metadata to Helm v3 first.

Plan

Step 1: Migrate to Helm v3

As noted above, Helm v2 does not officially support Kubernetes 1.20, so we migrate the release metadata to Helm v3 before touching cert-manager.

# Ensure that `helm` points to V3
# VERIFY THE OUTPUT
helm version

# Install Helm v2 in PATH. This is on cloud shell, so we install under $HOME so it persists across sessions
curl -fsSLO https://get.helm.sh/helm-v2.17.0-linux-amd64.tar.gz
tar xf helm-v2.17.0-linux-amd64.tar.gz
mkdir -p ~/bin
mv linux-amd64/helm ~/bin/helm2
mv linux-amd64/tiller ~/bin/tiller

# Install helm-2to3 plugin. This also gets installed under $HOME
helm plugin install https://github.com/helm/helm-2to3.git

# Init local tiller
export KUBE_NAMESPACE=gitlab-managed-apps
export TILLER_NAMESPACE=$KUBE_NAMESPACE
export HELM_HOST="localhost:44134"
tiller -listen "$HELM_HOST" &
helm2 init --client-only

# Grab releases to be migrated
releases=$(helm2 ls --output json | jq -r '.Releases[].Name')

# Adopt all resources with annotations and labels in case they are not part of the persisted release data
for release in $releases; do
    chart=$(helm2 ls "^$release\$" --output json | jq -r '.Releases[0].Chart')
    echo "Adopting Helm v2 manifests from $release"
    # some resource kinds must be listed explicitly https://github.com/kubernetes/kubernetes/issues/42885
    for name in $(kubectl -n "$KUBE_NAMESPACE" get all,ingress,daemonset -o name -l chart="$chart"); do
        kubectl annotate -n "$KUBE_NAMESPACE" --overwrite "$name" meta.helm.sh/release-name="$release"
        kubectl annotate -n "$KUBE_NAMESPACE" --overwrite "$name" meta.helm.sh/release-namespace="$KUBE_NAMESPACE"
        kubectl label -n "$KUBE_NAMESPACE" --overwrite "$name" app.kubernetes.io/managed-by=Helm
    done
done

# Migrate each release
for release in $releases; do
    echo "Migrating release: $release"
    helm 2to3 convert --ignore-already-migrated --release-storage configmaps --tiller-out-cluster --tiller-ns "$TILLER_NAMESPACE" "$release"
done

# Kill Tiller so we don't accidentally use Helm 2 during the next steps
killall tiller
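
Before moving on, it may be worth confirming that Helm v3 can see every release that was just converted. The following is a minimal sketch, assuming `helm` is Helm v3 and that `$releases` still holds the names collected above; the function name `missing_releases` is hypothetical.

```shell
# Sketch only: `missing_releases` is an illustrative helper, not part of Helm.

# Print each expected release name (arguments after the namespace) that
# Helm v3 does not list in namespace $1. Prints nothing if all are present.
missing_releases() {
    namespace="$1"
    shift
    # --all includes releases in any state; --short prints names only
    migrated=$(helm ls --namespace "$namespace" --all --short) || return 1
    for release in "$@"; do
        printf '%s\n' "$migrated" | grep -qxF -- "$release" || echo "$release"
    done
}

# Usage ($releases is intentionally unquoted so it splits into names):
#   missing_releases "$KUBE_NAMESPACE" $releases
```

Any name printed here would indicate a release that is not yet visible to Helm v3 and should be investigated before Step 2.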

Step 2: Fix cert-manager

  1. Back up the cert-manager resources as per https://cert-manager.io/docs/installation/upgrading/upgrading-0.10-0.11/. If all goes well, we will not use this backup, but we keep it around for reference just in case.

  2. Prepare an updated cluster-issuer based on the old issuer:

    # issuer.yaml
    apiVersion: cert-manager.io/v1
    kind: ClusterIssuer
    metadata:
      name: letsencrypt-prod
    spec:
      acme:
        email: dsylva@gitlab.com
        solvers:
          - http01:
              ingress:
                class: nginx
        privateKeySecretRef:
          name: letsencrypt-prod
        server: https://acme-v02.api.letsencrypt.org/directory

  3. Uninstall cert-manager:

    # review the resources to delete
    kubectl get Issuers,ClusterIssuers,Certificates,CertificateRequests,Orders,Challenges --all-namespaces
    
    # if all good, delete them
    kubectl delete Issuers,ClusterIssuers,Certificates,CertificateRequests,Orders,Challenges --all --all-namespaces
    
    # Uninstall certmanager. This is using Helm v3, so there is no --purge
    helm -n gitlab-managed-apps uninstall certmanager
    
    # Remove the legacy CRDs
    kubectl delete -f https://raw.githubusercontent.com/jetstack/cert-manager/release-0.9/deploy/manifests/00-crds.yaml

  4. Re-install cert-manager and immediately apply the issuer as a follow-up:

    helm repo add jetstack https://charts.jetstack.io
    helm repo update
    
    helm upgrade --install \
        certmanager jetstack/cert-manager \
        --namespace gitlab-managed-apps \
        --version v1.7.1 \
        --set installCRDs=true \
        --set ingressShim.defaultIssuerKind=ClusterIssuer \
        --set ingressShim.defaultIssuerName=letsencrypt-prod
    
    kubectl apply -f issuer.yaml
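
After the re-install, the result can be checked before declaring the certificate safe. The following is a minimal sketch, assuming the release name `certmanager`, the issuer name `letsencrypt-prod`, and that the chart labels its Deployments with `app.kubernetes.io/instance=<release name>`; the function name `verify_cert_manager` and the timeouts are hypothetical choices.

```shell
# Sketch only: `verify_cert_manager` and its timeouts are illustrative assumptions.

# Wait for the cert-manager Deployments and the ClusterIssuer to become ready,
# then list all Certificates so their READY column can be reviewed.
verify_cert_manager() {
    namespace="${1:-gitlab-managed-apps}"

    # Deployments belonging to the Helm release (assumed label, see above)
    kubectl -n "$namespace" wait --for=condition=Available deployment \
        -l app.kubernetes.io/instance=certmanager --timeout=180s || return 1

    # The issuer applied from issuer.yaml should report a Ready condition
    kubectl wait --for=condition=Ready clusterissuer/letsencrypt-prod \
        --timeout=120s || return 1

    # Certificates are re-created from the Ingress annotations by ingress-shim
    kubectl get certificates --all-namespaces
}

# Usage:
#   verify_cert_manager gitlab-managed-apps
```

If the function returns non-zero, the Step 3 cleanup is best postponed until the cause is understood.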

Step 3: Cleanup

If everything went well, then we can relatively safely remove the Helm v2 release data.

# Init local tiller
export KUBE_NAMESPACE=gitlab-managed-apps
export TILLER_NAMESPACE=$KUBE_NAMESPACE
export HELM_HOST="localhost:44134"
tiller -listen "$HELM_HOST" &
helm2 init --client-only

# Delete Helm 2 release data
helm 2to3 cleanup --skip-confirmation --release-storage configmaps --tiller-out-cluster --tiller-ns "$TILLER_NAMESPACE"

# Kill local Tiller
killall tiller
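
To confirm the cleanup, one option is to look for leftover Helm v2 release records. The following is a minimal sketch, assuming the ConfigMap storage backend used above and that Tiller labels its release ConfigMaps with `OWNER=TILLER`; the function name `leftover_helm2_data` is hypothetical.

```shell
# Sketch only: `leftover_helm2_data` is an illustrative helper.

# Print the names of ConfigMaps still labelled as Tiller-owned in namespace $1.
# Empty output suggests the Helm v2 release data has been removed.
leftover_helm2_data() {
    kubectl -n "${1:-gitlab-managed-apps}" get configmaps -l OWNER=TILLER -o name
}

# Usage:
#   if [ -n "$(leftover_helm2_data)" ]; then
#       echo "Helm v2 release data still present"
#   fi
```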

Edited by Hordur Freyr Yngvason