Document expected failure/retry behavior (and maybe make it configurable?)
Following up from #96 (closed), it appears that part of our problem when attempting to deploy cert-manager (and a default ClusterIssuer resource) from the same GitOps repo is that the webhook server can take some time to start up. This results in errors like this during apply:
error when creating "manifests/default_cert-manager.io_v1_clusterissuer_letsencrypt-prod.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post "https://cert-manager-webhook.appdat-system.svc:443/mutate?timeout=10s": dial tcp 10.100.49.50:443: connect: connection refused
All that really needs to be done to make it work is to wait and retry the apply for the ClusterIssuer resource. As far as we have been able to tell, the agent does not attempt any retries when resource creation fails. We also haven't quite figured out what the expected behaviour is when an apply fails. Does the agent attempt to prune resources from the failed sync, or should we expect anything that was successfully applied to persist?
We are OK with the workaround of applying the cert-manager deployment and dependent resources in separate loops (manifest_projects entries) for now, but for that to be viable the failing loop needs to be retried, and we don't see this happening.
I can't tell exactly how #15 (closed) was implemented, but it sounds like retries were disabled in order to avoid infinite loops? This makes sense as a default, but perhaps we could have some sort of option to enable retries with exponential backoff as well?
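To make the request a bit more concrete, here is a rough sketch of the kind of bounded retry we have in mind. It is written in Python purely for illustration and is not based on the agent's code; `apply_with_backoff`, `apply_fn`, and all of the parameters are hypothetical names, not existing agent options.

```python
import time


def apply_with_backoff(apply_fn, max_attempts=5, base_delay=1.0,
                       max_delay=60.0, sleep=time.sleep):
    """Call apply_fn until it succeeds or max_attempts is reached.

    Hypothetical helper: apply_fn stands in for whatever applies one
    manifest project and is assumed to raise an exception on failure.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return apply_fn()
        except Exception:
            # The attempt count is bounded, so this cannot loop forever.
            if attempt == max_attempts:
                raise
            sleep(delay)
            # Double the wait after each failure, capped at max_delay.
            delay = min(delay * 2, max_delay)
```

With the defaults above, a webhook that only becomes reachable after a few seconds would be retried after 1s, 2s, 4s, and 8s before the sync is reported as failed. Catching every exception is a simplification; a real implementation would presumably retry only on transient errors such as the connection refused case above.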
Thanks in advance!
/cc @ash2k