Leader election stops permanently if agent temporarily loses connection to Kubernetes API
While working on the Category:Remote Development feature, we noticed in local development that after the Kubernetes cluster is restarted, the remote development module never starts again.
Deprecated investigation
Upon closer inspection, this appears to be due to the leader election logic: the leaderelection.LeaderElector fails to renew the leader lease (it cannot reach the lock), so it stops leading.
{"level":"info","time":"2023-05-04T12:16:15.781+0200","msg":"failed to renew lease default/agent-1-lock: timed out waiting for the condition\n","agent_id":1}
{"level":"info","time":"2023-05-04T12:16:15.781+0200","msg":"Event occurred","object":{"name":"agent-1-lock","namespace":"default"},"fieldPath":"","kind":"Lease","apiVersion":"coordination.k8s.io/v1","type":"Normal","reason":"LeaderElection","message":"agentk stopped leading","agent_id":1}
The LeaderElector retries accessing the lock for roughly 2 minutes before giving up permanently. After that, the agent never tries to become a leader again (note the retry limit exceeded! at the end of the output below).
{"level":"error","time":"2023-05-04T14:28:50.106+0200","msg":"Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:\"\", APIVersion:\"\"}, ObjectMeta:v1.ObjectMeta{Name:\"agent-1-lock.175bf078a10ee188\", GenerateName:\"\", Namespace:\"default\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:\"Lease\", Namespace:\"default\", Name:\"agent-1-lock\", UID:\"3934ee8a-2419-41cd-b5ac-31097f2eb9b9\", APIVersion:\"coordination.k8s.io/v1\", ResourceVersion:\"2822964\", FieldPath:\"\"}, Reason:\"LeaderElection\", Message:\"agentk stopped leading\", Source:v1.EventSource{Component:\"gitlab-agent\", Host:\"\"}, FirstTimestamp:time.Date(2023, time.May, 4, 14, 28, 6, 641861000, time.Local), LastTimestamp:time.Date(2023, time.May, 4, 14, 28, 6, 641861000, time.Local), Count:1, Type:\"Normal\", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:\"\", Related:(*v1.ObjectReference)(nil), ReportingController:\"\", ReportingInstance:\"\"}': 'Post \"https://127.0.0.1:6443/api/v1/namespaces/default/events\": dial tcp 127.0.0.1:6443: connect: connection refused'(may retry after sleeping)\n","agent_id":1}
{"level":"error","time":"2023-05-04T14:29:00.110+0200","msg":"Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:\"\", APIVersion:\"\"}, ObjectMeta:v1.ObjectMeta{Name:\"agent-1-lock.175bf078a10ee188\", GenerateName:\"\", Namespace:\"default\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:\"Lease\", Namespace:\"default\", Name:\"agent-1-lock\", UID:\"3934ee8a-2419-41cd-b5ac-31097f2eb9b9\", APIVersion:\"coordination.k8s.io/v1\", ResourceVersion:\"2822964\", FieldPath:\"\"}, Reason:\"LeaderElection\", Message:\"agentk stopped leading\", Source:v1.EventSource{Component:\"gitlab-agent\", Host:\"\"}, FirstTimestamp:time.Date(2023, time.May, 4, 14, 28, 6, 641861000, time.Local), LastTimestamp:time.Date(2023, time.May, 4, 14, 28, 6, 641861000, time.Local), Count:1, Type:\"Normal\", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:\"\", Related:(*v1.ObjectReference)(nil), ReportingController:\"\", ReportingInstance:\"\"}': 'Post \"https://127.0.0.1:6443/api/v1/namespaces/default/events\": dial tcp 127.0.0.1:6443: connect: connection refused'(may retry after sleeping)\n","agent_id":1}
{"level":"error","time":"2023-05-04T14:29:10.112+0200","msg":"Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:\"\", APIVersion:\"\"}, ObjectMeta:v1.ObjectMeta{Name:\"agent-1-lock.175bf078a10ee188\", GenerateName:\"\", Namespace:\"default\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:\"Lease\", Namespace:\"default\", Name:\"agent-1-lock\", UID:\"3934ee8a-2419-41cd-b5ac-31097f2eb9b9\", APIVersion:\"coordination.k8s.io/v1\", ResourceVersion:\"2822964\", FieldPath:\"\"}, Reason:\"LeaderElection\", Message:\"agentk stopped leading\", Source:v1.EventSource{Component:\"gitlab-agent\", Host:\"\"}, FirstTimestamp:time.Date(2023, time.May, 4, 14, 28, 6, 641861000, time.Local), LastTimestamp:time.Date(2023, time.May, 4, 14, 28, 6, 641861000, time.Local), Count:1, Type:\"Normal\", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:\"\", Related:(*v1.ObjectReference)(nil), ReportingController:\"\", ReportingInstance:\"\"}': 'Post \"https://127.0.0.1:6443/api/v1/namespaces/default/events\": dial tcp 127.0.0.1:6443: connect: connection refused'(may retry after sleeping)\n","agent_id":1}
{"level":"error","time":"2023-05-04T14:29:20.114+0200","msg":"Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:\"\", APIVersion:\"\"}, ObjectMeta:v1.ObjectMeta{Name:\"agent-1-lock.175bf078a10ee188\", GenerateName:\"\", Namespace:\"default\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:\"Lease\", Namespace:\"default\", Name:\"agent-1-lock\", UID:\"3934ee8a-2419-41cd-b5ac-31097f2eb9b9\", APIVersion:\"coordination.k8s.io/v1\", ResourceVersion:\"2822964\", FieldPath:\"\"}, Reason:\"LeaderElection\", Message:\"agentk stopped leading\", Source:v1.EventSource{Component:\"gitlab-agent\", Host:\"\"}, FirstTimestamp:time.Date(2023, time.May, 4, 14, 28, 6, 641861000, time.Local), LastTimestamp:time.Date(2023, time.May, 4, 14, 28, 6, 641861000, time.Local), Count:1, Type:\"Normal\", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:\"\", Related:(*v1.ObjectReference)(nil), ReportingController:\"\", ReportingInstance:\"\"}': 'Post \"https://127.0.0.1:6443/api/v1/namespaces/default/events\": dial tcp 127.0.0.1:6443: connect: connection refused'(may retry after sleeping)\n","agent_id":1}
{"level":"error","time":"2023-05-04T14:29:30.117+0200","msg":"Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:\"\", APIVersion:\"\"}, ObjectMeta:v1.ObjectMeta{Name:\"agent-1-lock.175bf078a10ee188\", GenerateName:\"\", Namespace:\"default\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:\"Lease\", Namespace:\"default\", Name:\"agent-1-lock\", UID:\"3934ee8a-2419-41cd-b5ac-31097f2eb9b9\", APIVersion:\"coordination.k8s.io/v1\", ResourceVersion:\"2822964\", FieldPath:\"\"}, Reason:\"LeaderElection\", Message:\"agentk stopped leading\", Source:v1.EventSource{Component:\"gitlab-agent\", Host:\"\"}, FirstTimestamp:time.Date(2023, time.May, 4, 14, 28, 6, 641861000, time.Local), LastTimestamp:time.Date(2023, time.May, 4, 14, 28, 6, 641861000, time.Local), Count:1, Type:\"Normal\", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:\"\", Related:(*v1.ObjectReference)(nil), ReportingController:\"\", ReportingInstance:\"\"}': 'Post \"https://127.0.0.1:6443/api/v1/namespaces/default/events\": dial tcp 127.0.0.1:6443: connect: connection refused'(may retry after sleeping)\n","agent_id":1}
{"level":"error","time":"2023-05-04T14:29:40.119+0200","msg":"Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:\"\", APIVersion:\"\"}, ObjectMeta:v1.ObjectMeta{Name:\"agent-1-lock.175bf078a10ee188\", GenerateName:\"\", Namespace:\"default\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:\"Lease\", Namespace:\"default\", Name:\"agent-1-lock\", UID:\"3934ee8a-2419-41cd-b5ac-31097f2eb9b9\", APIVersion:\"coordination.k8s.io/v1\", ResourceVersion:\"2822964\", FieldPath:\"\"}, Reason:\"LeaderElection\", Message:\"agentk stopped leading\", Source:v1.EventSource{Component:\"gitlab-agent\", Host:\"\"}, FirstTimestamp:time.Date(2023, time.May, 4, 14, 28, 6, 641861000, time.Local), LastTimestamp:time.Date(2023, time.May, 4, 14, 28, 6, 641861000, time.Local), Count:1, Type:\"Normal\", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:\"\", Related:(*v1.ObjectReference)(nil), ReportingController:\"\", ReportingInstance:\"\"}': 'Post \"https://127.0.0.1:6443/api/v1/namespaces/default/events\": dial tcp 127.0.0.1:6443: connect: connection refused'(may retry after sleeping)\n","agent_id":1}
{"level":"error","time":"2023-05-04T14:29:50.121+0200","msg":"Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:\"\", APIVersion:\"\"}, ObjectMeta:v1.ObjectMeta{Name:\"agent-1-lock.175bf078a10ee188\", GenerateName:\"\", Namespace:\"default\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:\"Lease\", Namespace:\"default\", Name:\"agent-1-lock\", UID:\"3934ee8a-2419-41cd-b5ac-31097f2eb9b9\", APIVersion:\"coordination.k8s.io/v1\", ResourceVersion:\"2822964\", FieldPath:\"\"}, Reason:\"LeaderElection\", Message:\"agentk stopped leading\", Source:v1.EventSource{Component:\"gitlab-agent\", Host:\"\"}, FirstTimestamp:time.Date(2023, time.May, 4, 14, 28, 6, 641861000, time.Local), LastTimestamp:time.Date(2023, time.May, 4, 14, 28, 6, 641861000, time.Local), Count:1, Type:\"Normal\", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:\"\", Related:(*v1.ObjectReference)(nil), ReportingController:\"\", ReportingInstance:\"\"}': 'Post \"https://127.0.0.1:6443/api/v1/namespaces/default/events\": dial tcp 127.0.0.1:6443: connect: connection refused'(may retry after sleeping)\n","agent_id":1}
{"level":"error","time":"2023-05-04T14:29:50.121+0200","msg":"Unable to write event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:\"\", APIVersion:\"\"}, ObjectMeta:v1.ObjectMeta{Name:\"agent-1-lock.175bf078a10ee188\", GenerateName:\"\", Namespace:\"default\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:\"Lease\", Namespace:\"default\", Name:\"agent-1-lock\", UID:\"3934ee8a-2419-41cd-b5ac-31097f2eb9b9\", APIVersion:\"coordination.k8s.io/v1\", ResourceVersion:\"2822964\", FieldPath:\"\"}, Reason:\"LeaderElection\", Message:\"agentk stopped leading\", Source:v1.EventSource{Component:\"gitlab-agent\", Host:\"\"}, FirstTimestamp:time.Date(2023, time.May, 4, 14, 28, 6, 641861000, time.Local), LastTimestamp:time.Date(2023, time.May, 4, 14, 28, 6, 641861000, time.Local), Count:1, Type:\"Normal\", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:\"\", Related:(*v1.ObjectReference)(nil), ReportingController:\"\", ReportingInstance:\"\"}' (retry limit exceeded!)\n","agent_id":1}
Steps to reproduce
- Have a leader-only module enabled in your agent.
- Start the agent.
- Observe that the module works.
- Make the cluster unavailable (for example, turn off your local cluster or turn off Wi-Fi).
- Observe the lock-related messages quoted in this issue description.
- Make the cluster available again.
- Observe that the agent does not try to regain leadership and the leader-only module no longer runs.
Investigation Update: 11 May 2023
What is happening under the hood
There are multiple reasons why agentk is unable to recover from Kubernetes cluster disconnects.
Let's begin with some observations on what happens once the cluster disconnects. As long as the leader replica is able to maintain/renew its leadership claims, elector.Run() within leader_runner.go blocks at the line marked below:
```go
func (l *leaseLeaderElector) Run(ctx context.Context, onStartedLeading, onStoppedLeading func()) {
    name, err := l.name(ctx)
    if err != nil { .. }
    elector, err := leaderelection.NewLeaderElector(..)
    if err != nil { .. }
    // This call is blocked while the replica is able to renew its leadership claims.
    // As soon as it is unable to renew its lease, this method exits.
    elector.Run(ctx)
}
```
Once the cluster disconnects, the elector executes the onStoppedLeading callback hook, which causes the <-leaderRunner.onStoppedLeading case to become ready in the main leaderRunner.Run logic loop.
As a quick reminder, leaderRunner.Run() contains a for loop with a select block that subscribes to the different events (delivered as channels). This loop is responsible for updating the internal state of the leader_runner based on those events.
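For readers less familiar with this pattern, here is a tiny, self-contained illustration of the for + select event-loop shape. The runner type, channels, and states below are made up for demonstration; this is not the gitlab-agent code:

```go
package main

import (
    "context"
    "fmt"
    "time"
)

type runner struct {
    onStartedLeading chan struct{}
    onStoppedLeading chan struct{}
    status           string
}

// run loops forever, updating internal state based on whichever event channel
// becomes ready (select picks among ready cases in random order).
func (r *runner) run(ctx context.Context) {
    for {
        select {
        case <-ctx.Done():
            return
        case <-r.onStartedLeading:
            r.status = "leader"
            fmt.Println("became leader, starting leader-only modules")
        case <-r.onStoppedLeading:
            r.status = "not leader"
            fmt.Println("lost leadership, stopping leader-only modules")
        }
    }
}

func main() {
    r := &runner{
        onStartedLeading: make(chan struct{}, 1),
        onStoppedLeading: make(chan struct{}, 1),
    }
    r.onStartedLeading <- struct{}{}
    r.onStoppedLeading <- struct{}{}

    ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
    defer cancel()
    r.run(ctx) // processes both events, then returns when ctx times out
}
```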
So when leadership is lost, leaderRunner.Run() starts executing the following logic:
```go
case <-r.onStoppedLeading:
    switch r.status {
    ...
    case runningAndLeader:
        for _, holder := range r.funcs {
            holder.cancel()
        }
        for idx, holder := range r.funcs {
            holder.cancel = func() {}
            holder.wait() // BUG 1: the loop will first block here
            holder.wait = func() {}
            r.funcs[idx] = holder
        }
    ...
    }
```
As noted in the code comment, the first bug is that holder.wait() will block. This is because leader-only modules do not rely on ctx.Done() checks as their signal to exit; they rely on the closing of their configuration channel to stop running. The implication is that a module may internally stop doing work if it propagates ctx / checks ctx.Done(), but it will never exit because the config channel is still open.
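The following self-contained sketch (not the actual module code; all names are illustrative) shows why such a module can outlive its cancelled context: cancelling the context stops the work, but the goroutine only returns once its config channel is closed, so anything waiting on it blocks:

```go
package main

import (
    "context"
    "fmt"
    "sync"
)

// runModule mimics a leader-only module: it ranges over configCh and only
// returns once configCh is closed. Cancelling ctx stops the actual work,
// but the goroutine itself keeps waiting on the channel.
func runModule(ctx context.Context, configCh <-chan string, wg *sync.WaitGroup) {
    defer wg.Done()
    for cfg := range configCh { // exits only when configCh is closed
        select {
        case <-ctx.Done():
            fmt.Println("ctx cancelled, ignoring config:", cfg)
        default:
            fmt.Println("applying config:", cfg)
        }
    }
    fmt.Println("config channel closed, module exits")
}

func main() {
    ctx, cancel := context.WithCancel(context.Background())
    configCh := make(chan string)

    var wg sync.WaitGroup
    wg.Add(1)
    go runModule(ctx, configCh, &wg)

    configCh <- "config v1"
    cancel() // analogous to holder.cancel(): stops the module's work ...

    // ... but wg.Wait() (analogous to holder.wait()) would block forever here
    // unless the config channel is closed as well:
    close(configCh)
    wg.Wait()
}
```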
Assume bug 1 is fixed
Let's say this bug is fixed, either by making leader-only modules perform ctx cancellation checks (a hacky/inelegant fix) or by fixing this logic so that modules are stopped by closing their config channels (a cleaner fix, but more involved). This still does not solve the problem of leader re-election attempts.
Assuming the bug is fixed and the running modules are cancelled/stopped by the logic under the case <-r.onStoppedLeading: branch, the logic will eventually start the elector again:
```go
case <-r.onStoppedLeading:
    switch r.status {
    ...
    case runningAndLeader:
        for _, holder := range r.funcs {
            holder.cancel()
        }
        for idx, holder := range r.funcs {
            holder.cancel = func() {}
            holder.wait() // BUG 1: the loop will first block here
            holder.wait = func() {}
            r.funcs[idx] = holder
        }
        ...
        if len(r.funcs) > 0 {
            // the elector is started again
            r.startElector() // nolint: contextcheck
        } else { .. }
    ...
    }
```
However, for each stopped module the callback channel <-r.stopFunc now becomes ready, and this is something the leader_runner subscribes to as well. For each stopped module, the following logic is executed:
```go
case stop := <-r.stopFunc:
    var stopFunc func()
    holder, ok := r.funcs[stop.idx]
    if ok {
        holder.cancel()
        delete(r.funcs, stop.idx)
        stopFunc = holder.wait
        if len(r.funcs) == 0 { // removed last function
            r.electorCancel()
            r.electorCancel = nil
            r.status = stopping
        }
    } else { // stop() called more than once
        stopFunc = func() {}
    }
    stop.waitForStop <- stopFunc
```
This cancels the restarted elector and eventually puts the leader_runner in a stopping state. In this state, the leader_module_wrapper may still be able to schedule execution for when the agent becomes leader, but the cancelled elector prevents the replica from ever becoming the leader again.
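To illustrate that last point, here is a tiny, self-contained demonstration (not agentk code) that work started with an already-cancelled context returns immediately and never runs again, which is why the cancelled elector can never re-acquire leadership:

```go
package main

import (
    "context"
    "fmt"
)

// runUntilDone stands in for the exit behaviour of elector.Run: client-go's
// elector returns once its context is done, and nothing restarts it afterwards.
func runUntilDone(ctx context.Context) {
    <-ctx.Done()
    fmt.Println("run returned:", ctx.Err())
}

func main() {
    ctx, cancel := context.WithCancel(context.Background())
    cancel()          // analogous to r.electorCancel() firing via the stopFunc handling
    runUntilDone(ctx) // returns immediately; no further election attempts happen
}
```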
Summary of facts so far
- Based on the above findings, solving this problem requires touching some internal state machine logic within leader_runner, which may not be so straightforward.
- Given how leader_runner reacts to Kubernetes disconnects, we know that the contexts are cancelled, so most leader-only modules should at least cease operation if they rely on the passed contexts, even if they then block on open config channels. There should therefore be little risk of a non-leader replica running ongoing leader-only logic, even though it is in a broken state.
- Kubernetes disconnects should be a rare occurrence (if they happen at all), otherwise this would have been raised in the past, given that it can completely halt leader-only modules beyond recovery.
What can be done
- There is some amount of non-trivial logic that deals with cancelling a running elector. Part of this may have to do with not wanting to run leader election when all leader-only modules have been disabled through configuration. While this optimization makes sense, it is unlikely to ever be required because the starboard_vulnerability module (leader-only) is permanently enabled. Taking out this optimization may simplify the leader_runner logic and make it easier to add a fix for this issue (a rough sketch of one possible direction follows this list).
- This bug is closely tied to the state machine logic, which could be made brittle through manipulation. At the same time, the causal factors seem rare enough that this has not been observed before. So perhaps this issue can be parked for the time being until something merits the change.
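Purely as a sketch of one possible direction (not a reviewed fix; the newElector factory below is hypothetical, and the leader_runner's start/stop bookkeeping would need to be adjusted to match), the elector could be re-run in a loop so that losing the lease is not treated as a terminal event:

```go
package leaderfix

import (
    "context"
    "time"

    "k8s.io/client-go/tools/leaderelection"
)

// runElectionLoop keeps campaigning for leadership until ctx is cancelled,
// instead of giving up permanently once elector.Run returns.
func runElectionLoop(ctx context.Context, newElector func() (*leaderelection.LeaderElector, error)) error {
    for {
        elector, err := newElector()
        if err != nil {
            return err
        }
        // Run blocks until leadership is lost or ctx is cancelled.
        elector.Run(ctx)
        select {
        case <-ctx.Done():
            return nil // agent is shutting down
        case <-time.After(5 * time.Second): // back off before campaigning again
        }
    }
}
```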