
DNS Resolution problems on old Google Domains.

This is a possibly historical bug that may or may not be relevant, and I'm including it in this import of old issues from the other platform. Reasonable next steps would be to research the situation, then either implement permanent changes in Ansible or just close it out as no longer relevant.

Note that before/during this bug I was using Google Domains, which has since been sold off to Squarespace Domains, so this may no longer be relevant. Or it might have been part of the sale process LOL.

Can't install NPM packages on Node-RED because DNS queries time out.

From the shell, "ping www.google.com" fails to resolve.

"nslookup www.google.com" returns a non-authoritative answer from 10.43.0.10, which is not cached because it fails if I try pinging again.

cat /etc/resolv.conf
search nodered.svc.cluster.local svc.cluster.local cluster.local cedar.mulhollon.com mulhollon.com
nameserver 10.43.0.10
options ndots:5

ping www.google.com. (note the FQDN trailing dot) instantly resolves and works.

General internet discussion indicates some DNS problems resolving external addresses can be eliminated by enabling the nodelocal DNS cache.

https://docs.rke2.io/networking#nodelocal-dnscache

I did that with no change in results (although nothing got worse, I think). Note that you have to add the ipvs option if you have ipvs enabled (which I do).
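
For reference, enabling the nodelocal cache on RKE2 is a HelmChartConfig for the rke2-coredns chart, roughly like this, dropped into /var/lib/rancher/rke2/server/manifests/ on a server node (reconstructed from memory, so double-check the linked docs for the exact keys, especially the ipvs one):

apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-coredns
  namespace: kube-system
spec:
  valuesContent: |-
    nodelocal:
      enabled: true
      ipvs: true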

General internet discussion indicates that "options ndots:5" causes problems with FQDN vs non-FQDN lookups, usually manifesting as timeouts. My queries fail instantly rather than timing out, although "npm install node-red-contrib-protobuf" does take about 90 seconds to fail with "FetchError: request to https://registry.npmjs.org/node-red-contrib-protobuf failed, reason: getaddrinfo ENOTFOUND registry.npmjs.org".
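
For context, with ndots:5 and the search list from the resolv.conf above, a lookup for www.google.com (only two dots) gets expanded through every search suffix before the bare name is ever tried:

www.google.com.nodered.svc.cluster.local
www.google.com.svc.cluster.local
www.google.com.cluster.local
www.google.com.cedar.mulhollon.com
www.google.com.mulhollon.com
www.google.com.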

Note that ping registry.npmjs.org fails, ping registry.npmjs.org. instantly resolves (and pings), and nslookup registry.npmjs.org resolves correctly and instantly.

Switched to a random other pod, a linuxserver/dokuwiki installation. In its shell, ping www.google.com returns "ping: bad address 'www.google.com'", while ping www.google.com. works instantly and successfully.

Methodically working thru https://ranchermanager.docs.rancher.com/troubleshooting/other-troubleshooting-tips/dns

kubectl -n kube-system get pods -l k8s-app=kube-dns
This works.

kubectl -n kube-system get svc -l k8s-app=kube-dns
This works.

kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup kubernetes.default
"nslookup: can't resolve 'kubernetes.default'"

This fails, which seems to be a problem.

Next step, restart the currently running pods. They were last restarted 3 days ago during a K8S upgrade. The two rke2-coredns-rke2-coredns pods are maintained by the rke2-coredns-rke2-coredns replicaset. I restarted one pod; nothing interesting happened, and the logs look normal on the new pod. The busybox DNS query to kubernetes.default still fails. I restarted the other pod, so now I have two freshly restarted pods. Logs look normal and boring on the second restarted pod. The pod images are rancher/hardened-coredns:v1.11.1-build20240305. The busybox query to kubernetes.default fails the same as before.
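
For the record, the same thing in one shot would have been a rollout restart of the deployment (assuming the deployment is named like the replicaset minus the hash suffix):

kubectl -n kube-system rollout restart deployment rke2-coredns-rke2-coredns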

kubectl -n kube-system get pods -l k8s-app=kube-dns
Looks normal, I have two pods.
NAME                                         READY   STATUS    RESTARTS   AGE
rke2-coredns-rke2-coredns-864fbd7785-5lmgs   1/1     Running   0          4m1s
rke2-coredns-rke2-coredns-864fbd7785-kv5zq   1/1     Running   0          6m26s

kubectl -n kube-system get svc -l k8s-app=kube-dns
NAME                        TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)         AGE
kube-dns-upstream           ClusterIP   10.43.71.75   <none>        53/UDP,53/TCP   51m
rke2-coredns-rke2-coredns   ClusterIP   10.43.0.10    <none>        53/UDP,53/TCP   50d
OK yes I set up nodelocal caching as part of the troubleshooting probably 51 minutes ago.

kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup kubernetes.default
Server:    10.43.0.10
Address 1: 10.43.0.10 rke2-coredns-rke2-coredns.kube-system.svc.cluster.local

nslookup: can't resolve 'kubernetes.default'
pod "busybox" deleted
pod default/busybox terminated (Error)

Methodically going thru the "CoreDNS specific" DNS troubleshooting steps:

kubectl -n kube-system logs -l k8s-app=kube-dns
.:53
[INFO] plugin/reload: Running configuration SHA512 = c18591e7950724fe7f26bd172b7e98b6d72581b4a8fc4e5fc4cfd08229eea58f4ad043c9fd3dbd1110a11499c4aa3164cdd63ca0dd5ee59651d61756c4f671b7
CoreDNS-1.11.1
linux/amd64, go1.20.14 X:boringcrypto, ae2bbc29
.:53
[INFO] plugin/reload: Running configuration SHA512 = c18591e7950724fe7f26bd172b7e98b6d72581b4a8fc4e5fc4cfd08229eea58f4ad043c9fd3dbd1110a11499c4aa3164cdd63ca0dd5ee59651d61756c4f671b7
CoreDNS-1.11.1
linux/amd64, go1.20.14 X:boringcrypto, ae2bbc29

kubectl -n kube-system get configmap coredns -o go-template={{.data.Corefile}}
Error from server (NotFound): configmaps "coredns" not found

That could be a problem?

I checked the node-local-dns configmap and that looks reasonable:

kubectl -n kube-system get configmap node-local-dns -o go-template={{.data.Corefile}}

(It would be a long cut and paste, but it seems to forward to 10.43.0.10, which admittedly doesn't work.)

Ah, I see that in the installed helm app for rke2-coredns the configmap is actually named rke2-coredns-rke2-coredns. OK.

kubectl -n kube-system get configmap rke2-coredns-rke2-coredns -o go-template={{.data.Corefile}}
.:53 {
    errors 
    health  {
        lameduck 5s
    }
    ready 
    kubernetes   cluster.local  cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
        ttl 30
    }
    prometheus   0.0.0.0:9153
    forward   . /etc/resolv.conf
    cache   30
    loop 
    reload 
    loadbalance 
}

This seems reasonable?

Docs suggest checking the upstream nameservers:

kubectl run -i --restart=Never --rm test-${RANDOM} --image=ubuntu --overrides='{"kind":"Pod", "apiVersion":"v1", "spec": {"dnsPolicy":"Default"}}' -- sh -c 'cat /etc/resolv.conf'

This matches the configuration successfully used by 78 hosts configured by Ansible, looks good.

Not feeling confident about enabling query logging as I'm not entirely sure how to shut it off after I enable it, at least on RKE2.
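
(For the record, if I ever try it: my understanding is that query logging on CoreDNS is just the log plugin added to the server block of the Corefile, and since the reload plugin is already enabled in the config above, deleting the line again should turn it back off after the reload interval. The caveat is that a hand-edited chart-managed configmap might get put back by helm on the next sync. Something like:)

.:53 {
    errors
    log
    ... rest of the existing Corefile unchanged ...
}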

Well, I seem stuck. Let's take a step back, restate the problem, and think about the above.

Summary of the problem.

From inside a NodeRED container (or any other):

Can't install packages from registry.npmjs.org because of no DNS resolution.

ping www.google.com fails to resolve
ping www.google.com. works
nslookup www.google.com works

nslookup kubernetes.default fails
ping kubernetes.default works and resolves to kubernetes.default.svc.cluster.local
ping kubernetes.default.svc.cluster.local works
nslookup kubernetes.default.svc.cluster.local works

OK, let's pretend to be a K8S pod on a VM and try all the search paths.

From a VM, let's try the google.com.cedar.mulhollon.com search path. That zone is served by my Active Directory domain controller, and it returns NXDOMAIN. OK, fine.

Then try google.com.mulhollon.com. That domain is hosted by Google, and it returns a valid NOERROR but with no answer records. I think this is the problem. The DNS search protocol stops at the first successful response it gets, so searching whatever.mulhollon.com blackholes every query that reaches that suffix: the resolver sees NOERROR, gives up, and never tries the bare name. I need to remove mulhollon.com from the search path. OK then.
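
Roughly what those two probes look like with dig, trimmed to the header status:

dig google.com.cedar.mulhollon.com +noall +comments
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, ...   (resolver moves on to the next suffix)

dig google.com.mulhollon.com +noall +comments
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, ...    (zero answers, but the resolver treats this as final)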

I can now replicate the problem in a VM.

Oof, I can now replicate the solution in a K8S pod. I don't have root in my NodeRED containers and can't edit /etc/resolv.conf there, but I found a container I can log into as root and mess with config files. With mulhollon.com (hosted at Google) in the search path, if I try to ping www.google.com I get "bad address", because Google's domain hosting blackholes missing A records, weird but true. If I edit /etc/resolv.conf and remove mulhollon.com from the search path, SUCCESS! I can now resolve and ping www.google.com. In fact I can ping registry.npmjs.org, which implies I could probably install from it (although this isn't a nodejs container).
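
For the record, the whole fix is dropping the last suffix from the search line (shown here as it looks in the NodeRED pod from earlier; the test container's first entry is its own namespace instead):

before: search nodered.svc.cluster.local svc.cluster.local cluster.local cedar.mulhollon.com mulhollon.com
after:  search nodered.svc.cluster.local svc.cluster.local cluster.local cedar.mulhollon.com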

In the old days I had everything in the domain mulhollon.com, then I gradually rolled everything internal into the Active Directory-hosted cedar.mulhollon.com, and now I have nothing but external internet services on mulhollon.com. In the interim, while I was setting up AD, I needed both domains in my DNS search path, but I don't think I need that any longer.

Some quality Ansible time ensued while the entire LAN had its DNS search path "adjusted". I changed the search path in resolv.conf and had Ansible apply it to the entire RKE2 cluster in a couple of minutes. I verified the changes, then did a "kubectl rollout restart deployment -n nodered", which wiped and recreated the NodeRED farm (without deleting the PVs or PVCs; K8S is cool). Connecting to the shell of a random container, the new /etc/resolv.conf was inherited "live" with no RKE2 reboot or other restart required. ping www.google.com works now that the DNS blackhole is no longer in the search path.
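
The actual role lives in my Ansible tree, but the shape of the change is roughly this (hypothetical group and task names, and assuming /etc/resolv.conf is a plain file on these hosts rather than managed by systemd-resolved or netplan):

- name: Drop mulhollon.com from the DNS search path
  hosts: rke2_cluster
  become: true
  tasks:
    - name: Rewrite the search line in resolv.conf
      ansible.builtin.lineinfile:
        path: /etc/resolv.conf
        regexp: '^search '
        line: 'search cedar.mulhollon.com'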

So that was fun.