- Related issues: #2243 [BUG] rancher-monitoring is unusable when hosting NODE is (accidently) down
Category:
- Monitoring
Verification Steps
- Install a two-node Harvester cluster
- Check the initial state of the two-node Harvester cluster
harv-node1-0719:~ # kubectl get nodes
NAME STATUS ROLES AGE VERSION
harv-node1-0719 Ready control-plane,etcd,master 36m v1.21.11+rke2r1
harv-node2-0719 Ready <none>
harv-node1-0719:~ # kubectl get pods -A | grep monitoring
cattle-monitoring-system prometheus-rancher-monitoring-prometheus-0 3/3 Running 0 33m
cattle-monitoring-system rancher-monitoring-grafana-d9c56d79b-ckbjc 3/3 Running 0 33m
harv-node1-0719:~ # kubectl get pods prometheus-rancher-monitoring-prometheus-0 -n cattle-monitoring-system -o yaml | grep nodeName
nodeName: harv-node1-0719
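The `grep nodeName` check can also be written as a JSONPath query that prints only the node the pod is scheduled on. This is a minimal sketch, assuming `kubectl` is already configured on the node; the helper name `pod_node` is hypothetical and not part of Harvester or kubectl.

```shell
# pod_node is a hypothetical helper name used only for illustration.
# Prints the name of the node that a pod is scheduled on.
# Usage: pod_node <pod-name> [namespace]
# The namespace defaults to cattle-monitoring-system.
pod_node() {
  kubectl get pod "$1" -n "${2:-cattle-monitoring-system}" \
    -o jsonpath='{.spec.nodeName}'
}

# Example (requires cluster access):
#   pod_node prometheus-rancher-monitoring-prometheus-0
```

Running the same call before and after the reboot makes it easy to see whether the Prometheus pod has moved to another node.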
- Power off both nodes
- Power on the control-plane node (node 1) first; after 10 seconds, power on the compute node (node 2)
- After the reboot, the prometheus-rancher-monitoring-prometheus-0 pod has moved to node 2
harv-node1-0719:~ # kubectl get pods prometheus-rancher-monitoring-prometheus-0 -n cattle-monitoring-system -o yaml | grep nodeName
nodeName: harv-node2-0719
- Power off the compute node; rancher-monitoring becomes unusable
harv-node1-0719:~ # kubectl get pods -n cattle-monitoring-system
NAME READY STATUS RESTARTS AGE
prometheus-rancher-monitoring-prometheus-0 0/3 Init:0/1 0 4s
rancher-monitoring-grafana-d9c56d79b-cgkwx 3/3 Terminating 0 25m
rancher-monitoring-grafana-d9c56d79b-db4w8 0/3 Init:0/2 0 12m
rancher-monitoring-kube-state-metrics-5bc8bb48bd-gc6xf 1/1 Running 1 81m
rancher-monitoring-operator-559767d69b-xffrj 1/1 Running 1 81m
rancher-monitoring-prometheus-adapter-8846d4757-w6xds 1/1 Running 1 81m
rancher-monitoring-prometheus-node-exporter-nrx75 1/1 Running 1 69m
rancher-monitoring-prometheus-node-exporter-zn8s2 1/1 Running 1 81m
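To pick out the affected pods from a listing like this one, the STATUS column can be filtered. The sketch below assumes the default five-column layout of `kubectl get pods` (NAME, READY, STATUS, RESTARTS, AGE); the helper name `not_running_pods` is hypothetical.

```shell
# not_running_pods is a hypothetical helper name used only for illustration.
# Reads `kubectl get pods --no-headers` output on stdin and prints the names
# of pods whose STATUS column (third field) is anything other than "Running".
not_running_pods() {
  awk '$3 != "Running" { print $1 }'
}

# Example (requires cluster access):
#   kubectl get pods -n cattle-monitoring-system --no-headers | not_running_pods
```

This only looks at the STATUS column, so a pod that reports Running while its containers are not all ready would not be listed.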
- Check the recovery workaround provided in the monitoring document:
- Monitoring can be recovered by using CLI commands to force-delete the related pods; the cluster then deploys new pods to replace them.
Delete each non-running pod in the cattle-monitoring-system namespace.
$ kubectl delete pod --force -n cattle-monitoring-system prometheus-rancher-monitoring-prometheus-0
harv-node1-0719:~ # kubectl delete pod --force -n cattle-monitoring-system prometheus-rancher-monitoring-prometheus-0
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "prometheus-rancher-monitoring-prometheus-0" force deleted
- Delete all cattle-monitoring-system related pods
$ kubectl delete pod --force -n cattle-monitoring-system rancher-monitoring-grafana-d9c56d79b-db4w8
$ kubectl delete pod --force -n cattle-monitoring-system rancher-monitoring-kube-state-metrics-5bc8bb48bd-gc6xf
$ kubectl delete pod --force -n cattle-monitoring-system rancher-monitoring-operator-559767d69b-xffrj
$ kubectl delete pod --force -n cattle-monitoring-system rancher-monitoring-prometheus-adapter-8846d4757-w6xds
$ kubectl delete pod --force -n cattle-monitoring-system rancher-monitoring-prometheus-node-exporter-nrx75
$ kubectl delete pod --force -n cattle-monitoring-system rancher-monitoring-prometheus-node-exporter-zn8s2
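Because the pod name suffixes differ on every cluster, the per-pod delete commands can be driven from a list of names instead of being typed by hand. This is a rough sketch rather than the documented workaround itself; the helper name `force_delete_pods` is hypothetical, and it runs the same `kubectl delete pod --force` command once per name.

```shell
# force_delete_pods is a hypothetical helper name used only for illustration.
# Reads pod names from stdin, one per line, and force-deletes each of them.
# Usage: <command printing pod names> | force_delete_pods [namespace]
# The namespace defaults to cattle-monitoring-system.
force_delete_pods() {
  ns="${1:-cattle-monitoring-system}"
  while read -r pod; do
    # Skip empty lines so kubectl is never called without a pod name.
    [ -n "$pod" ] || continue
    # Redirect stdin so kubectl cannot consume the remaining pod names.
    kubectl delete pod --force -n "$ns" "$pod" </dev/null
  done
}

# Example (requires cluster access): delete every pod in the namespace.
#   kubectl get pods -n cattle-monitoring-system --no-headers \
#     | awk '{ print $1 }' | force_delete_pods
```

As the kubectl warning says, a forced deletion does not wait for confirmation that the resource has terminated, so this is best kept to the recovery scenario described here.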
- Check that rancher-monitoring is running
harv-node1-0719:~ # kubectl get pods -n cattle-monitoring-system
NAME READY STATUS RESTARTS AGE
prometheus-rancher-monitoring-prometheus-0 3/3 Running 0 118s
rancher-monitoring-grafana-d9c56d79b-zfh6r 0/3 Init:0/2 0 82s
rancher-monitoring-kube-state-metrics-5bc8bb48bd-lmq7l 1/1 Running 0 65s
rancher-monitoring-operator-559767d69b-94vnt 1/1 Running 0 54s
rancher-monitoring-prometheus-adapter-8846d4757-6v24v 1/1 Running 0 45s
rancher-monitoring-prometheus-node-exporter-j6v5j 0/1 Pending 0 32s
rancher-monitoring-prometheus-node-exporter-xhhk4 1/1 Running 0 23s
- Wait for several minutes; prometheus-rancher-monitoring-prometheus-0 and rancher-monitoring-grafana are recreated and Running
harv-node1-0719:~ # kubectl get pods -n cattle-monitoring-system
NAME READY STATUS RESTARTS AGE
prometheus-rancher-monitoring-prometheus-0 3/3 Running 0 8m20s
rancher-monitoring-grafana-d9c56d79b-zfh6r 3/3 Running 0 7m44s
rancher-monitoring-kube-state-metrics-5bc8bb48bd-lmq7l 1/1 Running 0 7m27s
rancher-monitoring-operator-559767d69b-94vnt 1/1 Running 0 7m16s
rancher-monitoring-prometheus-adapter-8846d4757-6v24v 1/1 Running 0 7m7s
rancher-monitoring-prometheus-node-exporter-j6v5j 0/1 Pending 0 6m54s
rancher-monitoring-prometheus-node-exporter-xhhk4 1/1 Running 0 6m45s
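Instead of re-running the listing by hand, the wait can be scripted by checking the READY and STATUS columns for a given pod name prefix. The sketch below assumes the default column layout of `kubectl get pods`; the helper name `pod_is_ready` is hypothetical. It checks by prefix rather than the whole namespace because one node-exporter pod stays Pending while the compute node is powered off.

```shell
# pod_is_ready is a hypothetical helper name used only for illustration.
# Reads `kubectl get pods --no-headers` output on stdin. Succeeds only if at
# least one pod name starts with the given prefix and every such pod is
# Running with all of its containers ready (READY column of the form n/n).
# Usage: <pod listing> | pod_is_ready <name-prefix>
pod_is_ready() {
  awk -v prefix="$1" '
    index($1, prefix) == 1 {
      seen = 1
      split($2, ready, "/")
      if ($3 != "Running" || ready[1] != ready[2]) bad = 1
    }
    END { exit (seen && !bad) ? 0 : 1 }
  '
}

# Example (requires cluster access): poll every 10 seconds until Grafana is up.
#   until kubectl get pods -n cattle-monitoring-system --no-headers \
#       | pod_is_ready rancher-monitoring-grafana; do
#     sleep 10
#   done
```

The same check with the prefix `prometheus-rancher-monitoring-prometheus` covers the Prometheus pod.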
- Check that the rancher-monitoring chart is displayed on the dashboard page
Expected Results
- This issue exists on the latest master release
- The workaround steps provided in the monitoring document, deleting all the cattle-monitoring-system related pods, can be used
- The rancher-monitoring chart can be recovered when the hosting node is accidentally down
- The rancher-monitoring pods return to the Running state
NAME READY STATUS RESTARTS AGE
prometheus-rancher-monitoring-prometheus-0 3/3 Running 0 8m20s
rancher-monitoring-grafana-d9c56d79b-zfh6r 3/3 Running 0 7m44s