
Replace the first master node of the worker cluster

This page uses a highly available worker cluster with three master nodes as an example to explain how to replace or reintroduce the first master node after it fails or malfunctions.

The example cluster consists of the following master nodes:

  • node1 (172.30.41.161)
  • node2 (172.30.41.162)
  • node3 (172.30.41.163)

Assuming node1 is down, the following steps explain how to reintroduce the recovered node1 into the worker cluster.

Preparations

Before performing the replacement operation, first obtain basic information about the cluster resources; it will be needed when modifying the related configurations. (A short sketch after the list below shows how to capture these values in shell variables.)

Note

The following commands to obtain cluster resource information are executed in the management cluster.

  1. Get the cluster name

    Execute the following command to find the clusters.kubean.io resource corresponding to the cluster:

    # For example, if the resource name of clusters.kubean.io is cluster-mini-1
    # Get the name of the cluster
    CLUSTER_NAME=$(kubectl get clusters.kubean.io cluster-mini-1 -o=jsonpath="{.metadata.name}{'\n'}")
    
  2. Get the host list configmap of the cluster

    kubectl get clusters.kubean.io cluster-mini-1 -o=jsonpath="{.spec.hostsConfRef}{'\n'}"
    {"name":"mini-1-hosts-conf","namespace":"kubean-system"}
    
  3. Get the configuration parameters configmap of the cluster

    kubectl get clusters.kubean.io cluster-mini-1 -o=jsonpath="{.spec.varsConfRef}{'\n'}"
    {"name":"mini-1-vars-conf","namespace":"kubean-system"}
    
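For convenience, the values obtained above can be captured in shell variables before continuing. A minimal sketch, assuming the example resource names above; the variable names HOSTS_CONF_NAME and VARS_CONF_NAME are only illustrative:

    # Cluster name and the names of its host-list / vars ConfigMaps (illustrative variable names)
    CLUSTER_NAME=$(kubectl get clusters.kubean.io cluster-mini-1 -o=jsonpath="{.metadata.name}")
    HOSTS_CONF_NAME=$(kubectl get clusters.kubean.io cluster-mini-1 -o=jsonpath="{.spec.hostsConfRef.name}")
    VARS_CONF_NAME=$(kubectl get clusters.kubean.io cluster-mini-1 -o=jsonpath="{.spec.varsConfRef.name}")
    echo "${CLUSTER_NAME} ${HOSTS_CONF_NAME} ${VARS_CONF_NAME}"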

Steps

  1. Adjust the order of control plane nodes

    Reset node1 to the state it was in before the cluster was installed (or use a new node), and make sure node1 remains reachable over the network.

    Adjust the position of node1 in the kube_control_plane, kube_node, and etcd groups of the host list (node1/node2/node3 -> node2/node3/node1):

    function change_control_plane_order() {
      cat << EOF | kubectl apply -f -
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: mini-1-hosts-conf
      namespace: kubean-system
    data:
      hosts.yml: |
        all:
          hosts:
            node1:
              ip: "172.30.41.161"
              access_ip: "172.30.41.161"
              ansible_host: "172.30.41.161"
              ansible_connection: ssh
              ansible_user: root
              ansible_password: dangerous
            node2:
              ip: "172.30.41.162"
              access_ip: "172.30.41.162"
              ansible_host: "172.30.41.162"
              ansible_connection: ssh
              ansible_user: root
              ansible_password: dangerous
            node3:
              ip: "172.30.41.163"
              access_ip: "172.30.41.163"
              ansible_host: "172.30.41.163"
              ansible_connection: ssh
              ansible_user: root
              ansible_password: dangerous
          children:
            kube_control_plane:
              hosts:
                node2:
                node3:
                node1:
            kube_node:
              hosts:
                node2:
                node3:
                node1:
            etcd:
              hosts:
                node2:
                node3:
                node1:
            k8s_cluster:
              children:
                kube_control_plane:
                kube_node:
            calico_rr:
              hosts: {}
    EOF
    }
    
    change_control_plane_order
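
    After applying the change, it may be worth confirming that the new group order has been written to the ConfigMap. A minimal check, assuming the host list ConfigMap name from the Preparations section:

    # Inspect the updated host list ConfigMap
    kubectl -n kubean-system get configmap mini-1-hosts-conf -o yaml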
    
  2. Remove the first master node in an abnormal state

    After adjusting the node order in the host list, remove node1, which is in an abnormal state, from the K8s control plane.

    Note

    If node1 is offline or malfunctioning, the following configuration items must be added to extraArgs; they are not needed when node1 is online.

    reset_nodes=false # Skip resetting node operation
    allow_ungraceful_removal=true # Allow ungraceful removal operation
    
    # The spray-job image; an accelerator (mirror) address can be used here
    
    SPRAY_IMG_ADDR="ghcr.m.daocloud.io/kubean-io/spray-job"
    SPRAY_RLS_2_22_TAG="2.22-336b323"
    KUBE_VERSION="v1.24.14"
    CLUSTER_NAME="cluster-mini-1"
    REMOVE_NODE_NAME="node1"
    
    cat << EOF | kubectl apply -f -
    ---
    apiVersion: kubean.io/v1alpha1
    kind: ClusterOperation
    metadata:
      name: cluster-mini-1-remove-node-ops
    spec:
      cluster: ${CLUSTER_NAME}
      image: ${SPRAY_IMG_ADDR}:${SPRAY_RLS_2_22_TAG}
      actionType: playbook
      action: remove-node.yml
      extraArgs: -e node=${REMOVE_NODE_NAME} -e reset_nodes=false -e allow_ungraceful_removal=true -e kube_version=${KUBE_VERSION}
      postHook:
        - actionType: playbook
          action: cluster-info.yml
    EOF
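
    The progress of the removal can be followed through the ClusterOperation resource created above; a minimal check, assuming the operation name used in this example:

    # Inspect the status of the remove-node operation
    kubectl get clusteroperations.kubean.io cluster-mini-1-remove-node-ops -o yaml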
    
  3. Manually modify the cluster configuration: edit and update cluster-info

    # Edit cluster-info
    kubectl -n kube-public edit cm cluster-info
    
    # 1. If the ca.crt certificate is updated, the content of the certificate-authority-data field needs to be updated
    # View the base64 encoding of the ca certificate:
    cat /etc/kubernetes/ssl/ca.crt | base64 | tr -d '\n'
    
    # 2. Change the IP address in the server field to the new first master IP;
    #    this document uses the IP address of node2, 172.30.41.162
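
    For reference, after editing, the relevant part of the cluster-info ConfigMap should look roughly like the sketch below (the certificate-authority-data value is only a placeholder, and the default API server port 6443 is assumed):

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cluster-info
      namespace: kube-public
    data:
      kubeconfig: |
        apiVersion: v1
        clusters:
        - cluster:
            certificate-authority-data: <base64 of /etc/kubernetes/ssl/ca.crt>
            server: https://172.30.41.162:6443
          name: ""
        ...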
    
  4. Manually modify the cluster configuration: edit and update kubeadm-config

    # Edit kubeadm-config
    kubectl -n kube-system edit cm kubeadm-config
    
    # Change controlPlaneEndpoint to the new first master IP;
    # this document uses the IP address of node2, 172.30.41.162
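
    After editing, the relevant part of the kubeadm-config ConfigMap should look roughly like the sketch below (assuming the default API server port 6443):

    data:
      ClusterConfiguration: |
        ...
        controlPlaneEndpoint: 172.30.41.162:6443
        ...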
    
  5. Scale up the master node and update the cluster

    Note

    • Use --limit to limit the update operation to only affect the etcd and kube_control_plane node groups.
  • In an offline environment, enable-repo.yml needs to be added to spec.preHook, and its extraArgs should specify the correct repo_list for the target OS.

    cat << EOF | kubectl apply -f -
    ---
    apiVersion: kubean.io/v1alpha1
    kind: ClusterOperation
    metadata:
      name: cluster-mini-1-update-cluster-ops
    spec:
      cluster: ${CLUSTER_NAME}
      image: ${SPRAY_IMG_ADDR}:${SPRAY_RLS_2_22_TAG}
      actionType: playbook
      action: cluster.yml
      extraArgs: --limit=etcd,kube_control_plane -e kube_version=${KUBE_VERSION}
      preHook:
        - actionType: playbook
          action: enable-repo.yml  # This yaml needs to be added in an offline environment,
                                   # and set the correct repo-list (install operating system packages),
                                   # the following parameter values are for reference only
          extraArgs: |
            -e "{repo_list: ['http://172.30.41.0:9000/kubean/centos/\$releasever/os/\$basearch','http://172.30.41.0:9000/kubean/centos-iso/\$releasever/os/\$basearch']}"
      postHook:
        - actionType: playbook
          action: cluster-info.yml
    EOF
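
    Once the operation finishes, it is worth verifying from one of the surviving masters (for example node2) that node1 has rejoined the control plane; a minimal check:

    # Run on a master node of the worker cluster
    kubectl get nodes -o wide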
    

This completes the replacement of the first master node.
