Background:

While operating a Kubernetes cluster, the etcd cluster backing it hit a performance bottleneck, so the full etcd dataset had to be migrated to a new etcd cluster. There are several ways to migrate etcd; given the network latency and stability between the new and old clusters, we chose a full offline migration, and in practice the downtime can be kept under 5 minutes. With stable bandwidth and low latency, you could instead migrate without downtime by adding the new nodes to the cluster one at a time and then removing the old nodes.

Note: in this deployment the Kubernetes master consists of 3 nodes. In the final step, the etcd endpoint addresses used by every apiserver must be updated.

1. Create the new certificates and copy them to the new etcd nodes

# The certificates here are generated with cfssl.
# ca-config.json is the signing configuration. It defines several "profiles",
# each with its own key usages and validity period; the "myprofile01" profile is used below.
# cat ca-config.json
{
    "signing":{
        "default":{
            "expiry":"876000h"
        },
        "profiles":{
            "myprofile01":{
                "usages":[
                    "signing",
                    "key encipherment",
                    "server auth",
                    "client auth"
                ],
                "expiry":"87600h"
            },
            "3years":{
                "usages":[
                    "signing",
                    "key encipherment",
                    "server auth",
                    "client auth"
                ],
                "expiry":"26280h"
            },
            "5years":{
                "usages":[
                    "signing",
                    "key encipherment",
                    "server auth",
                    "client auth"
                ],
                "expiry":"43800h"
            },
            "kubernetes":{
                "usages":[
                    "signing",
                    "key encipherment",
                    "server auth",
                    "client auth"
                ],
                "expiry":"876000h"
            }
        }
    }
}

# Trusted addresses in the etcd certificate:
# The hosts list in etcd-csr.json includes the etcd addresses of both the old and
# the new cluster; the new cluster uses 10.199.159.{36..38}.
# cat etcd-csr.json

{
  "CN": "etcd",
  "hosts": [
    "127.0.0.1",
    "10.199.136.21",
    "10.199.136.22",
    "10.199.136.23",
    "10.199.159.36",
    "10.199.159.37",
    "10.199.159.38"
  ],
  "key": {
    "algo": "rsa",
    "size": 2048
  },
  "names": [
    {
      "C": "CN",
      "ST": "ZheJiang",
      "L": "HangZhou",
      "O": "k8s",
      "OU": "System"
    }
  ]
}

# Generate the etcd certificate covering the new addresses with cfssl:
# cfssl gencert -ca=/root/kubedeploy/pki/ca.pem \
  -ca-key=/root/kubedeploy/pki/ca-key.pem \
  -config=/root/kubedeploy/pki/ca-config.json \
  -profile=myprofile01  etcd-csr.json | cfssljson -bare etcd
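The section title also calls for copying the new certificates to the etcd nodes. A minimal sketch, assuming root SSH access and the /etc/etcd/ssl/ path referenced by the systemd units in the next section:

```shell
# Optional sanity check: confirm the new SANs made it into the certificate.
openssl x509 -in etcd.pem -noout -text | grep -A1 'Subject Alternative Name'

# Distribute the certificate pair to the new etcd nodes
# (destination matches the --cert-file/--key-file paths used by the etcd units).
for node in 10.199.159.36 10.199.159.37 10.199.159.38; do
  ssh root@${node} "mkdir -p /etc/etcd/ssl"
  scp etcd.pem etcd-key.pem root@${node}:/etc/etcd/ssl/
done
```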

2. Configure and bring up the new etcd cluster

(Note: the cluster here is managed and started via systemd. The approach is similar if etcd is managed with Docker or run inside Kubernetes.)

Service configuration for the 3 nodes of the new etcd cluster:

--data-dir is etcd's data directory. When --initial-cluster-state is new, etcd bootstraps a fresh cluster and initializes this directory; existing means the node is joining a cluster that already exists, so no initialization is performed. Note that even with new, if etcd restarts and finds data already present in the directory, it will not re-initialize or wipe that data.

etcd01

# systemctl cat etcd.service

# /etc/systemd/system/etcd.service

[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target
Documentation=https://github.com/coreos

[Service]
Type=notify
WorkingDirectory=/var/lib/etcd/
ExecStart=/usr/k8s/bin/etcd \
  --name=etcd01 \
  --cert-file=/etc/etcd/ssl/etcd.pem \
  --key-file=/etc/etcd/ssl/etcd-key.pem \
  --peer-cert-file=/etc/etcd/ssl/etcd.pem \
  --peer-key-file=/etc/etcd/ssl/etcd-key.pem \
  --trusted-ca-file=/etc/kubernetes/ssl/ca.pem \
  --peer-trusted-ca-file=/etc/kubernetes/ssl/ca.pem \
  --initial-advertise-peer-urls=https://10.199.159.36:2380 \
  --listen-peer-urls=https://10.199.159.36:2380 \
  --listen-client-urls=https://10.199.159.36:2379,http://127.0.0.1:2379 \
  --advertise-client-urls=https://10.199.159.36:2379 \
  --initial-cluster-token=etcd-cluster-0 \
  --initial-cluster=etcd01=https://10.199.159.36:2380,etcd02=https://10.199.159.37:2380,etcd03=https://10.199.159.38:2380 \
  --initial-cluster-state=new \
  --quota-backend-bytes=137438953472 \
  --auto-compaction-retention=15 \
  --enable-pprof=true \
  --data-dir=/var/lib/etcd \
  --debug=false \
  --v=4
Restart=on-failure
RestartSec=5
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

etcd02

# systemctl cat etcd

# /etc/systemd/system/etcd.service

[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target
Documentation=https://github.com/coreos

[Service]
Type=notify
WorkingDirectory=/var/lib/etcd/
ExecStart=/usr/k8s/bin/etcd \
  --name=etcd02 \
  --cert-file=/etc/etcd/ssl/etcd.pem \
  --key-file=/etc/etcd/ssl/etcd-key.pem \
  --peer-cert-file=/etc/etcd/ssl/etcd.pem \
  --peer-key-file=/etc/etcd/ssl/etcd-key.pem \
  --trusted-ca-file=/etc/kubernetes/ssl/ca.pem \
  --peer-trusted-ca-file=/etc/kubernetes/ssl/ca.pem \
  --initial-advertise-peer-urls=https://10.199.159.37:2380 \
  --listen-peer-urls=https://10.199.159.37:2380 \
  --listen-client-urls=https://10.199.159.37:2379,http://127.0.0.1:2379 \
  --advertise-client-urls=https://10.199.159.37:2379 \
  --initial-cluster-token=etcd-cluster-0 \
  --initial-cluster=etcd01=https://10.199.159.36:2380,etcd02=https://10.199.159.37:2380,etcd03=https://10.199.159.38:2380 \
  --initial-cluster-state=new \
  --quota-backend-bytes=137438953472 \
  --auto-compaction-retention=15 \
  --enable-pprof=true \
  --data-dir=/var/lib/etcd \
  --debug=false \
  --v=4
Restart=on-failure
RestartSec=5
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

etcd03

# systemctl cat etcd

# /etc/systemd/system/etcd.service

[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target
Documentation=https://github.com/coreos

[Service]
Type=notify
WorkingDirectory=/var/lib/etcd/
ExecStart=/usr/k8s/bin/etcd \
  --name=etcd03 \
  --cert-file=/etc/etcd/ssl/etcd.pem \
  --key-file=/etc/etcd/ssl/etcd-key.pem \
  --peer-cert-file=/etc/etcd/ssl/etcd.pem \
  --peer-key-file=/etc/etcd/ssl/etcd-key.pem \
  --trusted-ca-file=/etc/kubernetes/ssl/ca.pem \
  --peer-trusted-ca-file=/etc/kubernetes/ssl/ca.pem \
  --initial-advertise-peer-urls=https://10.199.159.38:2380 \
  --listen-peer-urls=https://10.199.159.38:2380 \
  --listen-client-urls=https://10.199.159.38:2379,http://127.0.0.1:2379 \
  --advertise-client-urls=https://10.199.159.38:2379 \
  --initial-cluster-token=etcd-cluster-0 \
  --initial-cluster=etcd01=https://10.199.159.36:2380,etcd02=https://10.199.159.37:2380,etcd03=https://10.199.159.38:2380 \
  --initial-cluster-state=new \
  --quota-backend-bytes=137438953472 \
  --auto-compaction-retention=15 \
  --enable-pprof=true \
  --data-dir=/var/lib/etcd \
  --debug=false \
  --v=4
Restart=on-failure
RestartSec=5
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

Suggestion: before the migration, initialize the new etcd cluster once, confirm that the members join correctly and the certificates work, and run a benchmark against it to establish a performance baseline.

etcd ships with a check perf tool:

# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/kubernetes.pem --key=/etc/kubernetes/ssl/kubernetes-key.pem --endpoints=https://10.199.159.36:2379,https://10.199.159.37:2379,https://10.199.159.38:2379 check perf --load="l" --auto-compact=true --auto-defrag=true

3. Back up the data of the original etcd cluster

To guarantee data consistency, the Kubernetes apiserver must be stopped and kept out of service while etcd is being backed up and restored.

# systemctl stop kube-apiserver.service
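Since this deployment has 3 master nodes, the apiserver must be stopped on every one of them. A minimal sketch; the master addresses below are placeholders for your actual hosts:

```shell
# Hypothetical master addresses -- replace with your real ones.
MASTERS="10.199.136.21 10.199.136.22 10.199.136.23"

for m in ${MASTERS}; do
  # Stop the apiserver on each master and report its resulting state.
  ssh root@${m} "systemctl stop kube-apiserver.service; systemctl is-active kube-apiserver.service"
done
```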

3.1 Take a snapshot of the original etcd cluster for the later restore.

The backup is taken with a script:

# cat etcd_backup.sh

#!/bin/bash
mkdir -p /root/kubedeploy/etcd_backup/
date;
CACERT="/etc/kubernetes/ssl/ca.pem"
CERT="/etc/kubernetes/ssl/kubernetes.pem"
KEY="/etc/kubernetes/ssl/kubernetes-key.pem"
ENDPOINTS="10.199.136.21:2379"

ETCDCTL_API=3 etcdctl \
--cacert="${CACERT}" --cert="${CERT}" --key="${KEY}" \
--endpoints=${ENDPOINTS} \
snapshot save /root/kubedeploy/etcd_backup/etcd-snapshot-$(date +%Y%m%d).db

# Keep backups for 30 days
# find ./ -name "*.db" -mtime +30 -exec rm -f {} \;

Copy /root/kubedeploy/etcd_backup/etcd-snapshot-20210719.db to the /root/etcd_backup/ directory on each node of the new etcd cluster.
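Before restoring, it is worth verifying the copied snapshot on each new node; `etcdctl snapshot status` prints the snapshot's hash, revision, key count, and size, so a corrupted copy is caught early:

```shell
# Verify the copied snapshot file before restoring from it.
ETCDCTL_API=3 etcdctl snapshot status \
  /root/etcd_backup/etcd-snapshot-20210719.db --write-out=table
```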

3.2 Restore the data

If the restore reports: Error: data-dir "/var/lib/etcd" exists, delete that directory and everything under it first: "rm -rf /var/lib/etcd".

# Run on etcd01:

# ETCDCTL_API=3 etcdctl snapshot restore /root/etcd_backup/etcd-snapshot-20210719.db \
  --name etcd01 \
  --initial-cluster "etcd01=https://10.199.159.36:2380,etcd02=https://10.199.159.37:2380,etcd03=https://10.199.159.38:2380" \
  --initial-cluster-token etcd-cluster-0 \
  --initial-advertise-peer-urls https://10.199.159.36:2380 \
  --data-dir=/var/lib/etcd

# Run on etcd02:

# ETCDCTL_API=3 etcdctl snapshot restore /root/etcd_backup/etcd-snapshot-20210719.db \
  --name etcd02 \
  --initial-cluster "etcd01=https://10.199.159.36:2380,etcd02=https://10.199.159.37:2380,etcd03=https://10.199.159.38:2380" \
  --initial-cluster-token etcd-cluster-0 \
  --initial-advertise-peer-urls https://10.199.159.37:2380 \
  --data-dir=/var/lib/etcd

# Run on etcd03:

# ETCDCTL_API=3 etcdctl snapshot restore /root/etcd_backup/etcd-snapshot-20210719.db \
  --name etcd03 \
  --initial-cluster "etcd01=https://10.199.159.36:2380,etcd02=https://10.199.159.37:2380,etcd03=https://10.199.159.38:2380" \
  --initial-cluster-token etcd-cluster-0 \
  --initial-advertise-peer-urls https://10.199.159.38:2380 \
  --data-dir=/var/lib/etcd

Notes on the restore parameters:

  • 1) --name must match the member name configured in the systemd-managed etcd service, otherwise etcd will fail to start.
  • 2) --initial-cluster-token is the cluster communication token; it must be identical on all nodes.
  • 3) The --data-dir directory must not exist yet; delete it first if it does.
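A quick post-restore sanity check, using the same data-dir as the restore commands above: each node should now have a freshly created member directory containing the bbolt database and the WAL.

```shell
# snapshot restore writes the member data under <data-dir>/member/.
ls -l  /var/lib/etcd/member/snap/db
ls -ld /var/lib/etcd/member/wal
```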

4. Final steps

4.1 Start etcd on all etcd nodes

# systemctl start etcd && systemctl enable etcd && systemctl status etcd

4.2 Common etcd cluster management commands

# Common commands to inspect members and cluster status:
# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/kubernetes.pem --key=/etc/kubernetes/ssl/kubernetes-key.pem --endpoints=https://10.199.159.36:2379,https://10.199.159.37:2379,https://10.199.159.38:2379 member list --write-out=table
+------------------+---------+--------+----------------------------+----------------------------+------------+
|        ID        | STATUS  |  NAME  |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |
+------------------+---------+--------+----------------------------+----------------------------+------------+
|  f88d448e745d642 | started | etcd03 | https://10.199.159.38:2380 | https://10.199.159.38:2379 |      false |
| 80e78076af5885ea | started | etcd02 | https://10.199.159.37:2380 | https://10.199.159.37:2379 |      false |
| bcfccc6c28807a2a | started | etcd01 | https://10.199.159.36:2380 | https://10.199.159.36:2379 |      false |
+------------------+---------+--------+----------------------------+----------------------------+------------+

# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/kubernetes.pem --key=/etc/kubernetes/ssl/kubernetes-key.pem --endpoints=https://10.199.159.36:2379,https://10.199.159.37:2379,https://10.199.159.38:2379 endpoint status --write-out=table
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.199.159.36:2379 | bcfccc6c28807a2a |  3.4.14 |  339 MB |      true |      false |        10 |    1866269 |            1866269 |        |
| https://10.199.159.37:2379 | 80e78076af5885ea |  3.4.14 |  339 MB |     false |      false |        10 |    1866269 |            1866269 |        |
| https://10.199.159.38:2379 |  f88d448e745d642 |  3.4.14 |  339 MB |     false |      false |        10 |    1866269 |            1866269 |        |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/kubernetes.pem --key=/etc/kubernetes/ssl/kubernetes-key.pem --endpoints=https://10.199.159.36:2379,https://10.199.159.37:2379,https://10.199.159.38:2379 endpoint health --write-out=table
+----------------------------+--------+-------------+-------+
|          ENDPOINT          | HEALTH |    TOOK     | ERROR |
+----------------------------+--------+-------------+-------+
| https://10.199.159.36:2379 |   true | 11.811307ms |       |
| https://10.199.159.37:2379 |   true | 11.771516ms |       |
| https://10.199.159.38:2379 |   true | 12.811964ms |       |
+----------------------------+--------+-------------+-------+

# List all keys
# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/kubernetes.pem --key=/etc/kubernetes/ssl/kubernetes-key.pem --endpoints=https://10.199.159.36:2379,https://10.199.159.37:2379,https://10.199.159.38:2379 get "" --prefix=true --keys-only |head
/calico/ipam/v2/assignment/ipv4/block/172.30.10.192-26
/calico/ipam/v2/assignment/ipv4/block/172.30.11.0-26
/calico/ipam/v2/assignment/ipv4/block/172.30.12.128-26
/calico/ipam/v2/assignment/ipv4/block/172.30.12.192-26
/calico/ipam/v2/assignment/ipv4/block/172.30.124.128-26

# Dump all keys with their values
# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/kubernetes.pem --key=/etc/kubernetes/ssl/kubernetes-key.pem --endpoints=https://10.199.159.36:2379,https://10.199.159.37:2379,https://10.199.159.38:2379 get "" --prefix=true

# List all keys under /registry/pods/
# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/kubernetes.pem --key=/etc/kubernetes/ssl/kubernetes-key.pem --endpoints=https://10.199.159.36:2379,https://10.199.159.37:2379,https://10.199.159.38:2379 get /registry/pods/ --prefix --keys-only

# Dump the key/value pairs under /registry/pods/
# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/kubernetes.pem --key=/etc/kubernetes/ssl/kubernetes-key.pem --endpoints=https://10.199.159.36:2379,https://10.199.159.37:2379,https://10.199.159.38:2379 get /registry/pods/ --prefix

# Compare the KV-store hash of each endpoint; the values must match for the cluster to be consistent
# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/kubernetes.pem --key=/etc/kubernetes/ssl/kubernetes-key.pem --endpoints=https://10.199.159.36:2379,https://10.199.159.37:2379,https://10.199.159.38:2379 endpoint hashkv
https://10.199.159.36:2379, 224322237
https://10.199.159.37:2379, 224322237
https://10.199.159.38:2379, 224322237

# Check whether the cluster has unhealthy members or raised alarms
# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/kubernetes.pem --key=/etc/kubernetes/ssl/kubernetes-key.pem --endpoints=https://10.199.159.36:2379,https://10.199.159.37:2379,https://10.199.159.38:2379 alarm list
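If alarm list reports an alarm (e.g. NOSPACE), one way to handle it is to defragment the members to reclaim backend space and then clear the alarm. This is a sketch, not part of the original runbook; note that defrag briefly blocks the member it runs against, so it is done one endpoint at a time:

```shell
CERTS="--cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/kubernetes.pem --key=/etc/kubernetes/ssl/kubernetes-key.pem"

# Defragment each member in turn to reclaim space in the backend database.
for ep in https://10.199.159.36:2379 https://10.199.159.37:2379 https://10.199.159.38:2379; do
  ETCDCTL_API=3 etcdctl ${CERTS} --endpoints=${ep} defrag
done

# Once the underlying condition is fixed, clear the raised alarms.
ETCDCTL_API=3 etcdctl ${CERTS} --endpoints=https://10.199.159.36:2379 alarm disarm
```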

4.3 Update the etcd endpoints the apiserver connects to, then restart the apiserver.

# cat /etc/systemd/system/kube-apiserver.service

... (snip)
  --etcd-cafile=/etc/kubernetes/ssl/ca.pem \
  --etcd-certfile=/etc/kubernetes/ssl/etcd.pem \
  --etcd-keyfile=/etc/kubernetes/ssl/etcd-key.pem \
  --etcd-servers=https://10.199.159.36:2379,https://10.199.159.37:2379,https://10.199.159.38:2379 \
... (snip)
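After the unit file has been updated on each of the 3 masters, reload systemd, bring the apiserver back up, and confirm the control plane is serving the migrated data (a sketch; run on every master):

```shell
systemctl daemon-reload
systemctl start kube-apiserver.service
systemctl status kube-apiserver.service --no-pager

# The objects served from the new etcd cluster should match the pre-migration state.
kubectl get nodes
kubectl get pods -A | head
```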

References:

When in doubt, consult the official documentation: etcd disaster recovery