Server list:

| Node | IP | OS | Installed software | Notes |
|---|---|---|---|---|
| pg_node1 | 192.168.210.15 | CentOS 7.6 | PostgreSQL 13.3 / Patroni 2.0.2 / etcd 3.5.0 | initial primary |
| pg_node2 | 192.168.210.81 | CentOS 7.6 | PostgreSQL 13.3 / Patroni 2.0.2 / etcd 3.5.0 | initial standby |
| pg_node3 | 192.168.210.33 | CentOS 7.6 | PostgreSQL 13.3 / Patroni 2.0.2 / etcd 3.5.0 | initial standby |

Floating VIP for the primary: 192.168.210.66
Host configuration (all nodes)
Set the hostname
Set each node's hostname according to the table above.
# adjust accordingly on pg_node2 and pg_node3
hostnamectl set-hostname "pg_node1"
Disable SELinux
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
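# Switch to permissive mode immediately; the config change above takes full effect after the reboot below
setenforce 0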
Configure the firewall
The firewall must open the PostgreSQL, etcd, and patroni ports:
- postgres:5432
- patroni:8008
- etcd:2379/2380
firewall-cmd --add-port=5432/tcp --permanent
firewall-cmd --add-port=8008/tcp --permanent
firewall-cmd --add-port=2379/tcp --permanent
firewall-cmd --add-port=2380/tcp --permanent
firewall-cmd --reload
firewall-cmd --list-all
Set the timezone
timedatectl set-timezone Asia/Shanghai
Configure time synchronization
yum -y install chrony
sed -i '/^server/d' /etc/chrony.conf
echo 'server s1a.time.edu.cn iburst' >> /etc/chrony.conf
systemctl start chronyd
systemctl enable chronyd
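# Verify that chrony is syncing against the configured server
chronyc sources -v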
Install required packages
yum -y install gcc epel-release wget readline* zlib* bzip2 gcc-c++ openssl-devel python-pip python-psycopg2 python-devel lrzsz jq
Reboot the server
reboot
Create the installation user
groupadd -g 5432 postgres
useradd -u 5432 -g postgres postgres; echo 'test123456' | passwd -f --stdin postgres
Install etcd
wget https://github.com/coreos/etcd/releases/download/v3.5.0/etcd-v3.5.0-linux-amd64.tar.gz
tar -zxvf etcd-v3.5.0-linux-amd64.tar.gz -C /opt/
cd /opt
mv etcd-v3.5.0-linux-amd64 etcd-v3.5.0
mkdir /etc/etcd
chown -R postgres:postgres /opt/etcd-v3.5.0 /etc/etcd
su - postgres
# etcd configuration for pg_node1 (see the option reference below for what each key means)
cat > /etc/etcd/conf.yml << EOF
name: etcd-1
data-dir: /opt/etcd-v3.5.0/data
listen-client-urls: http://192.168.210.15:2379,http://127.0.0.1:2379
advertise-client-urls: http://192.168.210.15:2379,http://127.0.0.1:2379
listen-peer-urls: http://192.168.210.15:2380
initial-advertise-peer-urls: http://192.168.210.15:2380
initial-cluster: etcd-1=http://192.168.210.15:2380,etcd-2=http://192.168.210.81:2380,etcd-3=http://192.168.210.33:2380
initial-cluster-token: etcd-cluster-token
initial-cluster-state: new
EOF
# etcd configuration for pg_node2
cat > /etc/etcd/conf.yml << EOF
name: etcd-2
data-dir: /opt/etcd-v3.5.0/data
listen-client-urls: http://192.168.210.81:2379,http://127.0.0.1:2379
advertise-client-urls: http://192.168.210.81:2379,http://127.0.0.1:2379
listen-peer-urls: http://192.168.210.81:2380
initial-advertise-peer-urls: http://192.168.210.81:2380
initial-cluster: etcd-1=http://192.168.210.15:2380,etcd-2=http://192.168.210.81:2380,etcd-3=http://192.168.210.33:2380
initial-cluster-token: etcd-cluster-token
initial-cluster-state: new
EOF
# etcd configuration for pg_node3
cat > /etc/etcd/conf.yml << EOF
name: etcd-3
data-dir: /opt/etcd-v3.5.0/data
listen-client-urls: http://192.168.210.33:2379,http://127.0.0.1:2379
advertise-client-urls: http://192.168.210.33:2379,http://127.0.0.1:2379
listen-peer-urls: http://192.168.210.33:2380
initial-advertise-peer-urls: http://192.168.210.33:2380
initial-cluster: etcd-1=http://192.168.210.15:2380,etcd-2=http://192.168.210.81:2380,etcd-3=http://192.168.210.33:2380
initial-cluster-token: etcd-cluster-token
initial-cluster-state: new
EOF
# Add environment variables
echo 'export ETCDCTL_API=3' >> /etc/profile
echo 'export PATRONICTL_CONFIG_FILE=/etc/patroni/patroni.yml' >> /etc/profile
echo 'export PATH=/opt/postgresql/13/bin:/opt/etcd-v3.5.0:$PATH' >> /etc/profile
source /etc/profile
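# Confirm the new PATH took effect
etcdctl version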
# Configure the systemd startup unit (a minimal unit; adjust paths if your layout differs)
su - root
cat > /usr/lib/systemd/system/etcd.service << EOF
[Unit]
Description=etcd key-value store
After=network.target

[Service]
Type=notify
User=postgres
ExecStart=/opt/etcd-v3.5.0/etcd --config-file /etc/etcd/conf.yml
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable etcd
systemctl start etcd
# Check cluster member status
[root@pg_node1 data]# etcdctl endpoint status --endpoints='192.168.210.15:2379,192.168.210.81:2379,192.168.210.33:2379' -w table
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| endpoint | id | version | db size | is leader | is learner | raft term | raft index | raft applied index | errors |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 192.168.210.15:2379 | 2baa1a77ec379977 | 3.5.0 | 1.0 mb | false | false | 15 | 4863 | 4863 | |
| 192.168.210.81:2379 | ff5595d67d21105 | 3.5.0 | 1.0 mb | true | false | 15 | 4863 | 4863 | |
| 192.168.210.33:2379 | b5d9c4826815356e | 3.5.0 | 1.0 mb | false | false | 15 | 4863 | 4863 | |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@pg_node1 data]# etcdctl endpoint health --endpoints='192.168.210.15:2379,192.168.210.81:2379,192.168.210.33:2379' -w table
+---------------------+--------+------------+-------+
| endpoint | health | took | error |
+---------------------+--------+------------+-------+
| 192.168.210.81:2379 | true | 7.232213ms | |
| 192.168.210.33:2379 | true | 7.497868ms | |
| 192.168.210.15:2379 | true | 7.464251ms | |
+---------------------+--------+------------+-------+
[root@pg_node1 data]# etcdctl member list -w table
+------------------+---------+--------+----------------------------+--------------------------------------------------+------------+
| id | status | name | peer addrs | client addrs | is learner |
+------------------+---------+--------+----------------------------+--------------------------------------------------+------------+
| ff5595d67d21105 | started | etcd-2 | http://192.168.210.81:2380 | http://127.0.0.1:2379,http://192.168.210.81:2379 | false |
| 2baa1a77ec379977 | started | etcd-1 | http://192.168.210.15:2380 | http://127.0.0.1:2379,http://192.168.210.15:2379 | false |
| b5d9c4826815356e | started | etcd-3 | http://192.168.210.33:2380 | http://127.0.0.1:2379,http://192.168.210.33:2379 | false |
+------------------+---------+--------+----------------------------+--------------------------------------------------+------------+
etcdctl --help
etcd configuration reference:
# etcd member name (pick your own)
name: etcd-1
# etcd data directory (pick your own)
data-dir: /opt/etcd-v3.5.0/data
# URLs to listen on for client traffic
listen-client-urls: http://192.168.210.15:2379,http://127.0.0.1:2379
# URLs advertised to clients; TCP port 2379 serves client requests
advertise-client-urls: http://192.168.210.15:2379,http://127.0.0.1:2379
# URLs to listen on for peer traffic from other members
listen-peer-urls: http://192.168.210.15:2380
# URLs advertised to the rest of the cluster; port 2380 is for cluster communication
initial-advertise-peer-urls: http://192.168.210.15:2380
# all members of the initial cluster
initial-cluster: etcd-1=http://192.168.210.15:2380,etcd-2=http://192.168.210.81:2380,etcd-3=http://192.168.210.33:2380
# cluster token, a unique identifier
initial-cluster-token: etcd-cluster-token
# cluster state: "new" when creating a cluster, "existing" when joining one
initial-cluster-state: new
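A quick smoke test of the v3 API (the key name here is arbitrary):
etcdctl --endpoints='192.168.210.15:2379' put /smoke-test ok
etcdctl --endpoints='192.168.210.15:2379' get /smoke-test
etcdctl --endpoints='192.168.210.15:2379' del /smoke-test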
PostgreSQL installation:
# Download the source tarball
wget https://ftp.postgresql.org/pub/source/v13.3/postgresql-13.3.tar.bz2
# Extract the source tarball
tar xjvf postgresql-13.3.tar.bz2
# Create the installation directory
mkdir -p -m 700 /opt/postgresql/13/data
chown -R postgres:postgres /opt/postgresql/13/data
# Build and install
cd postgresql-13.3
./configure --prefix=/opt/postgresql/13 --with-pgport=5432 --with-python --with-openssl
gmake -j 8 world
make install
make install-docs
make install-world
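# Verify the build (paths follow the --prefix chosen above)
/opt/postgresql/13/bin/pg_config --version
/opt/postgresql/13/bin/psql --version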
Patroni installation:
curl https://bootstrap.pypa.io/pip/2.7/get-pip.py -o get-pip.py
python get-pip.py
pip install --upgrade pip
pip install --upgrade setuptools
pip install --ignore-installed psycopg2
pip install psycopg2-binary
# Verify: python -c "import psycopg2; print(psycopg2.__version__)"
pip install patroni[etcd]
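# Verify the installation
patroni --version
patronictl version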
Configure sudo privileges:
# The watchdog ExecStartPre steps and the VIP callback script below need passwordless sudo for postgres; scope this more tightly in production
cat >> /etc/sudoers << EOF
postgres ALL=(ALL) NOPASSWD: ALL
EOF
Create the patroni service:
cat > /usr/lib/systemd/system/patroni.service << EOF
[Unit]
Description=Patroni PostgreSQL HA manager
After=syslog.target network.target etcd.service

[Service]
Type=simple
User=postgres
Group=postgres
# Monitor the service with the watchdog
ExecStartPre=-/usr/bin/sudo /sbin/modprobe softdog
# The service runs as the postgres user, so sudo is needed here
ExecStartPre=-/usr/bin/sudo /bin/chown postgres /dev/watchdog
# Adjust the path to the patroni binary if necessary
ExecStart=/usr/bin/patroni /etc/patroni/patroni.yml
ExecReload=/bin/kill -s HUP \$MAINPID
KillMode=process
TimeoutSec=30
Restart=no

[Install]
WantedBy=multi-user.target
EOF
# Reload systemd units
systemctl daemon-reload
Install watchdog
The watchdog is used to prevent split-brain. If a failing leader keeps the patroni process from refreshing the watchdog in time, the watchdog reboots the node 5 seconds before the leader key expires. If the reboot finishes within those 5 seconds, the old leader still has a chance to re-acquire the leader lock; otherwise the leader key expires and the standbys elect a new leader. Patroni tries to activate the watchdog before promoting PostgreSQL to master; if activation fails and the watchdog mode is required, the node refuses to become master. When deciding whether to take part in a leader election, patroni also checks that the watchdog configuration would allow it to win. After PostgreSQL is demoted (for example by a manual failover), patroni disables the watchdog again. The watchdog is likewise disabled while patroni is paused, and when the patroni service is stopped cleanly.
# Install the package (softdog is a stock Linux kernel module)
yum install -y watchdog
# Load the module to create the watchdog character device
modprobe softdog
# Relax permissions on /dev/watchdog
chmod 666 /dev/watchdog
# Start and enable the watchdog service
systemctl start watchdog
systemctl enable watchdog
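# Confirm the module is loaded and the device exists
lsmod | grep softdog
ls -l /dev/watchdog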
Configure patroni
su - postgres
sudo mkdir -p /etc/patroni
sudo chown -R postgres:postgres /etc/patroni
# pg_node1
# The config file path was already defined in the systemd service above
cat > /etc/patroni/patroni.yml << EOF
scope: twpg # written into PostgreSQL's cluster_name parameter
namespace: /service/ # key prefix in etcd, e.g. /service/twpg/
name: pg1 # patroni member name; unique per node
restapi:
  listen: 0.0.0.0:8008 # keep the default; listen on port 8008 on all interfaces
  connect_address: 192.168.210.15:8008 # address other members use to reach this REST API
etcd3: # etcd v3 is recommended here; with the default etcd v2 protocol, the keys patroni writes are invisible to the v3 API
  hosts: 192.168.210.15:2379,192.168.210.81:2379,192.168.210.33:2379 # etcd endpoints; for a single-node etcd, replace the "hosts" key with "host"
log:
  dir: /etc/patroni
  file_size: 50000000
  file_num: 10
  dateformat: '%Y-%m-%d %H:%M:%S'
  loggers:
    patroni.postmaster: WARNING
    #etcd.client: DEBUG
    #urllib3: DEBUG
# bootstrap settings: when patroni initializes the cluster, it writes them into etcd under /namespace/scope/config
bootstrap:
  dcs:
    ttl: 30 # expiry of the leader key, i.e. how long a failed primary holds up failover
    loop_wait: 10 # interval between refreshes of the leader key
    retry_timeout: 10 # timeout (seconds) for retrying etcd and PostgreSQL operations; outages shorter than this never demote the leader, e.g. a brief network blip will not trigger a failover
    maximum_lag_on_failover: 1048576 # a replica lagging the master by more than this many bytes is excluded from leader election
    master_start_timeout: 300 # how long the primary may try to recover from a failure before failover is triggered
    synchronous_mode: false # asynchronous replication
    postgresql: # PostgreSQL service settings and parameters; not described in detail here
      use_pg_rewind: true
      use_slots: true
      parameters:
        listen_addresses: "0.0.0.0"
        port: 5432
        wal_level: replica
        hot_standby: "on"
        wal_keep_size: 4096 # PostgreSQL 13 replaced wal_keep_segments; 4096MB = 256 x 16MB segments
        max_wal_senders: 10
        max_replication_slots: 10
        wal_log_hints: "on"
        logging_collector: "on"
        #archive_mode: "on"
        #archive_timeout: 1800s
        #archive_command: test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f
      #recovery_conf:
        #restore_command: cp /mnt/server/archivedir/%f %p
  initdb:
  - encoding: UTF8
  - locale: C
  - lc-ctype: zh_CN.UTF-8
  - data-checksums
  pg_hba: # replication user and remote-connection authentication rules
  - host replication repuser 192.168.210.0/24 md5
  - host all all 192.168.0.0/16 md5
postgresql:
  listen: 0.0.0.0:5432
  connect_address: 192.168.210.15:5432 # address for connecting to this PostgreSQL instance; do not use 127.0.0.1 here, pg_basebackup must reach the primary remotely for the initial copy
  data_dir: /opt/postgresql/13/data # \$PGDATA
  bin_dir: /opt/postgresql/13/bin # \$PGHOME/bin
  authentication:
    replication:
      username: repuser
      password: "test@123456"
    superuser:
      username: twsm
      password: "test@123456"
    rewind:
      username: twsm
      password: "test@123456"
  basebackup:
    #max-rate: 100M
    checkpoint: fast
  callbacks: # this setup uses no haproxy/keepalived for VIP failover and load balancing; callbacks are faster, lighter on resources, and simpler to operate (script provided below)
    on_start: /bin/bash /etc/patroni/patroni_callback.sh # fired when the patroni service starts
    on_stop: /bin/bash /etc/patroni/patroni_callback.sh # fired when the patroni service stops
    on_role_change: /bin/bash /etc/patroni/patroni_callback.sh # fired when this node's role changes
watchdog: # use the Linux software watchdog to guard the patroni service's liveness
  mode: automatic # allowed values: off, automatic, required
  device: /dev/watchdog # watchdog device; /dev/watchdog and /dev/watchdog0 are equivalent, modulo compatibility quirks
  safety_margin: 5
## safety_margin is how far ahead of leader-key expiry the watchdog fires if patroni stops refreshing it. With this configuration (ttl=30, loop_wait=10, safety_margin=5), patroni refreshes the leader key and the watchdog every 10 seconds (loop_wait), and the watchdog timeout is ttl - safety_margin = 25 seconds, matching the "software watchdog activated with 25 second timeout" log line later on.
tags: # per-node flags; in a cluster spanning data centers, a remote node can be excluded from leader election, from load balancing, or from serving as a synchronous standby
  nofailover: false # typically for remote nodes: exclude from automatic failover
  noloadbalance: false # typically for remote nodes: exclude from load balancing
  clonefrom: false
  nosync: false # typically for remote nodes: never pick this node as a synchronous standby
EOF
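# Optional: sanity-check the YAML before starting patroni (PyYAML is installed as a patroni dependency)
python -c "import yaml; yaml.safe_load(open('/etc/patroni/patroni.yml'))"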
# pg_node2
# changed lines only
name: pg2
restapi:
  listen: 0.0.0.0:8008
  connect_address: 192.168.210.81:8008
postgresql:
  listen: 0.0.0.0:5432
  connect_address: 192.168.210.81:5432
# pg_node3
# changed lines only
name: pg3
restapi:
  listen: 0.0.0.0:8008
  connect_address: 192.168.210.33:8008
postgresql:
  listen: 0.0.0.0:5432
  connect_address: 192.168.210.33:5432
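# Only the member name and the two connect_address lines differ per node, so each node's file can be derived from pg_node1's copy; a sketch for pg_node2 (review the output before installing it, pg_node3 is analogous):
sed -e 's/name: pg1/name: pg2/' \
    -e 's/connect_address: 192.168.210.15:8008/connect_address: 192.168.210.81:8008/' \
    -e 's/connect_address: 192.168.210.15:5432/connect_address: 192.168.210.81:5432/' \
    /etc/patroni/patroni.yml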
Create the patroni_callback script
# The script reads three positional parameters. patroni.yml passes no explicit arguments, but testing shows the patroni service supplies these three by default:
# $1 - action: the event patroni fired (stop/start/on_role_change/restart/reload)
# $2 - role: this node's current role (master or slave/replica)
# $3 - scope: the cluster this applies to (the twpg cluster)
# The script is borrowed from another blog; the logic is simple, so it is used as-is:
# when the node's role is primary, bind the VIP with ip addr
# when the node's role is standby, remove the VIP with ip addr
cat > /etc/patroni/patroni_callback.sh << EOF
#!/bin/bash
readonly action=\$1
readonly role=\$2
readonly scope=\$3
vip=192.168.210.66
dev=eth0
function usage() {
    echo "usage: \$0 <action> <role> <scope>"
    exit 1
}
echo "this is patroni callback \$action \$role \$scope"
case \$action in
    on_stop)
        sudo ip addr del \${vip}/24 dev \$dev 2>/dev/null
        ;;
    on_start)
        ;;
    on_role_change)
        if [[ \$role == 'master' ]]; then
            # bind the VIP
            sudo ip addr add \${vip}/24 dev \$dev 2>/dev/null
            # announce the VIP via gratuitous ARP so neighbor caches are updated
            sudo arping -q -A -c 1 -I \$dev \$vip
        else
            sudo ip addr del \${vip}/24 dev \$dev 2>/dev/null
        fi
        ;;
    *)
        usage
        ;;
esac
EOF
chmod u+x /etc/patroni/patroni_callback.sh
chown postgres:postgres /etc/patroni/patroni_callback.sh
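# Optional dry run of the callback as the postgres user (the arguments mirror what patroni passes):
sudo -u postgres /bin/bash /etc/patroni/patroni_callback.sh on_role_change master twpg
ip -o -4 a | grep 192.168.210.66
sudo -u postgres /bin/bash /etc/patroni/patroni_callback.sh on_role_change replica twpg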
Start the patroni service
# On first start, patroni initializes the primary database and then builds the replicas automatically
systemctl enable patroni
systemctl start patroni
[root@pg_node1 data]# patronictl -c /etc/patroni/patroni.yml list
+ cluster: twpg (6976142033405049133) ---+---------+----+-----------+
| member | host | role | state | tl | lag in mb |
+--------+---------------------+---------+---------+----+-----------+
| pg1 | 192.168.210.15:5432 | leader | running | 3 | |
| pg2 | 192.168.210.81:5432 | replica | running | 3 | 0.0 |
| pg3 | 192.168.210.33:5432 | replica | running | 3 | 0.0 |
+--------+---------------------+---------+---------+----+-----------+
Operations
# List all keys
[root@pg_node1 data]# etcdctl --endpoints='192.168.210.15:2379' get / --prefix --keys-only
/service/twpg/config
/service/twpg/history
/service/twpg/initialize
/service/twpg/leader
/service/twpg/members/pg1
/service/twpg/members/pg2
/service/twpg/members/pg3
/service/twpg/optime/leader
# Query a specific key
[root@pg_node1 data]# etcdctl --endpoints='192.168.210.15:2379' get /service/twpg/config
# Query patroni via the REST API (primary)
[root@pg_node1 data]# curl -s http://192.168.210.15:8008/patroni | jq
{
"database_system_identifier": "6976142033405049133",
"postmaster_start_time": "2021-06-21 15:33:22.073 cst",
"timeline": 3,
"cluster_unlocked": false,
"patroni": {
"scope": "twpg",
"version": "2.0.2"
},
"replication": [
{
"sync_state": "async",
"sync_priority": 0,
"client_addr": "192.168.210.81",
"state": "streaming",
"application_name": "pg2",
"usename": "repuser"
},
{
"sync_state": "async",
"sync_priority": 0,
"client_addr": "192.168.210.33",
"state": "streaming",
"application_name": "pg3",
"usename": "repuser"
}
],
"state": "running",
"role": "master",
"xlog": {
"location": 83886408
},
"server_version": 130003
}
# Query patroni via the REST API (replica)
[root@pg_node1 data]# curl -s http://192.168.210.81:8008/patroni | jq
{
"database_system_identifier": "6976142033405049133",
"postmaster_start_time": "2021-06-21 15:37:42.746 cst",
"timeline": 3,
"cluster_unlocked": false,
"patroni": {
"scope": "twpg",
"version": "2.0.2"
},
"state": "running",
"role": "replica",
"xlog": {
"received_location": 83886408,
"replayed_timestamp": null,
"paused": false,
"replayed_location": 83886408
},
"server_version": 130003
}
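# Patroni also exposes role-based endpoints that answer with HTTP 200/503, handy as load-balancer health checks:
curl -s -o /dev/null -w '%{http_code}\n' http://192.168.210.15:8008/master
curl -s -o /dev/null -w '%{http_code}\n' http://192.168.210.15:8008/replica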
# Simulate an etcd failure: stop the etcd leader
[root@pg_node1 ~]# etcdctl endpoint status --endpoints='192.168.210.15:2379,192.168.210.81:2379,192.168.210.33:2379' -w table
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| endpoint | id | version | db size | is leader | is learner | raft term | raft index | raft applied index | errors |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 192.168.210.15:2379 | 2baa1a77ec379977 | 3.5.0 | 1.0 mb | false | false | 15 | 4866 | 4866 | |
| 192.168.210.81:2379 | ff5595d67d21105 | 3.5.0 | 1.0 mb | true | false | 15 | 4866 | 4866 | |
| 192.168.210.33:2379 | b5d9c4826815356e | 3.5.0 | 1.0 mb | false | false | 15 | 4866 | 4866 | |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@pg_node2 ~]# systemctl stop etcd
[root@pg_node2 ~]# etcdctl endpoint status --endpoints='192.168.210.15:2379,192.168.210.81:2379,192.168.210.33:2379' -w table
{"level":"warn","ts":"2021-06-23t09:20:00.826 0800","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0003fea80/#initially=[192.168.210.15:2379;192.168.210.81:2379;192.168.210.33:2379]","attempt":0,"error":"rpc error: code = deadlineexceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: error while dialing dial tcp 192.168.210.81:2379: connect: connection refused\""}
failed to get the status of endpoint 192.168.210.81:2379 (context deadline exceeded)
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| endpoint | id | version | db size | is leader | is learner | raft term | raft index | raft applied index | errors |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 192.168.210.15:2379 | 2baa1a77ec379977 | 3.5.0 | 1.0 mb | false | false | 16 | 4867 | 4867 | |
| 192.168.210.33:2379 | b5d9c4826815356e | 3.5.0 | 1.0 mb | true | false | 16 | 4867 | 4867 | |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@pg_node1 ~]# etcdctl endpoint status --endpoints='192.168.210.15:2379,192.168.210.81:2379,192.168.210.33:2379' -w table
{"level":"warn","ts":"2021-06-23t09:16:37.350 0800","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00037c700/#initially=[192.168.210.15:2379;192.168.210.81:2379;192.168.210.33:2379]","attempt":0,"error":"rpc error: code = deadlineexceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: error while dialing dial tcp 192.168.210.81:2379: connect: connection refused\""}
failed to get the status of endpoint 192.168.210.81:2379 (context deadline exceeded)
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| endpoint | id | version | db size | is leader | is learner | raft term | raft index | raft applied index | errors |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 192.168.210.15:2379 | 2baa1a77ec379977 | 3.5.0 | 1.0 mb | false | false | 16 | 4867 | 4867 | |
| 192.168.210.33:2379 | b5d9c4826815356e | 3.5.0 | 1.0 mb | true | false | 16 | 4867 | 4867 | |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@pg_node3 ~]# etcdctl endpoint status --endpoints='192.168.210.15:2379,192.168.210.81:2379,192.168.210.33:2379' -w table
{"level":"warn","ts":"2021-06-23t09:20:39.338 0800","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00013a380/#initially=[192.168.210.15:2379;192.168.210.81:2379;192.168.210.33:2379]","attempt":0,"error":"rpc error: code = deadlineexceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: error while dialing dial tcp 192.168.210.81:2379: connect: connection refused\""}
failed to get the status of endpoint 192.168.210.81:2379 (context deadline exceeded)
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| endpoint | id | version | db size | is leader | is learner | raft term | raft index | raft applied index | errors |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 192.168.210.15:2379 | 2baa1a77ec379977 | 3.5.0 | 1.0 mb | false | false | 16 | 4867 | 4867 | |
| 192.168.210.33:2379 | b5d9c4826815356e | 3.5.0 | 1.0 mb | true | false | 16 | 4867 | 4867 | |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
# As shown above, etcd has elected a new leader, and pg_node2 now refuses connections
# Next, start etcd on pg_node2 again; the node rejoins the cluster
[root@pg_node2 ~]# systemctl start etcd
[root@pg_node2 ~]# etcdctl endpoint status --endpoints='192.168.210.15:2379,192.168.210.81:2379,192.168.210.33:2379' -w table
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| endpoint | id | version | db size | is leader | is learner | raft term | raft index | raft applied index | errors |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 192.168.210.15:2379 | 2baa1a77ec379977 | 3.5.0 | 1.0 mb | false | false | 16 | 4868 | 4868 | |
| 192.168.210.81:2379 | ff5595d67d21105 | 3.5.0 | 1.0 mb | false | false | 16 | 4868 | 4868 | |
| 192.168.210.33:2379 | b5d9c4826815356e | 3.5.0 | 1.0 mb | true | false | 16 | 4868 | 4868 | |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@pg_node1 ~]# etcdctl endpoint status --endpoints='192.168.210.15:2379,192.168.210.81:2379,192.168.210.33:2379' -w table
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| endpoint | id | version | db size | is leader | is learner | raft term | raft index | raft applied index | errors |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 192.168.210.15:2379 | 2baa1a77ec379977 | 3.5.0 | 1.0 mb | false | false | 16 | 4868 | 4868 | |
| 192.168.210.81:2379 | ff5595d67d21105 | 3.5.0 | 1.0 mb | false | false | 16 | 4868 | 4868 | |
| 192.168.210.33:2379 | b5d9c4826815356e | 3.5.0 | 1.0 mb | true | false | 16 | 4868 | 4868 | |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@pg_node3 ~]# etcdctl endpoint status --endpoints='192.168.210.15:2379,192.168.210.81:2379,192.168.210.33:2379' -w table
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| endpoint | id | version | db size | is leader | is learner | raft term | raft index | raft applied index | errors |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 192.168.210.15:2379 | 2baa1a77ec379977 | 3.5.0 | 1.0 mb | false | false | 16 | 4868 | 4868 | |
| 192.168.210.81:2379 | ff5595d67d21105 | 3.5.0 | 1.0 mb | false | false | 16 | 4868 | 4868 | |
| 192.168.210.33:2379 | b5d9c4826815356e | 3.5.0 | 1.0 mb | true | false | 16 | 4868 | 4868 | |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
# Simulate an etcd failure: stop any two of the three etcd nodes
[root@pg_node2 ~]# systemctl stop etcd
[root@pg_node2 ~]# etcdctl endpoint status --endpoints='192.168.210.15:2379,192.168.210.81:2379,192.168.210.33:2379' -w table
{"level":"warn","ts":"2021-06-23t09:20:00.826 0800","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0003fea80/#initially=[192.168.210.15:2379;192.168.210.81:2379;192.168.210.33:2379]","attempt":0,"error":"rpc error: code = deadlineexceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: error while dialing dial tcp 192.168.210.81:2379: connect: connection refused\""}
failed to get the status of endpoint 192.168.210.81:2379 (context deadline exceeded)
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| endpoint | id | version | db size | is leader | is learner | raft term | raft index | raft applied index | errors |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 192.168.210.15:2379 | 2baa1a77ec379977 | 3.5.0 | 1.0 mb | false | false | 16 | 4867 | 4867 | |
| 192.168.210.33:2379 | b5d9c4826815356e | 3.5.0 | 1.0 mb | true | false | 16 | 4867 | 4867 | |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@pg_node3 ~]# systemctl stop etcd
[root@pg_node3 ~]# etcdctl endpoint status --endpoints='192.168.210.15:2379,192.168.210.81:2379,192.168.210.33:2379' -w table
{"level":"warn","ts":"2021-06-23t09:27:22.438 0800","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000150000/#initially=[192.168.210.15:2379;192.168.210.81:2379;192.168.210.33:2379]","attempt":0,"error":"rpc error: code = deadlineexceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: error while dialing dial tcp 192.168.210.81:2379: connect: connection refused\""}
failed to get the status of endpoint 192.168.210.81:2379 (context deadline exceeded)
{"level":"warn","ts":"2021-06-23t09:27:27.439 0800","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000150000/#initially=[192.168.210.15:2379;192.168.210.81:2379;192.168.210.33:2379]","attempt":0,"error":"rpc error: code = deadlineexceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: error while dialing dial tcp 192.168.210.33:2379: connect: connection refused\""}
failed to get the status of endpoint 192.168.210.33:2379 (context deadline exceeded)
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+-----------------------+
| endpoint | id | version | db size | is leader | is learner | raft term | raft index | raft applied index | errors |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+-----------------------+
| 192.168.210.15:2379 | 2baa1a77ec379977 | 3.5.0 | 1.0 mb | false | false | 17 | 4869 | 4869 | etcdserver: no leader |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+-----------------------+
# As shown above, with any two of the three nodes stopped, the etcd cluster loses quorum and becomes unavailable
# Once the nodes are started again the cluster recovers on its own; it is very robust
[root@pg_node3 ~]# systemctl start etcd
[root@pg_node3 ~]# etcdctl endpoint status --endpoints='192.168.210.15:2379,192.168.210.81:2379,192.168.210.33:2379' -w table
{"level":"warn","ts":"2021-06-23t09:28:31.642 0800","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000150a80/#initially=[192.168.210.15:2379;192.168.210.81:2379;192.168.210.33:2379]","attempt":0,"error":"rpc error: code = deadlineexceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: error while dialing dial tcp 192.168.210.81:2379: connect: connection refused\""}
failed to get the status of endpoint 192.168.210.81:2379 (context deadline exceeded)
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| endpoint | id | version | db size | is leader | is learner | raft term | raft index | raft applied index | errors |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 192.168.210.15:2379 | 2baa1a77ec379977 | 3.5.0 | 1.0 mb | true | false | 18 | 4883 | 4883 | |
| 192.168.210.33:2379 | b5d9c4826815356e | 3.5.0 | 1.0 mb | false | false | 18 | 4883 | 4883 | |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@pg_node2 ~]# systemctl start etcd
[root@pg_node2 ~]# etcdctl endpoint status --endpoints='192.168.210.15:2379,192.168.210.81:2379,192.168.210.33:2379' -w table
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| endpoint | id | version | db size | is leader | is learner | raft term | raft index | raft applied index | errors |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 192.168.210.15:2379 | 2baa1a77ec379977 | 3.5.0 | 1.0 mb | true | false | 18 | 4888 | 4888 | |
| 192.168.210.81:2379 | ff5595d67d21105 | 3.5.0 | 1.0 mb | false | false | 18 | 4888 | 4888 | |
| 192.168.210.33:2379 | b5d9c4826815356e | 3.5.0 | 1.0 mb | false | false | 18 | 4888 | 4888 | |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@pg_node1 ~]# etcdctl endpoint status --endpoints='192.168.210.15:2379,192.168.210.81:2379,192.168.210.33:2379' -w table
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| endpoint | id | version | db size | is leader | is learner | raft term | raft index | raft applied index | errors |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 192.168.210.15:2379 | 2baa1a77ec379977 | 3.5.0 | 1.0 mb | true | false | 19 | 4905 | 4905 | |
| 192.168.210.81:2379 | ff5595d67d21105 | 3.5.0 | 1.0 mb | false | false | 19 | 4905 | 4905 | |
| 192.168.210.33:2379 | b5d9c4826815356e | 3.5.0 | 1.0 mb | false | false | 19 | 4905 | 4905 | |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
# Simulate a PostgreSQL failure
# Stop the primary's database service
[postgres@pg_node1 ~]$ pg_ctl stop -D /opt/postgresql/13/data
waiting for server to shut down.... done
server stopped
[postgres@pg_node1 ~]$ ps -ef|grep postgres
postgres 4447 4367 0 09:37 pts/0 00:00:00 grep --color=auto postgres
[postgres@pg_node1 ~]$ ps -ef|grep postgres
postgres 4451 4367 0 09:37 pts/0 00:00:00 grep --color=auto postgres
[postgres@pg_node1 ~]$ ps -ef|grep postgres
postgres 4453 4367 0 09:37 pts/0 00:00:00 grep --color=auto postgres
[postgres@pg_node1 ~]$ ps -ef|grep postgres
postgres 4459 1 1 09:37 ? 00:00:00 /opt/postgresql/13/bin/postgres -d /opt/postgresql/13/data --config-file=/opt/postgresql/13/data/postgresql.conf --listen_addresses=0.0.0.0 --max_worker_processes=8 --max_prepared_transactions=0 --wal_level=replica --track_commit_timestamp=off --max_locks_per_transaction=64 --port=5432 --max_replication_slots=10 --max_connections=100 --hot_standby=on --cluster_name=twpg --wal_log_hints=on --max_wal_senders=10
postgres 4462 4459 0 09:37 ? 00:00:00 postgres: twpg: checkpointer
postgres 4463 4459 0 09:37 ? 00:00:00 postgres: twpg: background writer
postgres 4464 4459 0 09:37 ? 00:00:00 postgres: twpg: stats collector
postgres 4472 4459 0 09:37 ? 00:00:00 postgres: twpg: twsm postgres 127.0.0.1(40298) idle
postgres 4487 4459 0 09:37 ? 00:00:00 postgres: twpg: walwriter
postgres 4488 4459 0 09:37 ? 00:00:00 postgres: twpg: autovacuum launcher
postgres 4489 4459 0 09:37 ? 00:00:00 postgres: twpg: logical replication launcher
postgres 4490 4459 0 09:37 ? 00:00:00 postgres: twpg: walsender repuser 192.168.210.33(52430) streaming 0/5000668
postgres 4491 4459 0 09:37 ? 00:00:00 postgres: twpg: walsender repuser 192.168.210.81(62184) streaming 0/5000668
# patroni restarted the PostgreSQL service automatically; no failover occurred
[root@pg_node1 log]# patronictl -c /etc/patroni/patroni.yml list
+ cluster: twpg (6976142033405049133) ---+---------+----+-----------+
| member | host | role | state | tl | lag in mb |
+--------+---------------------+---------+---------+----+-----------+
| pg1 | 192.168.210.15:5432 | leader | running | 6 | |
| pg2 | 192.168.210.81:5432 | replica | running | 6 | 0.0 |
| pg3 | 192.168.210.33:5432 | replica | running | 6 | 0.0 |
+--------+---------------------+---------+---------+----+-----------+
# The patroni log right after the stop: patroni detects that PostgreSQL is down and tries to start it again
jun 23 09:37:45 pg_node1 patroni: 2021-06-23 09:37:45,395 info: no action. i am the leader with the lock
jun 23 09:37:47 pg_node1 patroni: 2021-06-23 09:37:47.040 cst [4012] log: received fast shutdown request
jun 23 09:37:47 pg_node1 patroni: 2021-06-23 09:37:47.047 cst [4012] log: aborting any active transactions
jun 23 09:37:47 pg_node1 patroni: 2021-06-23 09:37:47.047 cst [4056] fatal: terminating connection due to administrator command
jun 23 09:37:47 pg_node1 patroni: 2021-06-23 09:37:47.048 cst [4012] log: background worker "logical replication launcher" (pid 4258) exited with exit code 1
jun 23 09:37:47 pg_node1 patroni: 2021-06-23 09:37:47.049 cst [4015] log: shutting down
jun 23 09:37:47 pg_node1 patroni: 2021-06-23 09:37:47.122 cst [4012] log: database system is shut down
jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55,389 warning: postgresql is not running.
jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55,390 info: lock owner: pg1; i am pg1
jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55,404 info: pg_controldata:
jun 23 09:37:55 pg_node1 patroni: database system identifier: 6976142033405049133
jun 23 09:37:55 pg_node1 patroni: pg_control last modified: wed jun 23 09:37:47 2021
jun 23 09:37:55 pg_node1 patroni: blocks per segment of large relation: 131072
jun 23 09:37:55 pg_node1 patroni: size of a large-object chunk: 2048
jun 23 09:37:55 pg_node1 patroni: wal block size: 8192
jun 23 09:37:55 pg_node1 patroni: latest checkpoint's oldestactivexid: 0
jun 23 09:37:55 pg_node1 patroni: latest checkpoint's timelineid: 5
jun 23 09:37:55 pg_node1 patroni: bytes per wal segment: 16777216
jun 23 09:37:55 pg_node1 patroni: fake lsn counter for unlogged rels: 0/3e8
jun 23 09:37:55 pg_node1 patroni: max_connections setting: 100
jun 23 09:37:55 pg_node1 patroni: latest checkpoint location: 0/5000510
jun 23 09:37:55 pg_node1 patroni: float8 argument passing: by value
jun 23 09:37:55 pg_node1 patroni: minimum recovery ending location: 0/0
jun 23 09:37:55 pg_node1 patroni: track_commit_timestamp setting: off
jun 23 09:37:55 pg_node1 patroni: latest checkpoint's newestcommittsxid: 0
jun 23 09:37:55 pg_node1 patroni: latest checkpoint's nextmultixactid: 1
jun 23 09:37:55 pg_node1 patroni: maximum size of a toast chunk: 1996
jun 23 09:37:55 pg_node1 patroni: maximum data alignment: 8
jun 23 09:37:55 pg_node1 patroni: date/time type storage: 64-bit integers
jun 23 09:37:55 pg_node1 patroni: database block size: 8192
jun 23 09:37:55 pg_node1 patroni: data page checksum version: 1
jun 23 09:37:55 pg_node1 patroni: time of latest checkpoint: wed jun 23 09:37:47 2021
jun 23 09:37:55 pg_node1 patroni: wal_log_hints setting: on
jun 23 09:37:55 pg_node1 patroni: latest checkpoint's full_page_writes: on
jun 23 09:37:55 pg_node1 patroni: end-of-backup record required: no
jun 23 09:37:55 pg_node1 patroni: max_prepared_xacts setting: 0
jun 23 09:37:55 pg_node1 patroni: latest checkpoint's nextmultioffset: 0
jun 23 09:37:55 pg_node1 patroni: backup start location: 0/0
jun 23 09:37:55 pg_node1 patroni: latest checkpoint's oldestmultixid: 1
jun 23 09:37:55 pg_node1 patroni: mock authentication nonce: 020ac2d0808d3cf471a8d90e23263e8a317b31c5f7e494fc3fba551683e5c39b
jun 23 09:37:55 pg_node1 patroni: latest checkpoint's nextoid: 16385
jun 23 09:37:55 pg_node1 patroni: maximum columns in an index: 32
jun 23 09:37:55 pg_node1 patroni: latest checkpoint's oldestxid: 479
jun 23 09:37:55 pg_node1 patroni: catalog version number: 202007201
jun 23 09:37:55 pg_node1 patroni: max_worker_processes setting: 8
jun 23 09:37:55 pg_node1 patroni: maximum length of identifiers: 64
jun 23 09:37:55 pg_node1 patroni: min recovery ending loc's timeline: 0
jun 23 09:37:55 pg_node1 patroni: max_locks_per_xact setting: 64
jun 23 09:37:55 pg_node1 patroni: max_wal_senders setting: 10
jun 23 09:37:55 pg_node1 patroni: latest checkpoint's nextxid: 0:488
jun 23 09:37:55 pg_node1 patroni: latest checkpoint's redo location: 0/5000510
jun 23 09:37:55 pg_node1 patroni: backup end location: 0/0
jun 23 09:37:55 pg_node1 patroni: database cluster state: shut down
jun 23 09:37:55 pg_node1 patroni: pg_control version number: 1300
jun 23 09:37:55 pg_node1 patroni: wal_level setting: replica
jun 23 09:37:55 pg_node1 patroni: latest checkpoint's redo wal file: 000000050000000000000005
jun 23 09:37:55 pg_node1 patroni: latest checkpoint's oldestcommittsxid: 0
jun 23 09:37:55 pg_node1 patroni: latest checkpoint's oldestxid's db: 1
jun 23 09:37:55 pg_node1 patroni: latest checkpoint's oldestmulti's db: 1
jun 23 09:37:55 pg_node1 patroni: latest checkpoint's prevtimelineid: 5
jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55,406 info: lock owner: pg1; i am pg1
jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55,406 info: lock owner: pg1; i am pg1
jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55,418 info: starting as readonly because i had the session lock
jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55,420 info: closed patroni connection to the postgresql cluster
jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55,452 info: postmaster pid=4459
jun 23 09:37:55 pg_node1 patroni: localhost:5432 - no response
jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55.485 cst [4459] log: starting postgresql 13.3 on x86_64-pc-linux-gnu, compiled by gcc (gcc) 4.8.5 20150623 (red hat 4.8.5-44), 64-bit
jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55.485 cst [4459] log: listening on ipv4 address "0.0.0.0", port 5432
jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55.508 cst [4459] log: listening on unix socket "/tmp/.s.pgsql.5432"
jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55.524 cst [4461] log: database system was shut down at 2021-06-23 09:37:47 cst
jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55.525 cst [4461] warning: specified neither primary_conninfo nor restore_command
jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55.525 cst [4461] hint: the database server will regularly poll the pg_wal subdirectory to check for files placed there.
jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55.525 cst [4461] log: entering standby mode
jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55.533 cst [4461] log: consistent recovery state reached at 0/5000588
jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55.534 cst [4461] log: invalid record length at 0/5000588: wanted 24, got 0
jun 23 09:37:55 pg_node1 patroni: 2021-06-23 09:37:55.535 cst [4459] log: database system is ready to accept read only connections
jun 23 09:37:56 pg_node1 patroni: localhost:5432 - accepting connections
jun 23 09:37:56 pg_node1 patroni: localhost:5432 - accepting connections
jun 23 09:37:56 pg_node1 patroni: this is patroni callback on_role_change replica twpg
jun 23 09:37:56 pg_node1 patroni: 2021-06-23 09:37:56,544 info: lock owner: pg1; i am pg1
jun 23 09:37:56 pg_node1 patroni: 2021-06-23 09:37:56,545 info: establishing a new patroni connection to the postgres cluster
jun 23 09:37:56 pg_node1 systemd: started session c31 of user root.
jun 23 09:37:56 pg_node1 patroni: 2021-06-23 09:37:56,570 info: software watchdog activated with 25 second timeout, timing slack 15 seconds
jun 23 09:37:56 pg_node1 patroni: 2021-06-23 09:37:56,584 info: promoted self to leader because i had the session lock
jun 23 09:37:56 pg_node1 patroni: 2021-06-23 09:37:56,586 info: lock owner: pg1; i am pg1
jun 23 09:37:56 pg_node1 patroni: server promoting
jun 23 09:37:56 pg_node1 patroni: 2021-06-23 09:37:56.593 cst [4461] log: received promote request
jun 23 09:37:56 pg_node1 patroni: 2021-06-23 09:37:56.593 cst [4461] log: redo is not required
jun 23 09:37:56 pg_node1 patroni: 2021-06-23 09:37:56,593 info: cleared rewind state after becoming the leader
jun 23 09:37:56 pg_node1 patroni: 2021-06-23 09:37:56.599 cst [4461] log: selected new timeline id: 6
jun 23 09:37:56 pg_node1 patroni: this is patroni callback on_role_change master twpg
jun 23 09:37:56 pg_node1 patroni: 2021-06-23 09:37:56,618 info: updated leader lock during promote
jun 23 09:37:56 pg_node1 systemd: started session c32 of user root.
jun 23 09:37:56 pg_node1 systemd: started session c33 of user root.
jun 23 09:37:56 pg_node1 patroni: 2021-06-23 09:37:56.880 cst [4461] log: archive recovery complete
jun 23 09:37:56 pg_node1 patroni: 2021-06-23 09:37:56.911 cst [4459] log: database system is ready to accept connections
jun 23 09:37:57 pg_node1 patroni: 2021-06-23 09:37:57,638 info: lock owner: pg1; i am pg1
jun 23 09:37:57 pg_node1 patroni: 2021-06-23 09:37:57,724 info: no action. i am the leader with the lock
# Simulate a PostgreSQL failure
# Stop a replica's database service
[postgres@pg_node2 ~]$ pg_ctl stop -D /opt/postgresql/13/data
waiting for server to shut down.... done
server stopped
[postgres@pg_node2 ~]$ ps -ef|grep postgres
postgres 26277 26180 0 09:51 pts/0 00:00:00 grep --color=auto postgres
[postgres@pg_node2 ~]$ ps -ef|grep postgres
postgres 26284 1 2 09:51 ? 00:00:00 /opt/postgresql/13/bin/postgres -d /opt/postgresql/13/data --config-file=/opt/postgresql/13/data/postgresql.conf --listen_addresses=0.0.0.0 --max_worker_processes=8 --max_prepared_transactions=0 --wal_level=replica --track_commit_timestamp=off --max_locks_per_transaction=64 --port=5432 --max_replication_slots=10 --max_connections=100 --hot_standby=on --cluster_name=twpg --wal_log_hints=on --max_wal_senders=10
postgres 26286 26284 0 09:51 ? 00:00:00 postgres: twpg: startup recovering 000000060000000000000005
postgres 26287 26284 0 09:51 ? 00:00:00 postgres: twpg: checkpointer
postgres 26288 26284 0 09:51 ? 00:00:00 postgres: twpg: background writer
postgres 26289 26284 0 09:51 ? 00:00:00 postgres: twpg: stats collector
postgres 26290 26284 0 09:51 ? 00:00:00 postgres: twpg: walreceiver
postgres 26293 26180 0 09:51 pts/0 00:00:00 grep --color=auto postgres
# The patroni log right after the stop: patroni detects that PostgreSQL is down and tries to start it again
jun 23 09:51:29 pg_node2 patroni: 2021-06-23 09:51:29,043 info: does not have lock
jun 23 09:51:29 pg_node2 patroni: 2021-06-23 09:51:29,045 info: no action. i am a secondary and i am following a leader
jun 23 09:51:37 pg_node2 patroni: 2021-06-23 09:51:37.498 cst [26228] log: received fast shutdown request
jun 23 09:51:37 pg_node2 patroni: 2021-06-23 09:51:37.505 cst [26228] log: aborting any active transactions
jun 23 09:51:37 pg_node2 patroni: 2021-06-23 09:51:37.505 cst [26234] fatal: terminating walreceiver process due to administrator command
jun 23 09:51:37 pg_node2 patroni: 2021-06-23 09:51:37.505 cst [26241] fatal: terminating connection due to administrator command
jun 23 09:51:37 pg_node2 patroni: 2021-06-23 09:51:37.507 cst [26231] log: shutting down
jun 23 09:51:37 pg_node2 patroni: 2021-06-23 09:51:37.522 cst [26228] log: database system is shut down
jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39,042 warning: postgresql is not running.
jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39,042 info: lock owner: pg1; i am pg2
jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39,053 info: pg_controldata:
jun 23 09:51:39 pg_node2 patroni: database system identifier: 6976142033405049133
jun 23 09:51:39 pg_node2 patroni: pg_control last modified: wed jun 23 09:51:37 2021
jun 23 09:51:39 pg_node2 patroni: blocks per segment of large relation: 131072
jun 23 09:51:39 pg_node2 patroni: size of a large-object chunk: 2048
jun 23 09:51:39 pg_node2 patroni: wal block size: 8192
jun 23 09:51:39 pg_node2 patroni: latest checkpoint's oldestactivexid: 488
jun 23 09:51:39 pg_node2 patroni: latest checkpoint's timelineid: 6
jun 23 09:51:39 pg_node2 patroni: bytes per wal segment: 16777216
jun 23 09:51:39 pg_node2 patroni: fake lsn counter for unlogged rels: 0/3e8
jun 23 09:51:39 pg_node2 patroni: max_connections setting: 100
jun 23 09:51:39 pg_node2 patroni: latest checkpoint location: 0/50006a0
jun 23 09:51:39 pg_node2 patroni: float8 argument passing: by value
jun 23 09:51:39 pg_node2 patroni: minimum recovery ending location: 0/5000750
jun 23 09:51:39 pg_node2 patroni: track_commit_timestamp setting: off
jun 23 09:51:39 pg_node2 patroni: latest checkpoint's newestcommittsxid: 0
jun 23 09:51:39 pg_node2 patroni: latest checkpoint's nextmultixactid: 1
jun 23 09:51:39 pg_node2 patroni: maximum size of a toast chunk: 1996
jun 23 09:51:39 pg_node2 patroni: maximum data alignment: 8
jun 23 09:51:39 pg_node2 patroni: date/time type storage: 64-bit integers
jun 23 09:51:39 pg_node2 patroni: database block size: 8192
jun 23 09:51:39 pg_node2 patroni: data page checksum version: 1
jun 23 09:51:39 pg_node2 patroni: time of latest checkpoint: wed jun 23 09:37:57 2021
jun 23 09:51:39 pg_node2 patroni: wal_log_hints setting: on
jun 23 09:51:39 pg_node2 patroni: latest checkpoint's full_page_writes: on
jun 23 09:51:39 pg_node2 patroni: end-of-backup record required: no
jun 23 09:51:39 pg_node2 patroni: max_prepared_xacts setting: 0
jun 23 09:51:39 pg_node2 patroni: latest checkpoint's nextmultioffset: 0
jun 23 09:51:39 pg_node2 patroni: backup start location: 0/0
jun 23 09:51:39 pg_node2 patroni: latest checkpoint's oldestmultixid: 1
jun 23 09:51:39 pg_node2 patroni: mock authentication nonce: 020ac2d0808d3cf471a8d90e23263e8a317b31c5f7e494fc3fba551683e5c39b
jun 23 09:51:39 pg_node2 patroni: latest checkpoint's nextoid: 16385
jun 23 09:51:39 pg_node2 patroni: maximum columns in an index: 32
jun 23 09:51:39 pg_node2 patroni: latest checkpoint's oldestxid: 479
jun 23 09:51:39 pg_node2 patroni: catalog version number: 202007201
jun 23 09:51:39 pg_node2 patroni: max_worker_processes setting: 8
jun 23 09:51:39 pg_node2 patroni: maximum length of identifiers: 64
jun 23 09:51:39 pg_node2 patroni: min recovery ending loc's timeline: 6
jun 23 09:51:39 pg_node2 patroni: max_locks_per_xact setting: 64
jun 23 09:51:39 pg_node2 patroni: max_wal_senders setting: 10
jun 23 09:51:39 pg_node2 patroni: latest checkpoint's nextxid: 0:488
jun 23 09:51:39 pg_node2 patroni: latest checkpoint's redo location: 0/5000668
jun 23 09:51:39 pg_node2 patroni: backup end location: 0/0
jun 23 09:51:39 pg_node2 patroni: database cluster state: shut down in recovery
jun 23 09:51:39 pg_node2 patroni: pg_control version number: 1300
jun 23 09:51:39 pg_node2 patroni: wal_level setting: replica
jun 23 09:51:39 pg_node2 patroni: latest checkpoint's redo wal file: 000000060000000000000005
jun 23 09:51:39 pg_node2 patroni: latest checkpoint's oldestcommittsxid: 0
jun 23 09:51:39 pg_node2 patroni: latest checkpoint's oldestxid's db: 1
jun 23 09:51:39 pg_node2 patroni: latest checkpoint's oldestmulti's db: 1
jun 23 09:51:39 pg_node2 patroni: latest checkpoint's prevtimelineid: 6
jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39,055 info: lock owner: pg1; i am pg2
jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39,081 info: local timeline=6 lsn=0/5000750
jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39,088 info: master_timeline=6
jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39,089 info: lock owner: pg1; i am pg2
jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39,105 info: starting as a secondary
jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39,107 info: closed patroni connection to the postgresql cluster
jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39,130 info: postmaster pid=26284
jun 23 09:51:39 pg_node2 patroni: localhost:5432 - no response
jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39.172 cst [26284] log: starting postgresql 13.3 on x86_64-pc-linux-gnu, compiled by gcc (gcc) 4.8.5 20150623 (red hat 4.8.5-44), 64-bit
jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39.172 cst [26284] log: listening on ipv4 address "0.0.0.0", port 5432
jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39.186 cst [26284] log: listening on unix socket "/tmp/.s.pgsql.5432"
jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39.203 cst [26286] log: database system was shut down in recovery at 2021-06-23 09:51:37 cst
jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39.203 cst [26286] log: entering standby mode
jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39.212 cst [26286] log: redo starts at 0/5000668
jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39.212 cst [26286] log: consistent recovery state reached at 0/5000750
jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39.212 cst [26286] log: invalid record length at 0/5000750: wanted 24, got 0
jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39.213 cst [26284] log: database system is ready to accept read only connections
jun 23 09:51:39 pg_node2 patroni: 2021-06-23 09:51:39.229 cst [26290] log: started streaming wal from primary at 0/5000000 on timeline 6
jun 23 09:51:40 pg_node2 patroni: localhost:5432 - accepting connections
jun 23 09:51:40 pg_node2 patroni: localhost:5432 - accepting connections
jun 23 09:51:40 pg_node2 patroni: this is patroni callback on_start replica twpg
jun 23 09:51:40 pg_node2 patroni: 2021-06-23 09:51:40,229 info: lock owner: pg1; i am pg2
jun 23 09:51:40 pg_node2 patroni: 2021-06-23 09:51:40,229 info: does not have lock
jun 23 09:51:40 pg_node2 patroni: 2021-06-23 09:51:40,229 info: establishing a new patroni connection to the postgres cluster
jun 23 09:51:40 pg_node2 patroni: 2021-06-23 09:51:40,261 info: no action. i am a secondary and i am following a leader
jun 23 09:51:50 pg_node2 patroni: 2021-06-23 09:51:50,228 info: lock owner: pg1; i am pg2
jun 23 09:51:50 pg_node2 patroni: 2021-06-23 09:51:50,228 info: does not have lock
jun 23 09:51:50 pg_node2 patroni: 2021-06-23 09:51:50,238 info: no action. i am a secondary and i am following a leader
# Simulate a patroni failure
# Stop the patroni service on the node hosting the primary
[root@pg_node1 ~]# patronictl -c /etc/patroni/patroni.yml list
+ cluster: twpg (6976142033405049133) ---+---------+----+-----------+
| member | host | role | state | tl | lag in mb |
+--------+---------------------+---------+---------+----+-----------+
| pg1 | 192.168.210.15:5432 | leader | running | 6 | |
| pg2 | 192.168.210.81:5432 | replica | running | 6 | 0.0 |
| pg3 | 192.168.210.33:5432 | replica | running | 6 | 0.0 |
+--------+---------------------+---------+---------+----+-----------+
[root@pg_node1 ~]# ip -o -4 a
1: lo inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever
2: eth0 inet 192.168.210.15/24 brd 192.168.210.255 scope global noprefixroute dynamic eth0\ valid_lft 63429sec preferred_lft 63429sec
2: eth0 inet 192.168.210.66/24 scope global secondary eth0\ valid_lft forever preferred_lft forever
[root@pg_node1 ~]# systemctl stop patroni
[root@pg_node1 ~]# ip -o -4 a
1: lo inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever
2: eth0 inet 192.168.210.15/24 brd 192.168.210.255 scope global noprefixroute dynamic eth0\ valid_lft 63357sec preferred_lft 63357sec
[root@pg_node2 ~]# patronictl -c /etc/patroni/patroni.yml list
+ cluster: twpg (6976142033405049133) ---+---------+----+-----------+
| member | host | role | state | tl | lag in mb |
+--------+---------------------+---------+---------+----+-----------+
| pg1 | 192.168.210.15:5432 | leader | running | 6 | |
| pg2 | 192.168.210.81:5432 | replica | running | 6 | 0.0 |
| pg3 | 192.168.210.33:5432 | replica | running | 6 | 0.0 |
+--------+---------------------+---------+---------+----+-----------+
[root@pg_node2 ~]# patronictl -c /etc/patroni/patroni.yml list
+ cluster: twpg (6976142033405049133) ---+---------+----+-----------+
| member | host | role | state | tl | lag in mb |
+--------+---------------------+---------+---------+----+-----------+
| pg1 | 192.168.210.15:5432 | replica | stopped | | unknown |
| pg2 | 192.168.210.81:5432 | replica | running | 7 | 0.0 |
| pg3 | 192.168.210.33:5432 | leader | running | 7 | |
+--------+---------------------+---------+---------+----+-----------+
[root@pg_node2 ~]# patronictl -c /etc/patroni/patroni.yml list
+ cluster: twpg (6976142033405049133) ---+---------+----+-----------+
| member | host | role | state | tl | lag in mb |
+--------+---------------------+---------+---------+----+-----------+
| pg2 | 192.168.210.81:5432 | replica | running | 7 | 0.0 |
| pg3 | 192.168.210.33:5432 | leader | running | 7 | |
+--------+---------------------+---------+---------+----+-----------+
[root@pg_node3 ~]# patronictl -c /etc/patroni/patroni.yml list
+ cluster: twpg (6976142033405049133) ---+---------+----+-----------+
| member | host | role | state | tl | lag in mb |
+--------+---------------------+---------+---------+----+-----------+
| pg1 | 192.168.210.15:5432 | leader | running | 6 | |
| pg2 | 192.168.210.81:5432 | replica | running | 6 | 0.0 |
| pg3 | 192.168.210.33:5432 | replica | running | 6 | 0.0 |
+--------+---------------------+---------+---------+----+-----------+
[root@pg_node3 ~]# patronictl -c /etc/patroni/patroni.yml list
+ cluster: twpg (6976142033405049133) ---+---------+----+-----------+
| member | host | role | state | tl | lag in mb |
+--------+---------------------+---------+---------+----+-----------+
| pg2 | 192.168.210.81:5432 | replica | running | 7 | 0.0 |
| pg3 | 192.168.210.33:5432 | leader | running | 7 | |
+--------+---------------------+---------+---------+----+-----------+
[root@pg_node3 ~]# ip -o -4 a
1: lo inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever
2: eth0 inet 192.168.210.33/24 brd 192.168.210.255 scope global noprefixroute dynamic eth0\ valid_lft 69980sec preferred_lft 69980sec
2: eth0 inet 192.168.210.66/24 scope global secondary eth0\ valid_lft forever preferred_lft forever
# Node 3 log: the database is promoted to primary
jun 23 09:59:07 pg_node3 patroni: 2021-06-23 09:59:07,841 info: lock owner: pg1; i am pg3
jun 23 09:59:07 pg_node3 patroni: 2021-06-23 09:59:07,841 info: does not have lock
jun 23 09:59:07 pg_node3 patroni: 2021-06-23 09:59:07,849 info: no action. i am a secondary and i am following a leader
jun 23 09:59:12 pg_node3 patroni: 2021-06-23 09:59:12.171 cst [25279] log: replication terminated by primary server
jun 23 09:59:12 pg_node3 patroni: 2021-06-23 09:59:12.171 cst [25279] detail: end of wal reached on timeline 6 at 0/50007c8.
jun 23 09:59:12 pg_node3 patroni: 2021-06-23 09:59:12.171 cst [25279] fatal: could not send end-of-streaming message to primary: no copy in progress
jun 23 09:59:12 pg_node3 patroni: 2021-06-23 09:59:12.172 cst [32443] log: invalid record length at 0/50007c8: wanted 24, got 0
jun 23 09:59:12 pg_node3 patroni: 2021-06-23 09:59:12.179 cst [26335] fatal: could not connect to the primary server: could not connect to server: connection refused
jun 23 09:59:12 pg_node3 patroni: is the server running on host "192.168.210.15" and accepting
jun 23 09:59:12 pg_node3 patroni: tcp/ip connections on port 5432?
jun 23 09:59:13 pg_node3 patroni: 2021-06-23 09:59:13,151 warning: request failed to pg1: get http://192.168.210.15:8008/patroni (httpconnectionpool(host=u'192.168.210.15', port=8008): max retries exceeded with url: /patroni (caused by protocolerror('connection aborted.', error(104, 'connection reset by peer'))))
jun 23 09:59:13 pg_node3 patroni: 2021-06-23 09:59:13,153 info: got response from pg2 http://192.168.210.81:8008/patroni: {"database_system_identifier": "6976142033405049133", "postmaster_start_time": "2021-06-23 09:51:39.194 cst", "timeline": 6, "cluster_unlocked": false, "patroni": {"scope": "twpg", "version": "2.0.2"}, "state": "running", "role": "replica", "xlog": {"received_location": 83888072, "replayed_timestamp": null, "paused": false, "replayed_location": 83888072}, "server_version": 130003}
jun 23 09:59:13 pg_node3 patroni: 2021-06-23 09:59:13,247 info: software watchdog activated with 25 second timeout, timing slack 15 seconds
jun 23 09:59:13 pg_node3 patroni: 2021-06-23 09:59:13,264 info: promoted self to leader by acquiring session lock
jun 23 09:59:13 pg_node3 patroni: server promoting
jun 23 09:59:13 pg_node3 patroni: 2021-06-23 09:59:13,273 info: lock owner: pg3; i am pg3
jun 23 09:59:13 pg_node3 patroni: 2021-06-23 09:59:13.276 cst [32443] log: received promote request
jun 23 09:59:13 pg_node3 patroni: 2021-06-23 09:59:13.276 cst [32443] log: redo done at 0/5000750
jun 23 09:59:13 pg_node3 patroni: 2021-06-23 09:59:13,277 info: cleared rewind state after becoming the leader
jun 23 09:59:13 pg_node3 patroni: 2021-06-23 09:59:13.289 cst [32443] log: selected new timeline id: 7
jun 23 09:59:13 pg_node3 patroni: this is patroni callback on_role_change master twpg
jun 23 09:59:13 pg_node3 patroni: 2021-06-23 09:59:13,298 info: updated leader lock during promote
jun 23 09:59:13 pg_node3 systemd: started session c7 of user root.
jun 23 09:59:13 pg_node3 systemd: started session c8 of user root.
jun 23 09:59:13 pg_node3 patroni: 2021-06-23 09:59:13.585 cst [32443] log: archive recovery complete
jun 23 09:59:13 pg_node3 patroni: 2021-06-23 09:59:13.650 cst [32441] log: database system is ready to accept connections
jun 23 09:59:14 pg_node3 patroni: 2021-06-23 09:59:14,318 info: lock owner: pg3; i am pg3
jun 23 09:59:14 pg_node3 patroni: 2021-06-23 09:59:14,435 info: no action. i am the leader with the lock
# Node 2 log: the standby starts following the new leader
jun 23 09:59:10 pg_node2 patroni: 2021-06-23 09:59:10,228 info: lock owner: pg1; i am pg2
jun 23 09:59:10 pg_node2 patroni: 2021-06-23 09:59:10,228 info: does not have lock
jun 23 09:59:10 pg_node2 patroni: 2021-06-23 09:59:10,237 info: no action. i am a secondary and i am following a leader
jun 23 09:59:12 pg_node2 patroni: 2021-06-23 09:59:12.170 cst [26290] log: replication terminated by primary server
jun 23 09:59:12 pg_node2 patroni: 2021-06-23 09:59:12.170 cst [26290] detail: end of wal reached on timeline 6 at 0/50007c8.
jun 23 09:59:12 pg_node2 patroni: 2021-06-23 09:59:12.170 cst [26290] fatal: could not send end-of-streaming message to primary: no copy in progress
jun 23 09:59:12 pg_node2 patroni: 2021-06-23 09:59:12.170 cst [26286] log: invalid record length at 0/50007c8: wanted 24, got 0
jun 23 09:59:12 pg_node2 patroni: 2021-06-23 09:59:12.177 cst [26671] fatal: could not connect to the primary server: could not connect to server: connection refused
jun 23 09:59:12 pg_node2 patroni: is the server running on host "192.168.210.15" and accepting
jun 23 09:59:12 pg_node2 patroni: tcp/ip connections on port 5432?
jun 23 09:59:13 pg_node2 patroni: 2021-06-23 09:59:13,174 warning: request failed to pg1: get http://192.168.210.15:8008/patroni (httpconnectionpool(host=u'192.168.210.15', port=8008): max retries exceeded with url: /patroni (caused by newconnectionerror(': failed to establish a new connection: [errno 111] connection refused' ,)))
jun 23 09:59:13 pg_node2 patroni: 2021-06-23 09:59:13,188 info: got response from pg3 http://192.168.210.33:8008/patroni: {"database_system_identifier": "6976142033405049133", "postmaster_start_time": "2021-06-21 15:43:32.837 cst", "timeline": 6, "cluster_unlocked": true, "patroni": {"scope": "twpg", "version": "2.0.2"}, "state": "running", "role": "replica", "xlog": {"received_location": 83888072, "replayed_timestamp": null, "paused": false, "replayed_location": 83888072}, "server_version": 130003}
jun 23 09:59:13 pg_node2 patroni: 2021-06-23 09:59:13,277 info: could not take out ttl lock
jun 23 09:59:13 pg_node2 patroni: server signaled
jun 23 09:59:13 pg_node2 patroni: 2021-06-23 09:59:13.291 cst [26284] log: received sighup, reloading configuration files
jun 23 09:59:13 pg_node2 patroni: 2021-06-23 09:59:13.292 cst [26284] log: parameter "primary_conninfo" changed to "user=repuser passfile=/home/postgres/pgpass host=192.168.210.33 port=5432 sslmode=prefer application_name=pg2 gssencmode=prefer channel_binding=prefer"
jun 23 09:59:13 pg_node2 patroni: 2021-06-23 09:59:13.296 cst [26682] fatal: could not connect to the primary server: could not connect to server: connection refused
jun 23 09:59:13 pg_node2 patroni: is the server running on host "192.168.210.15" and accepting
jun 23 09:59:13 pg_node2 patroni: tcp/ip connections on port 5432?
jun 23 09:59:13 pg_node2 patroni: 2021-06-23 09:59:13,330 info: following new leader after trying and failing to obtain lock
jun 23 09:59:13 pg_node2 patroni: 2021-06-23 09:59:13,333 info: lock owner: pg3; i am pg2
jun 23 09:59:13 pg_node2 patroni: 2021-06-23 09:59:13,333 info: does not have lock
jun 23 09:59:13 pg_node2 patroni: 2021-06-23 09:59:13,342 info: local timeline=6 lsn=0/50007c8
jun 23 09:59:13 pg_node2 patroni: 2021-06-23 09:59:13,349 info: master_timeline=6
jun 23 09:59:13 pg_node2 patroni: 2021-06-23 09:59:13,354 info: no action. i am a secondary and i am following a leader
jun 23 09:59:14 pg_node2 patroni: 2021-06-23 09:59:14,373 info: lock owner: pg3; i am pg2
jun 23 09:59:14 pg_node2 patroni: 2021-06-23 09:59:14,373 info: does not have lock
jun 23 09:59:14 pg_node2 patroni: 2021-06-23 09:59:14,375 info: no action. i am a secondary and i am following a leader
jun 23 09:59:14 pg_node2 patroni: 2021-06-23 09:59:14,491 info: lock owner: pg3; i am pg2
jun 23 09:59:14 pg_node2 patroni: 2021-06-23 09:59:14,492 info: does not have lock
jun 23 09:59:14 pg_node2 patroni: 2021-06-23 09:59:14,500 info: no action. i am a secondary and i am following a leader
jun 23 09:59:18 pg_node2 patroni: 2021-06-23 09:59:18.306 cst [26690] log: fetching timeline history file for timeline 7 from primary server
jun 23 09:59:18 pg_node2 patroni: 2021-06-23 09:59:18.320 cst [26690] log: started streaming wal from primary at 0/5000000 on timeline 6
jun 23 09:59:18 pg_node2 patroni: 2021-06-23 09:59:18.320 cst [26690] log: replication terminated by primary server
jun 23 09:59:18 pg_node2 patroni: 2021-06-23 09:59:18.320 cst [26690] detail: end of wal reached on timeline 6 at 0/50007c8.
jun 23 09:59:18 pg_node2 patroni: 2021-06-23 09:59:18.321 cst [26286] log: new target timeline is 7
jun 23 09:59:18 pg_node2 patroni: 2021-06-23 09:59:18.322 cst [26690] log: restarted wal streaming at 0/5000000 on timeline 7
jun 23 09:59:24 pg_node2 patroni: 2021-06-23 09:59:24,492 info: lock owner: pg3; i am pg2
jun 23 09:59:24 pg_node2 patroni: 2021-06-23 09:59:24,492 info: does not have lock
jun 23 09:59:24 pg_node2 patroni: 2021-06-23 09:59:24,513 info: no action. i am a secondary and i am following a leader
# The logs above show that when the patroni service on the primary becomes unavailable, a failover takes place (pg3 was elected as the new primary and the VIP drifted to it).
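# To confirm the result, check where the VIP landed and who owns the leader key. A minimal sketch, assuming Patroni's default etcd namespace /service (the key path follows from scope: twpg):
ip addr show | grep 192.168.210.66                    # should now appear on pg_node3
etcdctl get /service/twpg/leader --print-value-only   # should print pg3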
# Simulate an unresponsive patroni process on the primary host
# Killing the patroni process causes the server to reboot
[root@pg_node2 ~]# patronictl -c /etc/patroni/patroni.yml list
cluster: twpg (6976142033405049133) --- --------- ---- -----------
| member | host | role | state | tl | lag in mb |
-------- --------------------- --------- --------- ---- -----------
| pg1 | 192.168.210.15:5432 | replica | running | 10 | 0.0 |
| pg2 | 192.168.210.81:5432 | leader | running | 10 | |
| pg3 | 192.168.210.33:5432 | replica | running | 10 | 0.0 |
-------- --------------------- --------- --------- ---- -----------
kill -9 `pgrep patroni`
[root@pg_node2 ~]# uptime
14:26:01 up 1 min, 1 user, load average: 1.35, 0.41, 0.14
[root@pg_node2 ~]# patronictl -c /etc/patroni/patroni.yml list
cluster: twpg (6976142033405049133) --- --------- ---- -----------
| member | host | role | state | tl | lag in mb |
-------- --------------------- --------- --------- ---- -----------
| pg1 | 192.168.210.15:5432 | replica | running | 10 | 0.0 |
| pg2 | 192.168.210.81:5432 | replica | running | 10 | 0.0 |
| pg3 | 192.168.210.33:5432 | leader | running | 10 | |
-------- --------------------- --------- --------- ---- -----------
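# The reboot is the software watchdog at work: patroni keeps /dev/watchdog open while it holds the leader lock, and when the process dies without refreshing it, the kernel resets the host. A sketch of the relevant patroni.yml section (the values here are illustrative, not necessarily our exact ones):
watchdog:
  mode: automatic        # use the watchdog whenever /dev/watchdog is usable
  device: /dev/watchdog  # backed by the softdog kernel module (modprobe softdog)
  safety_margin: 5       # fire this many seconds before the leader key would expire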
# Simulate host failure (the host running the primary)
# Power the host off immediately: poweroff
[root@pg_node3 ~]# uptime
14:38:38 up 4 days, 22:08, 3 users, load average: 0.03, 0.09, 0.07
[root@pg_node3 ~]# patronictl -c /etc/patroni/patroni.yml list
cluster: twpg (6976142033405049133) --- --------- ---- -----------
| member | host | role | state | tl | lag in mb |
-------- --------------------- --------- --------- ---- -----------
| pg1 | 192.168.210.15:5432 | replica | running | 10 | 0.0 |
| pg2 | 192.168.210.81:5432 | replica | running | 10 | 0.0 |
| pg3 | 192.168.210.33:5432 | leader | running | 10 | |
-------- --------------------- --------- --------- ---- -----------
[root@pg_node3 ~]# poweroff
[root@pg_node2 ~]# patronictl -c /etc/patroni/patroni.yml list
cluster: twpg (6976142033405049133) --- --------- ---- -----------
| member | host | role | state | tl | lag in mb |
-------- --------------------- --------- --------- ---- -----------
| pg1 | 192.168.210.15:5432 | leader | running | 11 | |
| pg2 | 192.168.210.81:5432 | replica | running | 11 | 0.0 |
| pg3 | 192.168.210.33:5432 | replica | stopped | | unknown |
-------- --------------------- --------- --------- ---- -----------
[root@pg_node2 ~]# patronictl -c /etc/patroni/patroni.yml list
2021-06-23 14:42:14,614 - error - request to server http://192.168.210.33:2379 failed: maxretryerror("httpconnectionpool(host=u'192.168.210.33', port=2379): max retries exceeded with url: /v3/kv/range (caused by connecttimeouterror(, u'connection to 192.168.210.33 timed out. (connect timeout=1.25)'))" ,)
cluster: twpg (6976142033405049133) --- --------- ---- -----------
| member | host | role | state | tl | lag in mb |
-------- --------------------- --------- --------- ---- -----------
| pg1 | 192.168.210.15:5432 | leader | running | 11 | |
| pg2 | 192.168.210.81:5432 | replica | running | 11 | 0.0 |
| pg3 | 192.168.210.33:5432 | replica | stopped | | unknown |
-------- --------------------- --------- --------- ---- -----------
[root@pg_node2 ~]# patronictl -c /etc/patroni/patroni.yml list
cluster: twpg (6976142033405049133) --- --------- ---- -----------
| member | host | role | state | tl | lag in mb |
-------- --------------------- --------- --------- ---- -----------
| pg1 | 192.168.210.15:5432 | leader | running | 11 | |
| pg2 | 192.168.210.81:5432 | replica | running | 11 | 0.0 |
-------- --------------------- --------- --------- ---- -----------
# Start the pg3 node again
[root@pg_node3 ~]# uptime
14:46:14 up 0 min, 1 user, load average: 0.71, 0.16, 0.05
[root@pg_node3 ~]# patronictl -c /etc/patroni/patroni.yml list
cluster: twpg (6976142033405049133) --- --------- ---- -----------
| member | host | role | state | tl | lag in mb |
-------- --------------------- --------- --------- ---- -----------
| pg1 | 192.168.210.15:5432 | leader | running | 11 | |
| pg2 | 192.168.210.81:5432 | replica | running | 11 | 0.0 |
| pg3 | 192.168.210.33:5432 | replica | running | 10 | 0.0 |
-------- --------------------- --------- --------- ---- -----------
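# Note that pg3 rejoined on timeline 10 and then caught up with the new leader. The streaming state of both standbys can be watched from the current leader (pg1 here); a sketch, run as the postgres superuser:
psql -d postgres -c 'select application_name, state, sync_state, replay_lsn from pg_stat_replication;'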
# Manual switchover
[root@pg_node1 ~]# patronictl switchover
master [pg1]:
candidate ['pg2', 'pg3'] []: ^C
aborted!
[root@pg_node1 ~]# patronictl switchover
master [pg1]:
candidate ['pg2', 'pg3'] []: pg2
when should the switchover take place (e.g. 2021-06-23t16:12 ) [now]:
current cluster topology
cluster: twpg (6976142033405049133) --- --------- ---- -----------
| member | host | role | state | tl | lag in mb |
-------- --------------------- --------- --------- ---- -----------
| pg1 | 192.168.210.15:5432 | leader | running | 11 | |
| pg2 | 192.168.210.81:5432 | replica | running | 11 | 0.0 |
| pg3 | 192.168.210.33:5432 | replica | running | 11 | 0.0 |
-------- --------------------- --------- --------- ---- -----------
are you sure you want to switchover cluster twpg, demoting current master pg1? [y/n]: y
2021-06-23 15:13:26.97922 successfully switched over to "pg2"
cluster: twpg (6976142033405049133) --- --------- ---- -----------
| member | host | role | state | tl | lag in mb |
-------- --------------------- --------- --------- ---- -----------
| pg1 | 192.168.210.15:5432 | replica | stopped | | unknown |
| pg2 | 192.168.210.81:5432 | leader | running | 11 | |
| pg3 | 192.168.210.33:5432 | replica | running | 11 | 0.0 |
-------- --------------------- --------- --------- ---- -----------
[root@pg_node1 ~]# patronictl list
cluster: twpg (6976142033405049133) --- --------- ---- -----------
| member | host | role | state | tl | lag in mb |
-------- --------------------- --------- --------- ---- -----------
| pg1 | 192.168.210.15:5432 | replica | stopped | | unknown |
| pg2 | 192.168.210.81:5432 | leader | running | 12 | |
| pg3 | 192.168.210.33:5432 | replica | running | 11 | 0.0 |
-------- --------------------- --------- --------- ---- -----------
[root@pg_node1 ~]# patronictl list
cluster: twpg (6976142033405049133) --- --------- ---- -----------
| member | host | role | state | tl | lag in mb |
-------- --------------------- --------- --------- ---- -----------
| pg1 | 192.168.210.15:5432 | replica | running | 12 | 0.0 |
| pg2 | 192.168.210.81:5432 | leader | running | 12 | |
| pg3 | 192.168.210.33:5432 | replica | running | 12 | 0.0 |
-------- --------------------- --------- --------- ---- -----------
# The corresponding log:
jun 23 15:13:16 pg_node1 patroni: 2021-06-23 15:13:16,311 info: lock owner: pg1; i am pg1
jun 23 15:13:16 pg_node1 patroni: 2021-06-23 15:13:16,316 info: no action. i am the leader with the lock
jun 23 15:13:24 pg_node1 patroni: 2021-06-23 15:13:24,854 info: received switchover request with leader=pg1 candidate=pg2 scheduled_at=none
jun 23 15:13:24 pg_node1 patroni: 2021-06-23 15:13:24,879 info: got response from pg2 http://192.168.210.81:8008/patroni: {"database_system_identifier": "6976142033405049133", "postmaster_start_time": "2021-06-23 14:25:40.252 cst", "timeline": 11, "cluster_unlocked": false, "patroni": {"scope": "twpg", "version": "2.0.2"}, "state": "running", "role": "replica", "xlog": {"received_location": 84186784, "replayed_timestamp": "2021-06-23 14:25:40.045 cst", "paused": false, "replayed_location": 84186784}, "server_version": 130003}
jun 23 15:13:25 pg_node1 patroni: 2021-06-23 15:13:25,011 info: lock owner: pg1; i am pg1
jun 23 15:13:25 pg_node1 patroni: 2021-06-23 15:13:25,031 info: got response from pg2 http://192.168.210.81:8008/patroni: {"database_system_identifier": "6976142033405049133", "postmaster_start_time": "2021-06-23 14:25:40.252 cst", "timeline": 11, "cluster_unlocked": false, "patroni": {"scope": "twpg", "version": "2.0.2"}, "state": "running", "role": "replica", "xlog": {"received_location": 84186784, "replayed_timestamp": "2021-06-23 14:25:40.045 cst", "paused": false, "replayed_location": 84186784}, "server_version": 130003}
jun 23 15:13:25 pg_node1 patroni: 2021-06-23 15:13:25,134 info: manual failover: demoting myself
jun 23 15:13:25 pg_node1 patroni: 2021-06-23 15:13:25.222 cst [4256] log: received fast shutdown request
jun 23 15:13:25 pg_node1 patroni: 2021-06-23 15:13:25.234 cst [4256] log: aborting any active transactions
jun 23 15:13:25 pg_node1 patroni: 2021-06-23 15:13:25.234 cst [4270] fatal: terminating connection due to administrator command
jun 23 15:13:25 pg_node1 patroni: 2021-06-23 15:13:25.235 cst [4256] log: background worker "logical replication launcher" (pid 4522) exited with exit code 1
jun 23 15:13:25 pg_node1 patroni: 2021-06-23 15:13:25.237 cst [4259] log: shutting down
jun 23 15:13:25 pg_node1 patroni: 2021-06-23 15:13:25.306 cst [4256] log: database system is shut down
jun 23 15:13:26 pg_node1 patroni: this is patroni callback on_stop master twpg
jun 23 15:13:26 pg_node1 systemd: started session c5 of user root.
jun 23 15:13:26 pg_node1 patroni: 2021-06-23 15:13:26,275 info: leader key released
jun 23 15:13:26 pg_node1 patroni: 2021-06-23 15:13:26,277 info: lock owner: none; i am pg1
jun 23 15:13:26 pg_node1 patroni: 2021-06-23 15:13:26,277 info: not healthy enough for leader race
jun 23 15:13:26 pg_node1 patroni: 2021-06-23 15:13:26,277 info: manual failover: demote in progress
jun 23 15:13:26 pg_node1 patroni: 2021-06-23 15:13:26,278 info: lock owner: none; i am pg1
jun 23 15:13:26 pg_node1 patroni: 2021-06-23 15:13:26,279 info: not healthy enough for leader race
jun 23 15:13:26 pg_node1 patroni: 2021-06-23 15:13:26,279 info: manual failover: demote in progress
jun 23 15:13:26 pg_node1 patroni: 2021-06-23 15:13:26,296 info: lock owner: pg2; i am pg1
jun 23 15:13:26 pg_node1 patroni: 2021-06-23 15:13:26,297 info: manual failover: demote in progress
jun 23 15:13:27 pg_node1 patroni: 2021-06-23 15:13:27,416 info: lock owner: pg2; i am pg1
jun 23 15:13:27 pg_node1 patroni: 2021-06-23 15:13:27,417 info: manual failover: demote in progress
jun 23 15:13:27 pg_node1 patroni: 2021-06-23 15:13:27,532 info: lock owner: pg2; i am pg1
jun 23 15:13:27 pg_node1 patroni: 2021-06-23 15:13:27,532 info: manual failover: demote in progress
jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28,292 info: local timeline=11 lsn=0/5049750
jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28,300 info: master_timeline=12
jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28,315 info: master: history=8#0110/5014b48#011no recovery target specified
jun 23 15:13:28 pg_node1 patroni: 9#0110/502cce8#011no recovery target specified
jun 23 15:13:28 pg_node1 patroni: 10#0110/5048a98#011no recovery target specified
jun 23 15:13:28 pg_node1 patroni: 11#0110/50497c8#011no recovery target specified
jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28,317 info: closed patroni connection to the postgresql cluster
jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28,348 info: postmaster pid=4739
jun 23 15:13:28 pg_node1 patroni: localhost:5432 - no response
jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28.386 cst [4739] log: starting postgresql 13.3 on x86_64-pc-linux-gnu, compiled by gcc (gcc) 4.8.5 20150623 (red hat 4.8.5-44), 64-bit
jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28.386 cst [4739] log: listening on ipv4 address "0.0.0.0", port 5432
jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28.401 cst [4739] log: listening on unix socket "/tmp/.s.pgsql.5432"
jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28.417 cst [4741] log: database system was shut down at 2021-06-23 15:13:25 cst
jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28.418 cst [4741] log: entering standby mode
jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28.428 cst [4741] log: consistent recovery state reached at 0/50497c8
jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28.428 cst [4741] log: invalid record length at 0/50497c8: wanted 24, got 0
jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28.429 cst [4739] log: database system is ready to accept read only connections
jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28.447 cst [4745] log: fetching timeline history file for timeline 12 from primary server
jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28.460 cst [4745] log: started streaming wal from primary at 0/5000000 on timeline 11
jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28.463 cst [4745] log: replication terminated by primary server
jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28.463 cst [4745] detail: end of wal reached on timeline 11 at 0/50497c8.
jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28.463 cst [4741] log: new target timeline is 12
jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28.464 cst [4745] log: restarted wal streaming at 0/5000000 on timeline 12
jun 23 15:13:28 pg_node1 patroni: 2021-06-23 15:13:28.703 cst [4741] log: redo starts at 0/50497c8
jun 23 15:13:29 pg_node1 patroni: localhost:5432 - accepting connections
jun 23 15:13:29 pg_node1 patroni: localhost:5432 - accepting connections
jun 23 15:13:29 pg_node1 patroni: this is patroni callback on_role_change replica twpg
jun 23 15:13:29 pg_node1 systemd: started session c6 of user root.
jun 23 15:13:37 pg_node1 patroni: 2021-06-23 15:13:37,534 info: lock owner: pg2; i am pg1
jun 23 15:13:37 pg_node1 patroni: 2021-06-23 15:13:37,534 info: does not have lock
jun 23 15:13:37 pg_node1 patroni: 2021-06-23 15:13:37,535 info: establishing a new patroni connection to the postgres cluster
jun 23 15:13:37 pg_node1 patroni: 2021-06-23 15:13:37,579 info: no action. i am a secondary and i am following a leader
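# A switchover can also be requested non-interactively through the REST API; a sketch mirroring the interactive run above, posting to the /switchover endpoint (payload keys as documented by Patroni):
curl -s -XPOST -d '{"leader": "pg1", "candidate": "pg2"}' http://localhost:8008/switchover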
# Manual failover
[root@pg_node1 ~]# patronictl failover
candidate ['pg1', 'pg3'] []: pg1
current cluster topology
cluster: twpg (6976142033405049133) --- --------- ---- -----------
| member | host | role | state | tl | lag in mb |
-------- --------------------- --------- --------- ---- -----------
| pg1 | 192.168.210.15:5432 | replica | running | 12 | 0.0 |
| pg2 | 192.168.210.81:5432 | leader | running | 12 | |
| pg3 | 192.168.210.33:5432 | replica | running | 12 | 0.0 |
-------- --------------------- --------- --------- ---- -----------
are you sure you want to failover cluster twpg, demoting current master pg2? [y/n]: y
2021-06-23 15:17:33.25244 successfully failed over to "pg1"
cluster: twpg (6976142033405049133) --- --------- ---- -----------
| member | host | role | state | tl | lag in mb |
-------- --------------------- --------- --------- ---- -----------
| pg1 | 192.168.210.15:5432 | leader | running | 12 | |
| pg2 | 192.168.210.81:5432 | replica | stopped | | unknown |
| pg3 | 192.168.210.33:5432 | replica | running | 12 | 0.0 |
-------- --------------------- --------- --------- ---- -----------
[root@pg_node1 ~]# patronictl list
cluster: twpg (6976142033405049133) --- --------- ---- -----------
| member | host | role | state | tl | lag in mb |
-------- --------------------- --------- --------- ---- -----------
| pg1 | 192.168.210.15:5432 | leader | running | 13 | |
| pg2 | 192.168.210.81:5432 | replica | stopped | | unknown |
| pg3 | 192.168.210.33:5432 | replica | running | 13 | 0.0 |
-------- --------------------- --------- --------- ---- -----------
[root@pg_node1 ~]# patronictl list
cluster: twpg (6976142033405049133) --- --------- ---- -----------
| member | host | role | state | tl | lag in mb |
-------- --------------------- --------- --------- ---- -----------
| pg1 | 192.168.210.15:5432 | leader | running | 13 | |
| pg2 | 192.168.210.81:5432 | replica | running | 13 | 0.0 |
| pg3 | 192.168.210.33:5432 | replica | running | 13 | 0.0 |
-------- --------------------- --------- --------- ---- -----------
# The corresponding log:
jun 23 15:17:27 pg_node1 patroni: 2021-06-23 15:17:27,609 info: lock owner: pg2; i am pg1
jun 23 15:17:27 pg_node1 patroni: 2021-06-23 15:17:27,609 info: does not have lock
jun 23 15:17:27 pg_node1 patroni: 2021-06-23 15:17:27,612 info: no action. i am a secondary and i am following a leader
jun 23 15:17:31 pg_node1 patroni: 2021-06-23 15:17:31.544 cst [4745] log: replication terminated by primary server
jun 23 15:17:31 pg_node1 patroni: 2021-06-23 15:17:31.544 cst [4745] detail: end of wal reached on timeline 12 at 0/504a428.
jun 23 15:17:31 pg_node1 patroni: 2021-06-23 15:17:31.544 cst [4745] fatal: could not send end-of-streaming message to primary: no copy in progress
jun 23 15:17:31 pg_node1 patroni: 2021-06-23 15:17:31.545 cst [4741] log: invalid record length at 0/504a428: wanted 24, got 0
jun 23 15:17:31 pg_node1 patroni: 2021-06-23 15:17:31.550 cst [4810] fatal: could not connect to the primary server: could not connect to server: connection refused
jun 23 15:17:31 pg_node1 patroni: is the server running on host "192.168.210.81" and accepting
jun 23 15:17:31 pg_node1 patroni: tcp/ip connections on port 5432?
jun 23 15:17:32 pg_node1 patroni: 2021-06-23 15:17:32,556 info: cleaning up failover key after acquiring leader lock...
jun 23 15:17:32 pg_node1 patroni: 2021-06-23 15:17:32,565 info: software watchdog activated with 25 second timeout, timing slack 15 seconds # watchdog activated
jun 23 15:17:32 pg_node1 patroni: 2021-06-23 15:17:32,577 info: promoted self to leader by acquiring session lock
jun 23 15:17:32 pg_node1 patroni: server promoting # the database is being promoted
jun 23 15:17:32 pg_node1 patroni: 2021-06-23 15:17:32,579 info: lock owner: pg1; i am pg1
jun 23 15:17:32 pg_node1 patroni: 2021-06-23 15:17:32.589 cst [4741] log: received promote request
jun 23 15:17:32 pg_node1 patroni: 2021-06-23 15:17:32.589 cst [4741] log: redo done at 0/504a3b0
jun 23 15:17:32 pg_node1 patroni: 2021-06-23 15:17:32,593 info: cleared rewind state after becoming the leader
jun 23 15:17:32 pg_node1 patroni: 2021-06-23 15:17:32.594 cst [4741] log: selected new timeline id: 13
jun 23 15:17:32 pg_node1 patroni: this is patroni callback on_role_change master twpg # the callback runs the VIP drift script
jun 23 15:17:32 pg_node1 patroni: 2021-06-23 15:17:32,611 info: updated leader lock during promote
jun 23 15:17:32 pg_node1 systemd: started session c7 of user root.
jun 23 15:17:32 pg_node1 systemd: started session c8 of user root.
jun 23 15:17:32 pg_node1 patroni: 2021-06-23 15:17:32.837 cst [4741] log: archive recovery complete
jun 23 15:17:32 pg_node1 patroni: 2021-06-23 15:17:32.869 cst [4739] log: database system is ready to accept connections
jun 23 15:17:33 pg_node1 patroni: 2021-06-23 15:17:33,631 info: lock owner: pg1; i am pg1
jun 23 15:17:33 pg_node1 patroni: 2021-06-23 15:17:33,772 info: no action. i am the leader with the lock
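# The REST equivalent of the interactive failover; a sketch posting to the /failover endpoint (only a candidate is required):
curl -s -XPOST -d '{"candidate": "pg1"}' http://localhost:8008/failover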
# View the cluster configuration
[root@pg_node1 ~]# patronictl show-config
loop_wait: 10
master_start_timeout: 300
maximum_lag_on_failover: 1048576
postgresql:
  parameters:
    hot_standby: 'on'
    listen_addresses: 0.0.0.0
    max_replication_slots: 10
    max_wal_senders: 10
    port: 5432
    wal_keep_segments: 256
    wal_level: replica
    wal_log_hints: 'on'
  use_pg_rewind: true
  use_slots: true
retry_timeout: 10
synchronous_mode: false
ttl: 30
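# The same dynamic configuration is served by the REST API, so it can also be read with:
curl -s http://localhost:8008/config | jq .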
# Modify PG parameters
# e.g. set shared_buffers: '1GB'
[root@pg_node1 ~]# patronictl edit-config twpg
---
@@ -4,6 +4,7 @@
 postgresql:
   parameters:
     hot_standby: 'on'
+    shared_buffers: '1GB'
     listen_addresses: 0.0.0.0
     max_replication_slots: 10
     max_wal_senders: 10
apply these changes? [y/n]: y
configuration changed
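# For scripted changes the interactive editor can be skipped; a hedged sketch using edit-config's set options (flag behavior assumed from patroni 2.x, where -p sets an entry under postgresql.parameters and --force skips the confirmation prompt):
patronictl -c /etc/patroni/patroni.yml edit-config twpg -p 'shared_buffers=1GB' --force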
[root@pg_node1 ~]# patronictl restart twpg
cluster: twpg (6976142033405049133) --- --------- ---- ----------- -----------------
| member | host | role | state | tl | lag in mb | pending restart |
-------- --------------------- --------- --------- ---- ----------- -----------------
| pg1 | 192.168.210.15:5432 | leader | running | 13 | | * |
| pg2 | 192.168.210.81:5432 | replica | running | 13 | 0.0 | * |
| pg3 | 192.168.210.33:5432 | replica | running | 13 | 0.0 | * |
-------- --------------------- --------- --------- ---- ----------- -----------------
when should the restart take place (e.g. 2021-06-23t16:45) [now]:
are you sure you want to restart members pg2, pg3, pg1? [y/n]: y
restart if the postgresql version is less than provided (e.g. 9.5.2) []:
success: restart on member pg2
success: restart on member pg3
success: restart on member pg1
[root@pg_node2 ~]# psql -U twsm -d postgres -c 'show shared_buffers;'
 shared_buffers
----------------
 1GB
(1 row)
# Alternatively, change it via the REST API, e.g. set max_connections to 1000:
curl -s -XPATCH -d '{"postgresql":{"parameters":{"max_connections":"1000"}}}' http://localhost:8008/config | jq .
# To reset or remove the parameter:
curl -s -XPATCH -d '{"postgresql":{"parameters":{"max_connections":null}}}' http://localhost:8008/config | jq .
# Changing max_connections requires a restart (note the pending restart flag)
[root@pg_node1 ~]# patronictl list
cluster: twpg (6976142033405049133) --- --------- ---- ----------- -----------------
| member | host | role | state | tl | lag in mb | pending restart |
-------- --------------------- --------- --------- ---- ----------- -----------------
| pg1 | 192.168.210.15:5432 | replica | running | 14 | 0.0 | * |
| pg2 | 192.168.210.81:5432 | replica | running | 14 | 0.0 | * |
| pg3 | 192.168.210.33:5432 | leader | running | 14 | | * |
-------- --------------------- --------- --------- ---- ----------- -----------------
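# Since only a restart clears the flag, the members marked "pending restart" can be restarted selectively; a sketch (the --pending/--force flags are assumed from patroni 2.x):
patronictl restart twpg --pending --force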
# To unconditionally rewrite the entire existing dynamic configuration:
curl -s -XPUT -d '{"maximum_lag_on_failover":1048576,"retry_timeout":10,"postgresql":{"use_slots":true,"use_pg_rewind":true,"parameters":{"hot_standby":"on","wal_log_hints":"on","wal_level":"hot_standby","unix_socket_directories":".","max_wal_senders":5}},"loop_wait":3,"ttl":20}' http://localhost:8008/config | jq .
# View the history of failovers/switchovers
[root@pg_node1 ~]# patronictl history
---- ---------- ------------------------------ ----------------------------------
| tl | lsn | reason | timestamp |
---- ---------- ------------------------------ ----------------------------------
| 1 | 25210568 | no recovery target specified | 2021-06-21t15:24:42.878129+08:00 |
| 2 | 25211144 | no recovery target specified | 2021-06-21t15:33:23.462860+08:00 |
| 3 | 83886408 | no recovery target specified | 2021-06-23t09:28:25.304246+08:00 |
| 4 | 83886920 | no recovery target specified | 2021-06-23t09:35:14.536083+08:00 |
| 5 | 83887496 | no recovery target specified | 2021-06-23t09:37:56.880226+08:00 |
| 6 | 83888072 | no recovery target specified | 2021-06-23t09:59:13.584445+08:00 |
| 7 | 83888704 | no recovery target specified | 2021-06-23t12:39:14.897591+08:00 |
| 8 | 83970888 | no recovery target specified | 2021-06-23t13:55:26.207367+08:00 |
| 9 | 84069608 | no recovery target specified | 2021-06-23t14:24:48.030224+08:00 |
| 10 | 84183704 | no recovery target specified | 2021-06-23t14:41:45.263156+08:00 |
| 11 | 84187080 | no recovery target specified | 2021-06-23t15:13:26.627326+08:00 |
| 12 | 84190248 | no recovery target specified | 2021-06-23t15:17:32.836962+08:00 |
---- ---------- ------------------------------ ----------------------------------
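# The same timeline history is exposed as JSON by the REST API:
curl -s http://localhost:8008/history | jq .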
# Put the cluster into maintenance mode to prevent automatic failover
[root@pg_node1 ~]# patronictl pause
success: cluster management is paused
[root@pg_node1 ~]# patronictl list
cluster: twpg (6976142033405049133) --- --------- ---- -----------
| member | host | role | state | tl | lag in mb |
-------- --------------------- --------- --------- ---- -----------
| pg1 | 192.168.210.15:5432 | leader | running | 13 | |
| pg2 | 192.168.210.81:5432 | replica | running | 13 | 0.0 |
| pg3 | 192.168.210.33:5432 | replica | running | 13 | 0.0 |
-------- --------------------- --------- --------- ---- -----------
maintenance mode: on
# Log:
jun 23 16:00:02 pg_node1 patroni: 2021-06-23 16:00:02,006 info: lock owner: pg1; i am pg1
jun 23 16:00:02 pg_node1 patroni: 2021-06-23 16:00:02,018 info: pause: no action. i am the leader with the lock
jun 23 16:00:02 pg_node1 patroni: 2021-06-23 16:00:02,028 info: no postgresql configuration items changed, nothing to reload.
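# Pause/resume can also be toggled through the REST API by patching the pause flag in the dynamic configuration; a sketch:
curl -s -XPATCH -d '{"pause": true}' http://localhost:8008/config | jq .
curl -s -XPATCH -d '{"pause": false}' http://localhost:8008/config | jq .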
# When maintenance is done, resume automatic failover
[root@pg_node1 ~]# patronictl resume
success: cluster management is resumed
[root@pg_node1 ~]# patronictl list
cluster: twpg (6976142033405049133) --- --------- ---- -----------
| member | host | role | state | tl | lag in mb |
-------- --------------------- --------- --------- ---- -----------
| pg1 | 192.168.210.15:5432 | leader | running | 13 | |
| pg2 | 192.168.210.81:5432 | replica | running | 13 | 0.0 |
| pg3 | 192.168.210.33:5432 | replica | running | 13 | 0.0 |
-------- --------------------- --------- --------- ---- -----------
# Log:
jun 23 16:02:32 pg_node1 patroni: 2021-06-23 16:02:32,006 info: lock owner: pg1; i am pg1
jun 23 16:02:32 pg_node1 patroni: 2021-06-23 16:02:32,012 info: pause: no action. i am the leader with the lock
jun 23 16:02:39 pg_node1 patroni: 2021-06-23 16:02:39,565 info: lock owner: pg1; i am pg1
jun 23 16:02:39 pg_node1 patroni: 2021-06-23 16:02:39,572 info: software watchdog activated with 25 second timeout, timing slack 15 seconds
jun 23 16:02:39 pg_node1 patroni: 2021-06-23 16:02:39,581 info: no action. i am the leader with the lock
jun 23 16:02:39 pg_node1 patroni: 2021-06-23 16:02:39,588 info: no postgresql configuration items changed, nothing to reload.
jun 23 16:02:49 pg_node1 patroni: 2021-06-23 16:02:49,552 info: lock owner: pg1; i am pg1
jun 23 16:02:49 pg_node1 patroni: 2021-06-23 16:02:49,558 info: no action. i am the leader with the lock
Failure scenarios and how Patroni handles them:
Location | Scenario | Patroni action |
---|---|---|
standby | PG standby stopped | brings the PG standby service back up |
standby | patroni on the standby stopped normally | stops the PG standby |
standby | patroni on the standby stopped abnormally | no action |
standby | standby cannot connect to etcd | no action |
standby | node is not the leader but PG is running in production (read/write) mode | restarts PG and puts it back into recovery mode as a standby |
standby | standby host rebooted | patroni starts and brings up the PG standby |
primary | PG primary stopped | starts PG; if startup takes longer than master_start_timeout, a switchover is performed |
primary | patroni on the primary stopped normally | shuts down the primary and elects a new leader from the standbys |
primary | patroni on the primary stopped abnormally | the watchdog reboots the host; after it comes back, if the node reacquires the leader lock there is no switchover, otherwise a new leader is elected |
primary | primary cannot connect to etcd | the primary demotes itself to a standby, triggering a failover |
- | etcd cluster failure | the primary demotes itself; the cluster then consists entirely of standbys |
- | synchronous mode with no synchronous standby available | temporarily switches to asynchronous mode; automatic failover is unavailable until synchronous mode is restored |
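# For steering client traffic around these failures, Patroni's role-aware health checks are useful (e.g. behind haproxy): /master returns 200 only on the leader and /replica only on a running standby. A sketch:
curl -s -o /dev/null -w '%{http_code}\n' http://192.168.210.15:8008/master    # 200 on the leader, 503 otherwise
curl -s -o /dev/null -w '%{http_code}\n' http://192.168.210.15:8008/replica   # 200 on a healthy replica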
References:
https://patroni.readthedocs.io/en/latest/readme.html
https://github.com/zalando/patroni/blob/master/docs/settings.rst
http://blog.itpub.net/30496307/viewspace-2764349/
https://mp.weixin.qq.com/s/edvwktb-wf7yyvafz5gcfw