说明:本系列文章仅用于共享我的学习成果,未经过生产系统考验,对于知识点和一些组件的使用会不定时更新,仅供参考,如有错误的地方,欢迎留言共同学习。
本高可用系列测试不说理论知识,如有需要自行百度,因生产环境大多数是内网环境,无法连接互联网,为模拟生产环境安装,postgresql高可用测试均采用离线部署。
所需软件包均以打包上传百度网盘,如有需要自行下载: 提取码:n9w2 文件名:postgresql_ha.tar.gz
第一章: 介绍测试环境
第二章: postgresql replication 部署
第三章: etcd 部署和管理
第四章: patroni 部署和管理
第五章: haproxy keepalived 部署和管理
第六章: 高可用模拟故障测试用例
第七章: prometheus grafana 监控部署
第八章: 高可用管理
- patroni 基于 python 开发的模板,结合 dcs (例如 zookeeper, etcd, consul )可以定制 postgresql 高可用方案。
- patroni 并不是一套拿来即用的 postgresql 高可用组件,涉及较多的配置和定制工作。
- patroni 接管 postgresql 数据库的启停,同时监控本地的 postgresql 数据库,并将本地的 postgresql 数据库信息写入dcs。
- patroni 的主备端是通过是否能获得 leader key 来控制的,获取到了 leader key 的 patroni 为主节点,其它的为备节点。
- patroni 支持级联复制,支持同步和异步模式,支持 failover、switchovers、重新初始化集群等。
- patroni 官方文档:
1. 所有节点安装python3
patroni 基于 python 开发的模板,需要运行在 python 环境下。
# 解压并安装python3
[root@pgtest1 ~]# cd /enmo/soft
[root@pgtest1 soft]# tar -zxvf python3-rpm.tar.gz
# 配置yum源
[root@pgtest1 ~]# cat > /etc/yum.repos.d/enmo.repo << eof
[server]
name=server
baseurl=file:///media/cdrom
enabled=yes
gpgcheck=0
[python3]
name=python3
baseurl=file:///enmo/soft/python3-rpm
enabled=yes
gpgcheck=0
eof
# 执行yum安装
[root@pgtest1 ~]# yum clean all
[root@pgtest1 ~]# yum install python3 python3-devel -y
# 修改软链接
[root@pgtest1 ~]# rm -f /usr/bin/python
[root@pgtest1 ~]# ln -s /usr/bin/python3 /usr/bin/python
# 查看版本,确认安装成功
[root@pgtest1 ~]# python -v
python 3.6.8
# 全词匹配替换,python3安装后’yum’命令执行会报错,需要修改以下配置
[root@pgtest1 ~]# sed -i "s:\:python2:g" /usr/bin/yum
[root@pgtest1 ~]# sed -i "s:\:python2:g" /usr/libexec/urlgrabber-ext-down
2. 所有节点使用pip3安装patroni
在线连接互联网安装
# pip3 install patroni[etcd] -i https://mirrors.aliyun.com/pypi/simple/
# pip3 install psycopg2-binary -i https://mirrors.aliyun.com/pypi/simple/
# pip3 install psycopg2 -i https://mirrors.aliyun.com/pypi/simple/
以下使用离线安装
软件包下载,网盘中postgresql_ha.tar.gz文件中已包含以下安装包(patroni_etcd_2.1.1.tar.gz)
psutil-5.8.0.tar.gz https://mirrors.aliyun.com/pypi/simple/psutil/
ydiff-1.2.tar.gz https://mirrors.aliyun.com/pypi/simple/ydiff/
click-8.0.3-py3-none-any.whl https://mirrors.aliyun.com/pypi/simple/click/
dnspython-2.1.0-py3-none-any.whl https://mirrors.aliyun.com/pypi/simple/dnspython/
importlib_metadata-4.8.1-py3-none-any.whl https://mirrors.aliyun.com/pypi/simple/importlib-metadata/
patroni-2.1.1-py3-none-any.whl https://mirrors.aliyun.com/pypi/simple/patroni/
prettytable-2.2.1-py3-none-any.whl https://mirrors.aliyun.com/pypi/simple/prettytable/
python_dateutil-2.8.2-py2.py3-none-any.whl https://mirrors.aliyun.com/pypi/simple/python-dateutil/
pyyaml-6.0-cp36-cp36m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl https://mirrors.aliyun.com/pypi/simple/pyyaml/
six-1.16.0-py2.py3-none-any.whl https://mirrors.aliyun.com/pypi/simple/six/
typing_extensions-3.10.0.2-py3-none-any.whl https://mirrors.aliyun.com/pypi/simple/typing-extensions/
urllib3-1.26.7-py2.py3-none-any.whl https://mirrors.aliyun.com/pypi/simple/urllib3/
wcwidth-0.2.5-py2.py3-none-any.whl https://mirrors.aliyun.com/pypi/simple/wcwidth/
zipp-3.6.0-py3-none-any.whl https://mirrors.aliyun.com/pypi/simple/zipp/
python-etcd-0.4.5.tar.gz https://mirrors.aliyun.com/pypi/simple/python-etcd/
psycopg2-binary-2.9.1.tar.gz https://mirrors.aliyun.com/pypi/simple/psycopg2-binary/
psycopg2-2.9.1.tar.gz https://mirrors.aliyun.com/pypi/simple/psycopg2/
解压安装包并使用pip3安装patroni
[root@pgtest1 ~]# cd /enmo/soft
[root@pgtest1 soft]# tar -zxvf patroni_etcd_2.1.1.tar.gz
[root@pgtest1 soft]# cd patroni_etcd_2.1.1
# 按顺序执行安装
[root@pgtest1 patroni_etcd_2.1.1]# pip3 install psutil-5.8.0.tar.gz
[root@pgtest1 patroni_etcd_2.1.1]# pip3 install ydiff-1.2.tar.gz
[root@pgtest1 patroni_etcd_2.1.1]# pip3 install *.whl
[root@pgtest1 patroni_etcd_2.1.1]# pip3 install python-etcd-0.4.5.tar.gz
[root@pgtest1 patroni_etcd_2.1.1]# pip3 install psycopg2-binary-2.9.1.tar.gz
[root@pgtest1 patroni_etcd_2.1.1]# pip3 install psycopg2-2.9.1.tar.gz
# 查看已安装的包
[root@pgtest3 patroni_etcd_2.1.1]# pip3 list
deprecation: the default format will switch to columns in the future. you can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
click (8.0.3)
dnspython (2.1.0)
importlib-metadata (4.8.1)
patroni (2.1.1)
pip (9.0.3)
prettytable (2.2.1)
psutil (5.8.0)
psycopg2 (2.9.1)
psycopg2-binary (2.9.1)
python-dateutil (2.8.2)
python-etcd (0.4.5)
pyyaml (6.0)
setuptools (39.2.0)
six (1.16.0)
typing-extensions (3.10.0.2)
urllib3 (1.26.7)
wcwidth (0.2.5)
ydiff (1.2)
zipp (3.6.0)
2. 所有节点配置patroni的参数文件
2.1 创建参数文件和日志文件的存放路径
[root@pgtest1 ~]# mkdir /enmo/app/patroni
2.2 主节点创建文件 patroni_config.yml
需要注意python的yml文件格式,有严格的缩进要求,且以空格进行缩进,不要使用tab键,缩进控制不好的话,参数配置将出现各种问题。
[root@pgtest1 ~]# vi /enmo/app/patroni/patroni_config.yml
scope: pg_cluster
namespace: /service
name: pgtest1
log:
level: info
traceback_level: error
dir: /enmo/app/patroni
file_num: 10
file_size: 104857600
restapi:
listen: 192.168.58.10:8008
connect_address: 192.168.58.10:8008
etcd:
host: 192.168.58.10:2379
bootstrap:
dcs:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 1048576
max_timelines_history: 0
master_start_timeout: 300
master_stop_timeout: 0
synchronous_mode: false
postgresql:
use_pg_rewind: true
use_slots: true
parameters:
max_connections: 3000
superuser_reserved_connections: 100
max_locks_per_transaction: 64
max_worker_processes: 2
max_prepared_transactions: 0
wal_level: logical
wal_log_hints: on
track_commit_timestamp: off
max_wal_senders: 10
max_replication_slots: 10
wal_keep_size: 4096mb # wal_keep_size = wal_keep_segments * wal_segment_size = 128 * 32mb
hot_standby: "on"
listen_addresses: "*"
port: 5432
cluster_name: "pg_cluster"
archive_mode: on
archive_command: "cp %p /enmo/pgarch/%f"
postgresql:
listen: 0.0.0.0:5432
connect_address: 192.168.58.10:5432
data_dir: /enmo/pgdata
pgpass: /home/postgres/.pgpass
pg_ctl_timeout: 60
use_pg_rewind: true
remove_data_directory_on_rewind_failure: false
remove_data_directory_on_diverged_timelines: true
authentication:
replication:
username: replica
password: replica
superuser:
username: postgres
password: postgres
tags:
nofailover: false
noloadbalance: false
clonefrom: false
nosync: false
2.3 备节点1创建文件 patroni_config.yml
[root@pgtest2 ~]# vi /enmo/app/patroni/patroni_config.yml
scope: pg_cluster
namespace: /service
name: pgtest2
log:
level: info
traceback_level: error
dir: /enmo/app/patroni
file_num: 10
file_size: 104857600
restapi:
listen: 192.168.58.11:8008
connect_address: 192.168.58.11:8008
etcd:
host: 192.168.58.11:2379
bootstrap:
dcs:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 1048576
max_timelines_history: 0
master_start_timeout: 300
master_stop_timeout: 0
synchronous_mode: false
postgresql:
use_pg_rewind: true
use_slots: true
parameters:
max_connections: 3000
superuser_reserved_connections: 100
max_locks_per_transaction: 64
max_worker_processes: 2
max_prepared_transactions: 0
wal_level: logical
wal_log_hints: on
track_commit_timestamp: off
max_wal_senders: 10
max_replication_slots: 10
wal_keep_size: 4096mb # wal_keep_size = wal_keep_segments * wal_segment_size = 128 * 32mb
hot_standby: "on"
listen_addresses: "*"
port: 5432
cluster_name: "pg_cluster"
archive_mode: on
archive_command: "cp %p /enmo/pgarch/%f"
postgresql:
listen: 0.0.0.0:5432
connect_address: 192.168.58.11:5432
data_dir: /enmo/pgdata
pgpass: /home/postgres/.pgpass
pg_ctl_timeout: 60
use_pg_rewind: true
remove_data_directory_on_rewind_failure: false
remove_data_directory_on_diverged_timelines: true
authentication:
replication:
username: replica
password: replica
superuser:
username: postgres
password: postgres
tags:
nofailover: false
noloadbalance: false
clonefrom: false
nosync: false
2.4 备节点2创建文件 patroni_config.yml
[root@pgtest3 ~]# vi /enmo/app/patroni/patroni_config.yml
scope: pg_cluster
namespace: /service
name: pgtest3
log:
level: info
traceback_level: error
dir: /enmo/app/patroni
file_num: 10
file_size: 104857600
restapi:
listen: 192.168.58.12:8008
connect_address: 192.168.58.12:8008
etcd:
host: 192.168.58.12:2379
bootstrap:
dcs:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 1048576
max_timelines_history: 0
master_start_timeout: 300
master_stop_timeout: 0
synchronous_mode: false
postgresql:
use_pg_rewind: true
use_slots: true
parameters:
max_connections: 3000
superuser_reserved_connections: 100
max_locks_per_transaction: 64
max_worker_processes: 2
max_prepared_transactions: 0
wal_level: logical
wal_log_hints: on
track_commit_timestamp: off
max_wal_senders: 10
max_replication_slots: 10
wal_keep_size: 4096mb # wal_keep_size = wal_keep_segments * wal_segment_size = 128 * 32mb
hot_standby: "on"
listen_addresses: "*"
port: 5432
cluster_name: "pg_cluster"
archive_mode: on
archive_command: "cp %p /enmo/pgarch/%f"
postgresql:
listen: 0.0.0.0:5432
connect_address: 192.168.58.12:5432
data_dir: /enmo/pgdata
pgpass: /home/postgres/.pgpass
pg_ctl_timeout: 60
use_pg_rewind: true
remove_data_directory_on_rewind_failure: false
remove_data_directory_on_diverged_timelines: true
authentication:
replication:
username: replica
password: replica
superuser:
username: postgres
password: postgres
tags:
nofailover: false
noloadbalance: false
clonefrom: false
nosync: false
2.5 建议由 patroni 托管的 postgresql 参数:
- 主库和备库必须保持相同的值的参数,对于那些,在本地 patoni 配置文件中或通过环境变量设置的值不起作用。要更改或设置它们的值,必须更改 dcs 中的共享配置。
max_connections: 3000
max_locks_per_transaction: 64
max_worker_processes: 2
max_prepared_transactions: 0
wal_level: logical
wal_log_hints: on
track_commit_timestamp: off
- 对于下面的参数,postgresql 不要求主库和备库必须保持相同的值。但是,考虑到一个备库随时可能成为主库的可能性,将它们设置为不同的值实际上没有意义;因此,patroni 将其值设置为动态配置。
max_wal_senders: 10
max_replication_slots: 10
wal_keep_segments: 8
# postgresql 13 版本将 wal_keep_segments 重新定义为 wal_keep_size,决定了为备库保留的wal量。版本13采用字节大小表示,不再采用保留文件的个数,可通过下述公式换算:wal_keep_size = wal_keep_segments * wal_segment_size
wal_keep_size: 4096mb
- 还有一些其他 postgres 参数由 patroni 控制
listen_addresses: "*"
port: 5432
cluster_name: "pg_cluster"
hot_standby: "on"
为了安全起见,上述列表中的参数不会写入 postgresql.conf,而是作为参数列表传递给 pg_ctl start,这赋予它们最高优先级,甚至高于 alter system
参数文件的优先顺序
- the parameters would be applied in the following order (run-time are given the highest priority):
- load parameters from file postgresql.base.conf (or from a custom_conf file, if set)
- load parameters from file postgresql.conf
- load parameters from file postgresql.auto.conf
- run-time parameter using -o –name=value
2.6 修改目录权限
# chown -r postgres.postgres /enmo/app/patroni
2.7 参数解释
参考官方文档:
[root@pgtest1 patroni]# cat patroni_config.yml
# 集群名称
scope: pg_cluster # 集群名称
namespace: /service # patroni 将在其中保存有关集群的信息,这个路径是 etcd 存储数据的路径, default value: "/service"
name: pgtest1 # 主机名,对于集群必须是唯一的
log:
level: info # 日志级别
traceback_level: error
dir: /enmo/app/patroni/ # 日志写入的目录
file_num: 10 # 要保留的日志数量
file_size: 104857600 # 触发日志滚动的 patoni.log 文件的大小(以字节为单位)
restapi:
listen: 192.168.58.10:8008
connect_address: 192.168.58.10:8008
etcd:
# provide host to do the initial discovery of the cluster topology:
# 必须指定host、hosts、url、proxy或 srv 之一
host: 192.168.58.10:2379 # etcd 端点的 host:port
bootstrap:
# this section will be written into etcd:///config after initializing new cluster
# and all other cluster members will use it as a `global configuration`
dcs: # 动态配置(dynamic configuration)的参数设置,动态配置存储在 dcs(分布式配置存储)中并应用于所有集群节点
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 1048576 # 如果从库落后于主库超过一定数量的字节,则此设置可确保不会发生故障转移
maximum_lag_on_syncnode: -1
max_timelines_history: 0 # dcs 中保存的最大时间线历史项目数。默认值:0。当设置为 0 时,它会在 dcs 中保留完整的历史记录。
master_start_timeout: 300 # 在触发 failover 之前允许主服务器从故障中恢复的时间(单位:秒)
# master 故障的最坏情况故障转移时间是: loop_wait master_start_timeout loop_wait
master_stop_timeout: 0 # patroni 停止 postgres 时允许等待的秒数,仅在启用 synchronous_mode 时有效。超过参数值,则 patroni 会向 postmaster 发送 sigkill。
synchronous_mode: false # 打开同步复制模式。在此模式下,一个从库将被选择为同步模式的从库,只有最新的领导者和同步从库才能参与领导者选举。
synchronous_mode_strict # 如果没有可用的同步副本,则防止禁用同步复制,从而阻止所有客户端写入主服务器。
# https://patroni.readthedocs.io/en/latest/replication_modes.html
postgresql:
use_pg_rewind: true # 是否使用 pg_rewind
use_slots: true # 是否使用复制槽
parameters:
max_connections: 3000
superuser_reserved_connections: 100 # patroni 需要使用超级用户访问数据库才能正常运行
max_locks_per_transaction: 64
max_worker_processes: 2
max_prepared_transactions: 0
wal_level: logical
wal_log_hints: on
track_commit_timestamp: off
max_wal_senders: 10
max_replication_slots: 10
wal_keep_size: 4096mb # wal_keep_size = wal_keep_segments * wal_segment_size = 128 * 32mb
hot_standby: "on"
listen_addresses: "*"
port: 5432
cluster_name: "pg_cluster"
archive_mode: on
archive_command: "cp %p /enmo/pgarch/%f"
# standby_cluster: # 如果定义了这个部分,我们想要引导一个备用集群
# host: 127.0.0.1 # 远程主机的地址
# port: 1111 # 远程主机的端口
# primary_slot_name: patroni # 远程主服务器上用于复制的插槽。此参数是可选的,默认值来自实例名称
# create_replica_methods
# restore_command
# archive_cleanup_command
# recovery_min_apply_delay
initdb: # 列出要传递给 initdb 的选项
- data-checksums # 在9.3上需要 pg_rewind 时必须启用
- encoding: utf8 # 新数据库的默认编码
- locale: utf8 # 新数据库的默认语言环境
- wal-segsize: 32
- allow-group-access
- pgdata: /enmo/pgdata
- waldir: /enmo/pgwal
pg_hba: # 应该添加到 pg_hba.conf 的行列表
- host all all 0.0.0.0/0 md5
- host replication replicator 127.0.0.1/32 md5
users: # 初始化新集群后需要创建的一些额外用户
admin: # 用户名是admin
password: zalando
options: # create user 语句的选项列表
- createrole
- createdb
postgresql:
listen: 0.0.0.0:5432 # postgres 监听的 ip 地址 端口
connect_address: 192.168.58.10:5432
data_dir: /enmo/pgdata # postgres 数据目录的位置
# bin_dir: /software/pgsql/bin # postgresql 二进制文件的路径,默认值是一个空字符串,这意味着 path 环境变量将用于查找可执行文件
# config_dir: /software/pgsql/data # postgres 配置目录的位置,默认为 data_dir
pgpass: /home/postgres/.pgpass # 密码文件的路径
pg_ctl_timeout: 60 # pg_ctl 执行 start,stop 或 restart 时应等待多长时间. 默认值为 60 秒。
use_pg_rewind: true # 当它作为副本加入集群时,尝试在前领导者上使用 pg_rewind
remove_data_directory_on_rewind_failure: false # 如果启用此选项,patroni 将删除 postgresql 数据目录并重新创建副本。否则它会尝试跟随新的领导者。默认值为false
remove_data_directory_on_diverged_timelines: false # 如果 patroni 注意到时间线正在发散并且以前的 master 无法从新 master 开始流式传输,则 patroni 将删除 postgresql 数据目录并重新创建副本。此选项在 pg_rewind 无法使用时很有用。默认值为false。
authentication: # 用户验证
replication: # 复制用户
username: replica
password: replica
superuser: # 超级用户
username: postgres
password: postgres
# rewind: # pg_rewind 用户
# username:
# password:
tags:
nofailover: false # 控制是否允许此节点参与领导者竞赛并成为领导者
noloadbalance: false
clonefrom: false
nosync: false
3. 所有节点创建 patroni 服务并启动
# 创建服务,注意需要配置环境变量,否则启动服务会出现这种报错(fatal: patroni requires psycopg2>=2.5.4 or psycopg2-binary)
# requires,强制依赖 etcd.service 启动成功,可以视情况而定,etcd不启动,patroni起来后不会自动拉起数据库
[root@pgtest1 ~]# vi /usr/lib/systemd/system/patroni.service
[unit]
description=patroni
after=network.target remote-fs.target nss-lookup.target etcd.service
requires=etcd.service
[service]
type=forking
user=postgres
group=postgres
environment="pghome=/enmo/app/pgsql/13.3"
environment="pgdata=/enmo/pgdata"
environment="pgport=5432"
environment="ld_library_path=/enmo/app/pgsql/13.3/lib"
environment="path=/enmo/app/pgsql/13.3/bin:/usr/local/bin"
execstart=/bin/bash -c "patroni /enmo/app/patroni/patroni_config.yml >> /enmo/app/patroni/patroni.log 2>&1 &"
execreload=/bin/kill -s hup $mainpid
execstop=/usr/bin/killall patroni
killmode=process
timeoutsec=30
restart=no
[install]
wantedby=multi-user.target
# 启动服务
[root@pgtest1 ~]# systemctl daemon-reload
[root@pgtest1 ~]# systemctl restart patroni.service
[root@pgtest1 ~]# systemctl enable patroni.service
# 因为 patroni 会检测 postgresql 数据库是否正常运行,如果没有运行,会自动启动postgresql 数据库,所以可以考虑禁用 postgresql 服务,由 patroni 托管pg的启停
[root@pgtest1 ~]# systemctl disable postgres-13.service
注意:kill patroni 进程是否会宕库?
# 1. kill -9 patroni进程,不会宕库
[root@pgtest3 ~]# ps -ef |grep patroni
root 22695 22009 0 15:09 pts/1 00:00:00 tail -f /enmo/app/patroni/patroni.log
postgres 22754 1 0 15:13 ? 00:00:00 /usr/bin/python3 /usr/local/bin/patroni /enmo/app/patroni/patroni_config.yml
root 22790 22720 0 15:18 pts/3 00:00:00 grep --color=auto patroni
[root@pgtest3 ~]# kill -9 22754
# 2. killall patroni进程,会宕节点所在的数据库
[root@pgtest3 ~]# /usr/bin/killall patroni
# 3. 关闭 auto failover 功能,killall patroni进程,不会宕库
[root@pgtest3 ~]# patronictl -c /enmo/app/patroni/patroni_config.yml pause
success: cluster management is paused
[root@pgtest3 ~]# /usr/bin/killall patroni
在启动patroni之后,建议先使用patronictl禁掉auto failover功能,当启动完毕调试正常后再选择性的启动auto failover功能,因为启用auto failover功能后,killall patroni进程,会导致当前节点的数据库宕掉,如果主库处于生产状态,后果不堪设想。
# disable auto failover
# 如果没有启动 patroni,执行patronictl pause 会失败
[root@pgtest1 ~]# patronictl -c /enmo/app/patroni/patroni_config.yml pause
success: cluster management is paused
# resume auto failover
[root@pgtest1 ~]# patronictl -c /enmo/app/patroni/patroni_config.yml resume
success: cluster management is resumed
4. 所有节点设置patronictl别名,方便维护
[root@pgtest1 ~]# cat >> /etc/profile << eof
alias patronictl='patronictl -c /enmo/app/patroni/patroni_config.yml'
eof
[root@pgtest1 ~]# source /etc/profile
5. 查询 patroni 集群状态
[root@pgtest1 ~]# patronictl list
--------- --------------- --------- --------- ---- ----------- -----------------
| member | host | role | state | tl | lag in mb | pending restart |
cluster: pg_cluster (7025023477017500881) -- ---- ----------- -----------------
| pgtest1 | 192.168.58.10 | leader | running | 5 | | |
| pgtest2 | 192.168.58.11 | replica | running | 4 | 64 | * |
| pgtest3 | 192.168.58.12 | replica | running | 4 | 64 | * |
--------- --------------- --------- --------- ---- ----------- -----------------
至此,patroni部署完成,后面是patroni的管理部分,可选择性阅读
6. patroni 动态配置参数调整
6.1 使用 etcdctl 查看存储在etcd中的 patroni 动态配置参数
etcdctl 返回的结果是json格式,为了方便查看,可以下载安装jq命令
jq 的下载地址: jq 1.6 binaries for 64-bit or 32-bit.
上传服务器,修改可执行权限,chmod x /enmo/soft/jq-linux64
[root@pgtest1 ~]# export etcdctl_api=2
[root@pgtest1 ~]# etcdctl get /service/pg_cluster/config |/enmo/soft/jq-linux64 .
{
"ttl": 30,
"loop_wait": 10,
"retry_timeout": 10,
"maximum_lag_on_failover": 1048576,
"max_timelines_history": 0,
"master_start_timeout": 300,
"master_stop_timeout": 0,
"synchronous_mode": false,
"postgresql": {
"use_pg_rewind": true,
"use_slots": true,
"parameters": {
"max_connections": 3000,
"superuser_reserved_connections": 100,
"max_locks_per_transaction": 64,
"max_worker_processes": 2,
"max_prepared_transactions": 0,
"wal_level": "logical",
"wal_log_hints": true,
"track_commit_timestamp": false,
"max_wal_senders": 10,
"max_replication_slots": 10,
"wal_keep_size": "4096mb",
"hot_standby": "on",
"listen_addresses": "*",
"port": 5432,
"cluster_name": "pg_cluster",
"archive_mode": true,
"archive_command": "cp %p /enmo/pgarch/%f"
}
}
}
6.2 patronictl 查看 patroni 动态配置参数
[root@pgtest1 ~]# patronictl show-config
loop_wait: 10
master_start_timeout: 300
master_stop_timeout: 0
max_timelines_history: 0
maximum_lag_on_failover: 1048576
postgresql:
parameters:
archive_command: cp %p /enmo/pgarch/%f
archive_mode: true
cluster_name: pg_cluster
hot_standby: 'on'
listen_addresses: '*'
max_connections: '3000'
max_locks_per_transaction: 64
max_prepared_transactions: 0
max_replication_slots: 10
max_wal_senders: 10
max_worker_processes: 2
port: 5432
superuser_reserved_connections: 100
track_commit_timestamp: false
wal_keep_size: 4096mb
wal_level: logical
wal_log_hints: true
use_pg_rewind: true
use_slots: true
retry_timeout: 10
synchronous_mode: false
ttl: 30
6.3 通过 patronictl 调整配置参数,在其中一个节点调整,其他节点也会自动调整,并且 patroni 自动进行 reload 操作
[postgres@pgtest1 ~]$ patronictl edit-config
# 编辑文本
apply these changes? [y/n]: y
configuration changed
6.4 对于需要重启数据库生效的参数,为了减少对生产的影响,可以逐个节点重启,也可以在停机窗口通过 patronictl restart 对整个集群进行重启
# 仅重启当前节点(--any)
[root@pgtest1 ~]# patronictl restart pg_cluster --any
--------- --------------- --------- --------- ---- -----------
| member | host | role | state | tl | lag in mb |
cluster: pg_cluster (7025023477017500881) -- ---- -----------
| pgtest1 | 192.168.58.10 | replica | running | 6 | 0 |
| pgtest2 | 192.168.58.11 | leader | running | 6 | |
| pgtest3 | 192.168.58.12 | replica | running | 6 | 0 |
--------- --------------- --------- --------- ---- -----------
when should the restart take place (e.g. 2021-10-31t21:00) [now]:
are you sure you want to restart members pgtest1, pgtest2, pgtest3? [y/n]: y
restart if the postgresql version is less than provided (e.g. 9.5.2) []:
success: restart on member pgtest1
# 如果节点是 pending 状态的,才会执行重启操作
[root@pgtest1 ~]# patronictl restart pg_cluster --any --pending
--------- --------------- --------- --------- ---- -----------
| member | host | role | state | tl | lag in mb |
cluster: pg_cluster (7025023477017500881) -- ---- -----------
| pgtest1 | 192.168.58.10 | replica | running | 6 | 0 |
| pgtest2 | 192.168.58.11 | leader | running | 6 | |
| pgtest3 | 192.168.58.12 | replica | running | 6 | 0 |
--------- --------------- --------- --------- ---- -----------
when should the restart take place (e.g. 2021-10-31t21:01) [now]:
are you sure you want to restart members pgtest1, pgtest2, pgtest3? [y/n]: y
restart if the postgresql version is less than provided (e.g. 9.5.2) []:
failed: restart for member pgtest2, status code=503, (restart conditions are not satisfied)
# 重启所有成员
[root@pgtest1 ~]# patronictl restart pg_cluster
--------- --------------- --------- --------- ---- -----------
| member | host | role | state | tl | lag in mb |
cluster: pg_cluster (7025023477017500881) -- ---- -----------
| pgtest1 | 192.168.58.10 | replica | running | 6 | 0 |
| pgtest2 | 192.168.58.11 | leader | running | 6 | |
| pgtest3 | 192.168.58.12 | replica | running | 6 | 0 |
--------- --------------- --------- --------- ---- -----------
when should the restart take place (e.g. 2021-10-31t21:02) [now]:
are you sure you want to restart members pgtest1, pgtest2, pgtest3? [y/n]: y
restart if the postgresql version is less than provided (e.g. 9.5.2) []:
success: restart on member pgtest1
success: restart on member pgtest2
success: restart on member pgtest3
6.5 如何仅调整指定节点参数
上述提到了postgresql的配置参数统一由patroni管理,单独修改节点的 postgresql.conf 参数不再生效。
如果想差异化配置其中一个节点的参数,可通过 alter system 命令动态配置,如下:
[root@pgtest1 ~]# psql
postgres=# show wal_keep_segments ;
postgres=# alter system set wal_keep_segments =200;
postgres=# select pg_reload_conf();
postgres=# show wal_keep_segments ;
7. 调用浏览器网页查看集群状态
linux curl命令需要提前安装 jq 命令格式化显示结果,windows 谷歌浏览器需要安装扩展插件 jsonview,否则显示结果不好看
jsonview 的下载地址
jsonview 参考
获取当前版本的动态配置:
[root@pgtest1 ~]# curl -s "http://192.168.58.10:8008/cluster" |/enmo/soft/jq-linux64 .
{
"members": [
{
"name": "pgtest1",
"role": "leader",
"state": "running",
"api_url": "http://192.168.58.10:8008/patroni",
"host": "192.168.58.10",
"port": 5432,
"timeline": 5
},
{
"name": "pgtest2",
"role": "replica",
"state": "running",
"api_url": "http://192.168.58.11:8008/patroni",
"host": "192.168.58.11",
"port": 5432,
"timeline": 5,
"pending_restart": true,
"lag": 0
},
{
"name": "pgtest3",
"role": "replica",
"state": "running",
"api_url": "http://192.168.58.12:8008/patroni",
"host": "192.168.58.12",
"port": 5432,
"timeline": 5,
"pending_restart": true,
"lag": 0
}
]
}
8. patroni rest api
8.1 监控 monitoring endpoint
[root@pgtest1 ~]# curl -s "http://192.168.58.10:8008/patroni" |/enmo/soft/jq-linux64 .
8.2 集群状态 cluster status endpoints
[root@pgtest1 ~]# curl -s "http://192.168.58.10:8008/cluster" |/enmo/soft/jq-linux64 .
[root@pgtest1 ~]# curl -s "http://192.168.58.10:8008/history" |/enmo/soft/jq-linux64 .
8.3 配置 config endpoint
- 获取当前版本的动态配置
[root@pgtest1 ~]# curl -s "http://192.168.58.10:8008/config" |/enmo/soft/jq-linux64 .
- 更改现有配置
[root@pgtest1 ~]# curl -s -xpatch -d '{"loop_wait":9,"ttl":25,"postgresql":{"parameters":{"max_connections":"2500"}}}' http://192.168.58.10:8008/config |/enmo/soft/jq-linux64 .
{
"ttl": 25,
"loop_wait": 9,
"retry_timeout": 10,
"maximum_lag_on_failover": 1048576,
"max_timelines_history": 0,
"master_start_timeout": 300,
"master_stop_timeout": 0,
"synchronous_mode": false,
"postgresql": {
"use_pg_rewind": true,
"use_slots": true,
"parameters": {
"max_connections": "2500",
"superuser_reserved_connections": 100,
"max_locks_per_transaction": 64,
"max_worker_processes": 2,
"max_prepared_transactions": 0,
"wal_level": "logical",
"wal_log_hints": true,
"track_commit_timestamp": false,
"max_wal_senders": 10,
"max_replication_slots": 10,
"wal_keep_size": "4096mb",
"hot_standby": "on",
"listen_addresses": "*",
"port": 5432,
"cluster_name": "pg_cluster",
"archive_mode": true,
"archive_command": "cp %p /enmo/pgarch/%f"
}
}
}
此时 patronictl 显示有需要重启生效的参数被修改(pending restart)
[root@pgtest1 ~]# patronictl restart pg_cluster
--------- --------------- --------- --------- ---- ----------- -----------------
| member | host | role | state | tl | lag in mb | pending restart |
cluster: pg_cluster (7025023477017500881) -- ---- ----------- -----------------
| pgtest1 | 192.168.58.10 | leader | running | 5 | | * |
| pgtest2 | 192.168.58.11 | replica | running | 5 | 0 | * |
| pgtest3 | 192.168.58.12 | replica | running | 5 | 0 | * |
--------- --------------- --------- --------- ---- ----------- -----------------
- 重启数据库使参数生效
[root@pgtest1 ~]# patronictl restart pg_cluster
--------- --------------- --------- --------- ---- ----------- -----------------
| member | host | role | state | tl | lag in mb | pending restart |
cluster: pg_cluster (7025023477017500881) -- ---- ----------- -----------------
| pgtest1 | 192.168.58.10 | leader | running | 5 | | * |
| pgtest2 | 192.168.58.11 | replica | running | 5 | 0 | * |
| pgtest3 | 192.168.58.12 | replica | running | 5 | 0 | * |
--------- --------------- --------- --------- ---- ----------- -----------------
when should the restart take place (e.g. 2021-10-31t18:55) [now]:
are you sure you want to restart members pgtest1, pgtest2, pgtest3? [y/n]: y
restart if the postgresql version is less than provided (e.g. 9.5.2) []:
success: restart on member pgtest1
success: restart on member pgtest2
success: restart on member pgtest3
[root@pgtest1 ~]# patronictl list
--------- --------------- --------- --------- ---- -----------
| member | host | role | state | tl | lag in mb |
cluster: pg_cluster (7025023477017500881) -- ---- -----------
| pgtest1 | 192.168.58.10 | leader | running | 5 | |
| pgtest2 | 192.168.58.11 | replica | running | 5 | 0 |
| pgtest3 | 192.168.58.12 | replica | running | 5 | 0 |
--------- --------------- --------- --------- ---- -----------
- 删除(重置)某些设置 (null)
$ curl -s -xpatch -d '{"postgresql":{"parameters":{"max_connections":null}}}' http://192.168.58.10:8008/config |/enmo/soft/jq-linux64 .
8.4 switchover and failover endpoints
- failover endpoints 允许在没有健康节点时执行手动 failover ,但同时它不允许 switchover 。
- switchover endpoint 则相反。它仅在集群健康(有leader)时才起作用,并允许在指定时间安排切换。
- 执行到特定节点的 failover,在节点都正常的情况下,执行 failover 实际上和执行 switchover 一样
[root@pgtest1 ~]# patronictl list
--------- --------------- --------- --------- ---- -----------
| member | host | role | state | tl | lag in mb |
cluster: pg_cluster (7025023477017500881) -- ---- -----------
| pgtest1 | 192.168.58.10 | leader | running | 5 | |
| pgtest2 | 192.168.58.11 | replica | running | 5 | 0 |
| pgtest3 | 192.168.58.12 | replica | running | 5 | 0 |
--------- --------------- --------- --------- ---- -----------
[root@pgtest1 ~]# curl -s http://192.168.58.10:8008/failover -xpost -d '{"candidate":"pgtest2"}'
successfully failed over to "pgtest2"
[root@pgtest1 ~]# patronictl list
--------- --------------- --------- --------- ---- -----------
| member | host | role | state | tl | lag in mb |
cluster: pg_cluster (7025023477017500881) -- ---- -----------
| pgtest1 | 192.168.58.10 | replica | running | 6 | 0 |
| pgtest2 | 192.168.58.11 | leader | running | 6 | |
| pgtest3 | 192.168.58.12 | replica | running | 6 | 0 |
--------- --------------- --------- --------- ---- -----------
[root@pgtest1 ~]# psql
psql (13.3)
type "help" for help.
postgres=# select pg_is_in_recovery();
pg_is_in_recovery
-------------------
t
(1 row)
- 在指定时间从 leader 到集群中任何其他健康节点的 switchover
[root@pgtest1 ~]# patronictl list
--------- --------------- --------- --------- ---- -----------
| member | host | role | state | tl | lag in mb |
cluster: pg_cluster (7025023477017500881) -- ---- -----------
| pgtest1 | 192.168.58.10 | replica | running | 6 | 0 |
| pgtest2 | 192.168.58.11 | leader | running | 6 | |
| pgtest3 | 192.168.58.12 | replica | running | 6 | 0 |
--------- --------------- --------- --------- ---- -----------
# switchover 必须至少指定 leader 或 candidate 字段和可选的 scheduled_at 字段。
# 2021-10-31t18:42 08 年-月-日t时:分 时区( 08)
[root@pgtest1 ~]# curl -s http://192.168.58.10:8008/switchover -xpost -d '{"leader":"pgtest2","scheduled_at":"2021-10-31t18:42 08"}'
switchover scheduled
# patronictl list 会显示 switchover scheduled at
[root@pgtest1 ~]# patronictl list
--------- --------------- --------- --------- ---- -----------
| member | host | role | state | tl | lag in mb |
cluster: pg_cluster (7025023477017500881) -- ---- -----------
| pgtest1 | 192.168.58.10 | replica | running | 6 | 0 |
| pgtest2 | 192.168.58.11 | leader | running | 6 | |
| pgtest3 | 192.168.58.12 | replica | running | 6 | 0 |
--------- --------------- --------- --------- ---- -----------
switchover scheduled at: 2021-10-31t18:42:00 00:00
from: pgtest2
# patroni 日志里也有倒计时提醒
2021-10-31 18:40:51,002 info: no action. i am (pgtest2) the leader with the lock
2021-10-31 18:40:59,995 info: lock owner: pgtest2; i am pgtest2
2021-10-31 18:40:59,997 info: awaiting failover at 2021-10-31t18:42:00 00:00 (in 28860 seconds)
2021-10-31 18:41:00,000 info: no action. i am (pgtest2) the leader with the lock
2021-10-31 18:41:08,996 info: lock owner: pgtest2; i am pgtest2
2021-10-31 18:41:08,998 info: awaiting failover at 2021-10-31t18:42:00 00:00 (in 28851 seconds)
# 删除定时切换任务
[root@pgtest1 ~]# curl -s http://192.168.58.10:8008/switchover -xdelete
scheduled switchover deleted
[root@pgtest1 ~]# patronictl list
--------- --------------- --------- --------- ---- -----------
| member | host | role | state | tl | lag in mb |
cluster: pg_cluster (7025023477017500881) -- ---- -----------
| pgtest1 | 192.168.58.10 | replica | running | 6 | 0 |
| pgtest2 | 192.168.58.11 | leader | running | 6 | |
| pgtest3 | 192.168.58.12 | replica | running | 6 | 0 |
--------- --------------- --------- --------- ---- -----------
post /switchover 和 post /failover 分别对应 patientictl switchover 和 patonictl failover 命令。delete /switchover 对应 patronictl flush
# switchover
[postgres@pgtest1 ~]$ patronictl switchover
master [pgtest1]:
candidate ['pgtest2', 'pgtest3'] []: pgtest2
when should the switchover take place (e.g. 2021-10-28t04:45 ) [now]:
current cluster topology
... ...
are you sure you want to switchover cluster pg_cluster, demoting current master pgtest1? [y/n]: y
2021-10-28 03:45:35.91763 successfully switched over to "pgtest2"
... ...
# failover
[postgres@pgtest1 ~]$ patronictl failover
candidate ['pgtest1', 'pgtest3'] []: pgtest1
current cluster topology
... ...
are you sure you want to failover cluster pg_cluster, demoting current master pgtest2? [y/n]: y
2021-10-28 03:47:56.13486 successfully failed over to "pgtest1"
... ...
8.5 重启 restart endpoint
- post /restart 提供了几个参数
restart_pending: boolean, if set to true patroni will restart postgresql only when restart is pending in order to apply some changes in the postgresql config.
role: perform restart only if the current role of the node matches with the role from the post request.
postgres_version: perform restart only if the current version of postgres is smaller than specified in the post request.
timeout: how long we should wait before postgresql starts accepting connections. overrides master_start_timeout.
schedule: timestamp with time zone, schedule the restart somewhere in the future.
- post /restart 仅重启当前节点的数据库
[root@pgtest1 ~]# curl -s http://192.168.58.10:8008/restart -xpost
restarted successfully
- post /restart 定时重启当前节点的数据库
[root@pgtest1 ~]# curl -s http://192.168.58.10:8008/restart -xpost -d '{"schedule":"2021-10-31t19:18 08"}'
restart scheduled
[root@pgtest1 ~]# patronictl list
--------- --------------- --------- --------- ---- ----------- ---------------------------
| member | host | role | state | tl | lag in mb | scheduled restart |
cluster: pg_cluster (7025023477017500881) -- ---- ----------- ---------------------------
| pgtest1 | 192.168.58.10 | replica | running | 6 | 0 | 2021-10-31t19:18:00 08:00 |
| pgtest2 | 192.168.58.11 | leader | running | 6 | | |
| pgtest3 | 192.168.58.12 | replica | running | 6 | 0 | |
--------- --------------- --------- --------- ---- ----------- ---------------------------
- delete /restart 删除定时重启任务
[root@pgtest1 ~]# curl -s http://192.168.58.10:8008/restart -xdelete
post /restart and delete /restart endpoints are used by patronictl restart and patronictl flush
8.6 重新加载 reload endpoint
post /reload 让 patroni 重新读取和应用配置文件。这相当于向 patroni 进程发送 sighup 信号。如果您更改了一些需要重新启动 postgres 的参数(如 shared_buffers),您仍然必须通过调用 post /restart 或使用 patriotictl restart 明确地重新启动postgres。
[root@pgtest1 ~]# curl -s http://192.168.58.10:8008/reload -xpost
reload scheduled
the reload endpoint is used by patronictl reload
[postgres@pgtest1 ~]$ patronictl reload pg_cluster
... ...
are you sure you want to reload members pgtest3, pgtest2, pgtest1? [y/n]: y
reload request received for member pgtest3 and will be processed within 10 seconds
reload request received for member pgtest2 and will be processed within 10 seconds
reload request received for member pgtest1 and will be processed within 10 seconds
8.7 重新初始化 reinitialize endpoint
只允许在从节点上执行,一旦调用,它将删除数据目录并启动 pg_basebackup 重新初始化指定节点上的 postgresql 数据目录。
the call might fail if patroni is in a loop trying to recover (restart) a failed postgres. in order to overcome this problem one can specify {“force”:true} in the request body.
[root@pgtest1 ~]# curl -s http://192.168.58.10:8008/reinitialize -xpost
reinitialize started
the reinitialize endpoint is used by patronictl reinit.
[postgres@pgtest1 ~]$ patronictl reinit pg_cluster
... ...
which member do you want to reinitialize [pgtest3, pgtest2, pgtest1]? []: pgtest3
are you sure you want to reinitialize members pgtest3? [y/n]: y
success: reinitialize for member pgtest3
9. 使用 patronictl 执行数据库查询操作
[postgres@pgtest1 ~]$ cat aa.sql
select * from test_1;
[postgres@pgtest1 ~]$ patronictl -c /enmo/app/patroni/pg_test01.yml query -f aa.sql --password
password:
id create_time
1 2021-10-16 17:47:34
2 2021-10-16 17:55:06
[postgres@pgtest1 ~]$ patronictl -c /enmo/app/patroni/pg_test01.yml query -c "select * from test_1;" --password
password:
id create_time
1 2021-10-16 17:47:34
2 2021-10-16 17:55:06
10. 其他参考官方文档
patroni 官方文档: