大家好,这次大表哥带来技术分享是 pgpool ii 这个连接池自身的ha方案: watch dog.
本文参考m6米乐安卓版下载官网的 configuration examples:
https://tatsuo-ishii.github.io/pgpool-ii/current/example-cluster.html
watch dog 是pgpool 的内置的监控进程不需要独立安装,打开相关的参数开关即可。
具体的配置步骤如下:
1)官方建议配置至少3台(奇数个)几点配置watch dog 进程。
ip | pg pool role | pg instance role |
---|---|---|
10.67.38.50 | leader | primary database |
10.67.39.149 | standby | standby database |
10.67.39.49 | standby | standy database |
2)手动创建 pgpool_node_id 文件, 在路径下:${pgpool_home}/etc
这一步是必须的,否则启动会报错:if watchdog is enable, pgpool_node_id file is required
infra [root@wqdcsrv3352 etc]# /opt/postgresql/pgpool4.3/bin/pgpool 2022-06-30 14:53:25.867: main pid 69061: 2022-06-30 14:53:25fatal: pgpool node id file /opt/postgresql/pgpool4.3/etc/pgpool_node_id does not exist 2022-06-30 14:53:25.867: main pid 69061: 2022-06-30 14:53:25detail: if watchdog is enable, pgpool_node_id file is required
infra [postgres@wqdcsrv3352 pgpool4.3]# vi pgpool_node_id infra [postgres@wqdcsrv3352 pgpool4.3]# cat pgpool_node_id 0 infra [postgres@wqdcsrv3354 pgpool]# vi pgpool_node_id infra [postgres@wqdcsrv3354 pgpool]# cat pgpool_node_id 1 infra [postgres@wqdcsrv3353 pgpool]# vi pgpool_node_id infra [postgres@wqdcsrv3353 pgpool]# cat pgpool_node_id 2
3)关于watch dog 核心参数的配置: 配置文件 ${pgpool_home}/etc/pgpool.conf
#------------------------------------------------------------------------------
# watchdog
#------------------------------------------------------------------------------
## watch dog的开关
use_watchdog = on
## watch 的节点 ip,port 信息
hostname0 = '10.67.38.50'
wd_port0 = 9000
pgpool_port0 = 9999
hostname1 = '10.67.39.149'
wd_port1 = 9000
pgpool_port1 = 9999
hostname2 = '10.67.39.49'
wd_port2 = 9000
pgpool_port2 = 9999
### 优先级的设置 这里我们设置成一致的 为1 ,
### 没有设置加密方式
### 进程文件设置在了 /tmp下面
wd_priority = 1
wd_authkey = ''
wd_ipc_socket_dir = '/tmp'
###vip相关的设置 需要根据实际的 网卡设备名称 eth0
### 这个我们pgpool 是通过 pg pool 启动的,所以 不需要 sudo 可以直接执行, 如果是 postgres等普通权限的用户,需要添加 /sbin/sudo
delegate_ip = '10.67.39.200'
if_up_cmd = '/sbin/ip addr add $_ip_$/24 dev eth0 label eth0:0'
if_down_cmd = '/sbin/ip addr del $_ip_$/24 dev eth0'
arping_cmd = '/usr/sbin/arping -u $_ip_$ -w 1 -i eth0'
wd_escalation_command = '/opt/postgresql/pgpool4.3/etc/escalation.sh' ### 这个脚本的执行发生在挂上vip之前,需要把所有的节点vip 卸载一遍,以防止脑裂的发生
###watch dog 的心跳方式配置:
###通过网卡端口监控, 10秒为超时
wd_lifecheck_method = 'heartbeat'
wd_interval = 10
heartbeat_hostname0 = '10.67.38.50'
heartbeat_port0 = 9694
heartbeat_device0 = 'eth0'
heartbeat_hostname1 = '10.67.39.149'
heartbeat_port1 = 9694
heartbeat_device1 = 'eth0'
heartbeat_hostname2 = '10.67.39.49'
heartbeat_port2 = 9694
heartbeat_device2 = 'eth0'
wd_heartbeat_keepalive = 2
wd_heartbeat_deadtime = 30
- 需要修改一下 escalation.sh 这个shell 文件
修改成实际节点的ip和vip
#!/bin/bash
# this script is run by wd_escalation_command to bring down the virtual ip on other pgpool nodes
# before bringing up the virtual ip on the new active pgpool node.
set -o xtrace
pgpools=(10.67.38.50 10.67.39.149 10.67.39.49)
hostname=`hostname -i`
vip=10.67.39.200
device=eth0
for pgpool in "${pgpools[@]}"; do
[ "$hostname" = "$pgpool" ] && continue
ssh -t -o stricthostkeychecking=no -o userknownhostsfile=/dev/null postgres@$pgpool -i ~/.ssh/id_rsa_pgpool "
/usr/bin/sudo /sbin/ip addr del $vip/24 dev $device
"
done
exit 0
5)依次启动3个节点的 watch dog (这里我们用大权限的账户 root ), 观察 watch dog的 节点状态
infra [root@wqdcsrv3352 etc]# /opt/postgresql/pgpool4.3/bin/pgpool infra [root@wqdcsrv3353 etc]# /opt/postgresql/pgpool4.3/bin/pgpool infra [root@wqdcsrv3354 etc]# /opt/postgresql/pgpool4.3/bin/pgpool infra [postgres@wqdcsrv3352 etc]# /opt/postgresql/pgpool4.3/bin/pcp_watchdog_info -h localhost -p 9898 -u postgres password: 3 3 yes 10.67.38.50:9999 linux wqdcsrv3352.cn.infra 10.67.38.50 10.67.38.50:9999 linux wqdcsrv3352.cn.infra 10.67.38.50 9999 9000 4 leader 0 member 10.67.39.149:9999 linux wqdcsrv3354.cn.infra 10.67.39.149 9999 9000 7 standby 0 member 10.67.39.49:9999 linux wqdcsrv3353.cn.infra 10.67.39.49 9999 9000 7 standby 0 member
6)手动测试vip 漂移
我们尝试关闭 leader 节点上的pgpool : 10.67.38.50
infra [root@wqdcsrv3352 ~]# /opt/postgresql/pgpool4.3/bin/pgpool -m fast stop 2022-06-29 17:34:33.552: main pid 76158: 2022-06-29 17:34:33log: stop request sent to pgpool (pid: 73720). waiting for termination... .done.
我们可以从pgpool 的日志中,看到vip已经被移除了
2022-06-29 17:34:33.552: main pid 73720: 2022-06-29 17:34:33log: terminating all child processes 2022-06-29 17:34:33.556: watchdog pid 73723: 2022-06-29 17:34:33log: watchdog is shutting down 2022-06-29 17:34:33.556: watchdog_utility pid 76159: 2022-06-29 17:34:33log: watchdog: de-escalation started 2022-06-29 17:34:33.558: watchdog_utility pid 76159: 2022-06-29 17:34:33log: successfully released the delegate ip:"10.67.39.200" 2022-06-29 17:34:33.558: watchdog_utility pid 76159: 2022-06-29 17:34:33detail: 'if_down_cmd' returned with success 2022-06-29 17:34:33.559: main pid 73720: 2022-06-29 17:34:33log: pgpool-ii system is shutdown
infra [root@wqdcsrv3352 ~]# ip addr | grep 10.67.39.200
我们这个时候发现vip 10.67.39.200 已经漂移到了节点 149:
infra [root@wqdcsrv3354 ~]# ip addr |grep 10.67.39.200 inet 10.67.39.200/24 scope global eth0:0
此时我们再次观察pgpool 集群的状态:
infra [root@wqdcsrv3354 ~]# /opt/postgresql/pgpool4.3/bin/pcp_watchdog_info -h localhost -p 9898 -u postgres password: 3 3 yes 10.67.39.149:9999 linux wqdcsrv3354.cn.infra 10.67.39.149 10.67.39.149:9999 linux wqdcsrv3354.cn.infra 10.67.39.149 9999 9000 4 leader 0 member 10.67.38.50:9999 linux wqdcsrv3352.cn.infra 10.67.38.50 9999 9000 10 shutdown 0 member 10.67.39.49:9999 linux wqdcsrv3353.cn.infra 10.67.39.49 9999 9000 7 standby 0 member
我们手动起来 10.67.38.50 节点上的 pgpool:
infra [root@wqdcsrv3352 ~]# /opt/postgresql/pgpool4.3/bin/pgpool -f /opt/postgresql/pgpool4.3/etc/pgpool.conf -f /opt/postgresql/pgpool4.3/etc/pcp.conf > /tmp/pgpool.log
这个时候10.67.38.50 的节点 恢复成了 standby 的状态:
3 3 yes 10.67.39.149:9999 linux wqdcsrv3354.cn.infra 10.67.39.149 10.67.39.149:9999 linux wqdcsrv3354.cn.infra 10.67.39.149 9999 9000 4 leader 0 member 10.67.38.50:9999 linux wqdcsrv3352.cn.infra 10.67.38.50 9999 9000 7 standby 0 member 10.67.39.49:9999 linux wqdcsrv3353.cn.infra 10.67.39.49 9999 9000 7 standby 0 member
下面我们来用springboot的小程序测试一下 pgpool 的ha :
我们的程序的数据库配置是 这次是连接到 vip 10.67.39.200 上面
连接池的配置是 最大允许 200 个连接,最小空闲的连接数是10
//数据库的地址以及端口号 spring.datasource.url=jdbc:postgresql://10.67.39.200:9999/postgres spring.datasource.username=app_ha_user spring.datasource.password=app_ha_user spring.datasource.driverclassname=org.postgresql.driver spring.jpa.properties.hibernate.dialect=org.hibernate.dialect.postgresqldialect spring.jpa.properties.hibernate.hbm2ddl.auto=update spring.datasource.maximum-pool-size=200 spring.datasource.min-idle=10
测试生成很简单,就是网页输出一下数据库的ip地址:
string sql = "select inet_server_addr()"; // 通过jdbctemplate查询数据库 string hostname = (string)jdbctemplate.queryforobject( sql, string.class); return "hello ,you are connecting to " hostname;
启动测试程序后,观察初始化的连接池:
postgres=> show pool_processes; pool_pid | start_time | client_connection_count | database | username | backend_connection_time | pool_counter | status ---------- ------------------------------------------------------ ------------------------- ---------- ------------- ------------------------- -------------- --------------------- 80665 | 2022-06-29 17:30:21 | 0 | | | | | wait for connection 80666 | 2022-06-29 17:30:21 | 0 | postgres | app_ha_user | 2022-06-30 11:29:50 | 1 | idle 64166 | 2022-06-30 11:11:51 | 2 | | | | | wait for connection 80668 | 2022-06-29 17:30:21 | 0 | postgres | app_ha_user | 2022-06-30 11:29:51 | 1 | idle 80669 | 2022-06-29 17:30:21 | 1 | | | | | wait for connection 65854 | 2022-06-30 11:27:54 | 0 | | | | | wait for connection 80671 | 2022-06-29 17:30:21 | 0 | postgres | app_ha_user | 2022-06-30 11:29:50 | 1 | idle 80672 | 2022-06-29 17:30:21 | 0 | postgres | app_ha_user | 2022-06-30 11:29:50 | 1 | idle 80673 | 2022-06-29 17:30:21 | 0 | | | | | wait for connection 80674 | 2022-06-29 17:30:21 | 0 | postgres | app_ha_user | 2022-06-30 11:29:51 | 1 | idle 80675 | 2022-06-29 17:30:21 | 0 | | | | | wait for connection 80676 | 2022-06-29 17:30:21 | 0 | | | | | wait for connection 80677 | 2022-06-29 17:30:21 | 0 | postgres | app_ha_user | 2022-06-30 11:29:51 | 1 | idle 80678 | 2022-06-29 17:30:21 | 1 | postgres | app_ha_user | 2022-06-30 11:29:50 | 1 | idle 80679 | 2022-06-29 17:30:21 | 0 | postgres | app_ha_user | 2022-06-30 11:29:50 | 1 | idle 80680 | 2022-06-29 17:30:21 (4:15 before process restarting) | 1 | | | | | wait for connection 80681 | 2022-06-29 17:30:21 | 0 | | | | | wait for connection 80682 | 2022-06-29 17:30:21 | 0 | postgres | app_ha_user | 2022-06-30 11:29:51 | 1 | idle 80683 | 2022-06-29 17:30:21 | 0 | | | | | wait for connection 80684 | 2022-06-29 17:30:21 | 2 | | | | | wait for connection 80685 | 2022-06-29 17:30:21 | 0 | | | | | wait for connection 80686 | 2022-06-29 17:30:21 | 0 | | | | | wait for connection 80687 | 2022-06-29 17:30:21 | 0 | | | | | wait for connection 80688 | 2022-06-29 17:30:21 | 0 | | | | | wait for connection 80689 | 2022-06-29 17:30:21 | 0 | | | | | wait for connection 80690 | 2022-06-29 17:30:21 | 0 | | | | | wait for connection 80691 | 2022-06-29 17:30:21 | 0 | | | | | wait for connection 80692 | 2022-06-29 17:30:21 | 0 | | | | | wait for connection 80693 | 2022-06-29 17:30:21 | 0 | postgres | app_ha_user | 2022-06-30 11:29:51 | 1 | idle 80694 | 2022-06-29 17:30:21 | 0 | | | | | wait for connection 80695 | 2022-06-29 17:30:21 | 0 | postgres | app_ha_user | 2022-06-30 11:30:30 | 1 | execute command 80696 | 2022-06-29 17:30:21 | 0 | | | | | wait for connection
尝试访问web界面: http://127.0.0.1:8066/ 我们可以看到 50这个节点是主库
关闭vip所在节点的pgpool :
infra [root@wqdcsrv3354 ~]# ip addr 1: lo:
mtu 65536 qdisc noqueue state unknown group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever 2: eth0: mtu 1500 qdisc mq state up group default qlen 1000 link/ether 00:50:56:ae:f7:e7 brd ff:ff:ff:ff:ff:ff inet 10.67.39.149/22 brd 10.67.39.255 scope global eth0 valid_lft forever preferred_lft forever inet 10.67.39.200/24 scope global eth0:0 valid_lft forever preferred_lft forever infra [root@wqdcsrv3354 ~]# /opt/postgresql/pgpool4.3/bin/pgpool -m fast stop 2022-06-30 11:34:08.821: main pid 66716: 2022-06-30 11:34:08log: stop request sent to pgpool (pid: 80655). waiting for termination... .done.
这个时候我们看到vip 漂移到了 49这个节点:
infra [postgres@wqdcsrv3353 ~]# ip addr 1: lo:
mtu 65536 qdisc noqueue state unknown group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever 2: eth0: mtu 1500 qdisc mq state up group default qlen 1000 link/ether 00:50:56:ae:99:07 brd ff:ff:ff:ff:ff:ff inet 10.67.39.49/22 brd 10.67.39.255 scope global eth0 valid_lft forever preferred_lft forever inet 10.67.39.200/24 scope global eth0:0 valid_lft forever preferred_lft forever
再次访问网页: http://127.0.0.1:8066/ 我们可以看到通过vip 依然可以连接到 主库 50 这个节点.
我们可以观察到 测试程序的连接池进行了重连
2022-06-30 11:39:01,161 warn (poolbase.java:184)- hikaripool-1 - failed to validate connection org.postgresql.jdbc.pgconnection@51849c83 (this connection has been closed.). possibly consider using a shorter maxlifetime value. 2022-06-30 11:39:01,164 warn (poolbase.java:184)- hikaripool-1 - failed to validate connection org.postgresql.jdbc.pgconnection@46432d85 (this connection has been closed.). possibly consider using a shorter maxlifetime value. 2022-06-30 11:39:01,168 warn (poolbase.java:184)- hikaripool-1 - failed to validate connection org.postgresql.jdbc.pgconnection@33d6c14a (this connection has been closed.). possibly consider using a shorter maxlifetime value. 2022-06-30 11:39:01,169 warn (poolbase.java:184)- hikaripool-1 - failed to validate connection org.postgresql.jdbc.pgconnection@64116a2c (this connection has been closed.). possibly consider using a shorter maxlifetime value. 2022-06-30 11:39:01,169 warn (poolbase.java:184)- hikaripool-1 - failed to validate connection org.postgresql.jdbc.pgconnection@6641e508 (this connection has been closed.). possibly consider using a shorter maxlifetime value. 2022-06-30 11:39:01,170 warn (poolbase.java:184)- hikaripool-1 - failed to validate connection org.postgresql.jdbc.pgconnection@3931c340 (this connection has been closed.). possibly consider using a shorter maxlifetime value. 2022-06-30 11:39:01,170 warn (poolbase.java:184)- hikaripool-1 - failed to validate connection org.postgresql.jdbc.pgconnection@3ce889bd (this connection has been closed.). possibly consider using a shorter maxlifetime value. 2022-06-30 11:39:01,171 warn (poolbase.java:184)- hikaripool-1 - failed to validate connection org.postgresql.jdbc.pgconnection@5347c406 (this connection has been closed.). possibly consider using a shorter maxlifetime value. 2022-06-30 11:39:01,172 warn (poolbase.java:184)- hikaripool-1 - failed to validate connection org.postgresql.jdbc.pgconnection@3a4712c3 (this connection has been closed.). possibly consider using a shorter maxlifetime value. 2022-06-30 11:39:01,172 warn (poolbase.java:184)- hikaripool-1 - failed to validate connection org.postgresql.jdbc.pgconnection@ddbed08 (this connection has been closed.). possibly consider using a shorter maxlifetime value.