repmgr搭建参考上一篇 https://www.modb.pro/db/15358
$ vi /opt/repmgr.conf
failover='automatic'
connection_check_type=ping
reconnect_attempts=6
reconnect_interval=10
#如果主节点失败,则要进行切换的工作,并记录到日志
promote_command='/opt/pgsql/bin/repmgr standby promote -f /opt/repmgr.conf --log-to-file'
#如果有连接到主节点的其他从节点,会根据命令连接到新主的节点
follow_command='/opt/pgsql/bin/repmgr standby follow -f /opt/repmgr.conf --log-to-file --upstream-node-id=%n'
$ vi data6000/postgresql.conf
shared_preload_libraries = 'repmgr'
$ /opt/pgsql/bin/repmgrd -f /opt/repmgr.conf
--通过事件记录查看repmgrd的开启情况
[postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf cluster event --event=repmgrd_start
node id | name | event | ok | timestamp | details
--------- ------- --------------- ---- --------------------- --------------------------------------------------------
1 | node1 | repmgrd_start | t | 2020-02-02 22:33:30 | monitoring cluster primary "node1" (id: 1)
2 | node2 | repmgrd_start | t | 2020-02-02 22:33:25 | monitoring connection to upstream node "node1" (id: 1)
--查看repmgr和repmgrd进程
[postgres@ha1 ~]$ ps f -u postgres
pid tty stat time command
13124 pts/0 s 0:00 -bash
13389 pts/0 r 0:00 \_ ps f -u postgres
13191 ? s 0:00 /opt/pgsql/bin/repmgrd -f /opt/repmgr.conf
3247 ? ss 0:00 /opt/pg12/bin/postgres -d /opt/data6000
3272 ? ss 0:00 \_ postgres: logger
3680 ? ss 0:00 \_ postgres: checkpointer
3681 ? ss 0:00 \_ postgres: background writer
3682 ? ss 0:00 \_ postgres: walwriter
3683 ? ss 0:00 \_ postgres: autovacuum launcher
3684 ? ss 0:00 \_ postgres: archiver
3685 ? ss 0:00 \_ postgres: stats collector
3686 ? ss 0:00 \_ postgres: logical replication launcher
13184 ? ss 0:00 \_ postgres: repmgr repmgr 192.168.99.202(41551) idle
13189 ? ss 0:00 \_ postgres: repmgr repmgr 192.168.99.200(49369) idle
13289 ? ss 0:00 \_ postgres: walsender repmgr 192.168.99.202(41735) streaming 0/17003308
[postgres@ha2 ~]$ ps f -u postgres
pid tty stat time command
3926 pts/0 s 0:00 -bash
4056 pts/0 r 0:00 \_ ps f -u postgres
3950 ? s 0:00 /opt/pgsql/bin/repmgrd -f /opt/repmgr.conf
3241 ? ss 0:00 /opt/pg12/bin/postgres -d /opt/data6000
3259 ? ss 0:00 \_ postgres: logger
3260 ? ss 0:00 \_ postgres: startup recovering 000000070000000000000017
3674 ? ss 0:00 \_ postgres: checkpointer
3675 ? ss 0:00 \_ postgres: background writer
3676 ? ss 0:00 \_ postgres: stats collector
3948 ? ss 0:00 \_ postgres: repmgr repmgr 192.168.99.202(62092) idle
4002 ? ss 0:00 \_ postgres: walreceiver streaming 0/17003308
[postgres@ha1 ~]$ /opt/pgsql/bin/pg_ctl -d /opt/data6000/ stop
waiting for server to shut down.... done
server stopped
--查看备库ha2日志,可以看到1 分钟后,无法连接到主库ha1,程序已经开始了切库的服务
[postgres@ha2 ~]$ cat /opt/data6000/repmgr.log
...
[2020-02-03 11:25:34] [info] node "node2" (id: 2) monitoring upstream node "node1" (id: 1) in normal state
[2020-02-03 11:25:36] [warning] unable to ping "host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2"
[2020-02-03 11:25:36] [detail] pqping() returned "pqping_no_response"
[2020-02-03 11:25:36] [warning] unable to connect to upstream node "node1" (id: 1)
[2020-02-03 11:25:36] [info] checking state of node 1, 1 of 6 attempts
[2020-02-03 11:25:36] [warning] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.99.200 port=6000 fallback_application_name=repmgr"
[2020-02-03 11:25:36] [detail] pqping() returned "pqping_no_response"
[2020-02-03 11:25:36] [info] sleeping 10 seconds until next reconnection attempt
[2020-02-03 11:25:46] [info] checking state of node 1, 2 of 6 attempts
[2020-02-03 11:25:46] [warning] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.99.200 port=6000 fallback_application_name=repmgr"
[2020-02-03 11:25:46] [detail] pqping() returned "pqping_no_response"
[2020-02-03 11:25:46] [info] sleeping 10 seconds until next reconnection attempt
[2020-02-03 11:25:56] [info] checking state of node 1, 3 of 6 attempts
[2020-02-03 11:25:56] [warning] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.99.200 port=6000 fallback_application_name=repmgr"
[2020-02-03 11:25:56] [detail] pqping() returned "pqping_no_response"
[2020-02-03 11:25:56] [info] sleeping 10 seconds until next reconnection attempt
[2020-02-03 11:26:06] [info] checking state of node 1, 4 of 6 attempts
[2020-02-03 11:26:06] [warning] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.99.200 port=6000 fallback_application_name=repmgr"
[2020-02-03 11:26:06] [detail] pqping() returned "pqping_no_response"
[2020-02-03 11:26:06] [info] sleeping 10 seconds until next reconnection attempt
[2020-02-03 11:26:16] [info] checking state of node 1, 5 of 6 attempts
[2020-02-03 11:26:16] [warning] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.99.200 port=6000 fallback_application_name=repmgr"
[2020-02-03 11:26:16] [detail] pqping() returned "pqping_no_response"
[2020-02-03 11:26:16] [info] sleeping 10 seconds until next reconnection attempt
[2020-02-03 11:26:26] [info] checking state of node 1, 6 of 6 attempts
[2020-02-03 11:26:26] [warning] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.99.200 port=6000 fallback_application_name=repmgr"
[2020-02-03 11:26:26] [detail] pqping() returned "pqping_no_response"
[2020-02-03 11:26:26] [warning] unable to reconnect to node 1 after 6 attempts
[2020-02-03 11:26:26] [info] 0 active sibling nodes registered
[2020-02-03 11:26:26] [info] primary and this node have the same location ("default")
[2020-02-03 11:26:26] [info] no other sibling nodes - we win by default
[2020-02-03 11:26:26] [notice] this node is the only available candidate and will now promote itself
[2020-02-03 11:26:26] [info] promote_command is:
"/opt/pgsql/bin/repmgr standby promote -f /opt/repmgr.conf --log-to-file"
[2020-02-03 11:26:26] [notice] redirecting logging output to "/opt/data6000/repmgr.log"
[2020-02-03 11:26:26] [notice] promoting standby to primary
[2020-02-03 11:26:26] [detail] promoting server "node2" (id: 2) using pg_promote()
[2020-02-03 11:26:26] [notice] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2020-02-03 11:26:27] [notice] standby promote successful
[2020-02-03 11:26:27] [detail] server "node2" (id: 2) was successfully promoted to primary
[2020-02-03 11:26:27] [info] 0 followers to notify
[2020-02-03 11:26:27] [info] switching to primary monitoring mode
[2020-02-03 11:26:27] [notice] monitoring cluster primary "node2" (id: 2)
--ha2查看集群状态,ha2已经promote成功
[postgres@ha2 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf cluster show
id | name | role | status | upstream | location | priority | timeline | connection string
---- ------- --------- ----------- ---------- ---------- ---------- ---------- ---------------------------------------------------------------------------
1 | node1 | primary | - failed | | default | 100 | ? | host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2
2 | node2 | primary | * running | | default | 100 | 14 | host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2
warning: following issues were detected
- unable to connect to node "node1" (id: 1)
[postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf node rejoin -d'host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2'
info: local node 1 can attach to rejoin target node 2
detail: local node's recovery point: 0/20000028; rejoin target node's fork point: 0/200000a0
notice: setting node 1's upstream to node 2
warning: unable to ping "host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2"
detail: pqping() returned "pqping_no_response"
notice: starting server using "/opt/pgsql/bin/pg_ctl -w -d '/opt/data6000' start"
notice: node rejoin successful
detail: node 1 is now attached to node 2
--ha1查看集群状态
[postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf cluster show
id | name | role | status | upstream | location | priority | timeline | connection string
---- ------- --------- ----------- ---------- ---------- ---------- ---------- ---------------------------------------------------------------------------
1 | node1 | standby | running | node2 | default | 100 | 13 | host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2
2 | node2 | primary | * running | | default | 100 | 14 | host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2
--备库ha1执行checkpoint确保时间基线一致
[postgres@ha1 ~]$ /opt/pgsql/bin/psql -p6000 -upostgres
postgres=# delete from t;
error: cannot execute delete in a read-only transaction
postgres=# checkpoint;
checkpoint
postgres=# \q
[postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf cluster show
id | name | role | status | upstream | location | priority | timeline | connection string
---- ------- --------- ----------- ---------- ---------- ---------- ---------- ---------------------------------------------------------------------------
1 | node1 | standby | running | node2 | default | 100 | 14 | host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2
2 | node2 | primary | * running | | default | 100 | 14 | host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2
从上面可以看出通过node rejoin将刚才已经关闭的主库再次上线,并以备库的身份连到新主,加入集群成功。
[postgres@ha2 ~]$ /opt/pgsql/bin/psql -p6000 -upostgres
postgres=# insert into t values(5,'from node2');
insert 0 1
postgres=# select * from t;
id | info
---- -------------------------------
1 | 2020-02-02 21:11:51.66735 08
2 | 2020-02-02 22:05:21.757823 08
3 | 2020-02-02 22:07:03.546043 08
4 | 2020-02-03 11:03:42.050227 08
5 | from node2
(5 rows)
--关闭主库ha2
[postgres@ha2 ~]$ /opt/pgsql/bin/pg_ctl -d /opt/data6000/ stop
--1分钟后查看备库ha1,可以看到已经切换为主库
[postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf cluster show
id | name | role | status | upstream | location | priority | timeline | connection string
---- ------- --------- ----------- ---------- ---------- ---------- ---------- ---------------------------------------------------------------------------
1 | node1 | primary | * running | | default | 100 | 15 | host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2
2 | node2 | primary | - failed | | default | 100 | ? | host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2
warning: following issues were detected
- unable to connect to node "node2" (id: 2)
--此时ha2先不直接使用node rejoin
--ha1主库插入一些数据,并执行checkpoint
[postgres@ha1 ~]$ /opt/pgsql/bin/psql -p6000 -upostgres
postgres=# insert into t values(6,'from node1');
insert 0 1
postgres=# select * from t;
id | info
---- -------------------------------
1 | 2020-02-02 21:11:51.66735 08
2 | 2020-02-02 22:05:21.757823 08
3 | 2020-02-02 22:07:03.546043 08
4 | 2020-02-03 11:03:42.050227 08
5 | from node2
6 | from node1
(6 rows)
postgres=# checkpoint;
checkpoint
--ha2手工启动服务
[postgres@ha2 ~]$ /opt/pgsql/bin/pg_ctl -d /opt/data6000/ start &
[postgres@ha2 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf cluster show
id | name | role | status | upstream | location | priority | timeline | connection string
---- ------- --------- ---------------------- ---------- ---------- ---------- ---------- ---------------------------------------------------------------------------
1 | node1 | standby | ! running as primary | | default | 100 | 15 | host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2
2 | node2 | primary | * running | | default | 100 | 14 | host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2
warning: following issues were detected
- node "node1" (id: 1) is registered as standby but running as primary
--ha2插入一些数据,并执行checkpoint
[postgres@ha2 ~]$ /opt/pgsql/bin/psql -p6000 -upostgres
postgres=# insert into t values(7,'from node2');
insert 0 1
postgres=# select * from t;
id | info
---- -------------------------------
1 | 2020-02-02 21:11:51.66735 08
2 | 2020-02-02 22:05:21.757823 08
3 | 2020-02-02 22:07:03.546043 08
4 | 2020-02-03 11:03:42.050227 08
5 | from node2
7 | from node2
(6 rows)
--ha2关闭服务
[postgres@ha2 ~]$ /opt/pgsql/bin/pg_ctl -d /opt/data6000/ stop
--ha2执行node rejoin
[postgres@ha2 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf node rejoin -d'host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2'
error: this node cannot attach to rejoin target node 1
detail: rejoin target server's timeline 15 forked off current database system timeline 14 before current recovery point 0/22000028
hint: use --force-rewind to execute pg_rewind
[postgres@ha2 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf node rejoin -d'host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2' --force-rewind
notice: pg_rewind execution required for this node to attach to rejoin target node 1
detail: rejoin target server's timeline 15 forked off current database system timeline 14 before current recovery point 0/22000028
error: --force-rewind specified but pg_rewind cannot be used
detail: "wal_log_hints" is set to "off" and data checksums are disabled
从上面可以看出node rejoin操作失败,因为两边库的数据已经不一致。
[postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf primary unregister --node-id=2
info: node "node2" (id: 2) was successfully unregistered
[postgres@ha1 ~]$ vi /opt/data6000/postgresql.conf
wal_log_hints=on
[postgres@ha1 ~]$ /opt/pgsql/bin/pg_ctl -d /opt/data6000/ restart &
[postgres@ha2 ~]$ rm -rf /opt/archive6000/*
[postgres@ha2 ~]$ rm -rf /opt/data6000/*
[postgres@ha2 ~]$ /opt/pgsql/bin/repmgr -h 192.168.99.200 -p6000 -u repmgr -d repmgr -f /opt/repmgr.conf standby clone
[postgres@ha2 ~]$ vi /opt/data6000/postgresql.conf
primary_conninfo='host=192.168.99.200 port=6000 user=repuser'
[postgres@ha2 ~]$ /opt/pgsql/bin/pg_ctl -d /opt/data6000/ start &
[postgres@ha2 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf standby register
[postgres@ha2 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf cluster show
id | name | role | status | upstream | location | priority | timeline | connection string
---- ------- --------- ----------- ---------- ---------- ---------- ---------- ---------------------------------------------------------------------------
1 | node1 | primary | * running | | default | 100 | 15 | host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2
2 | node2 | standby | running | node1 | default | 100 | 15 | host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2
[postgres@ha1 ~]$ > /opt/data6000/repmgr.log
[postgres@ha2 ~]$ > /opt/data6000/repmgr.log
[postgres@ha1 ~]$ /opt/pgsql/bin/repmgrd -f /opt/repmgr.conf
[postgres@ha2 ~]$ /opt/pgsql/bin/repmgrd -f /opt/repmgr.conf
[postgres@ha1 ~]$ /opt/pgsql/bin/psql -p6000 -upostgres
postgres=# insert into t values(101,'from node1 '||now());
insert 0 1
postgres=# select * from t;
id | info
----- ------------------------------------------
1 | 2020-02-02 21:11:51.66735 08
2 | 2020-02-02 22:05:21.757823 08
3 | 2020-02-02 22:07:03.546043 08
4 | 2020-02-03 11:03:42.050227 08
5 | from node2
6 | from node1
100 | from node1 2020-02-03 12:20:27.444394 08
101 | from node1 2020-02-03 12:33:26.365621 08
(8 rows)
--关闭主库ha1
[postgres@ha1 ~]$ /opt/pgsql/bin/pg_ctl -d /opt/data6000/ stop
--1分钟后查看备库ha2,可以看到已经切换为主库
[postgres@ha2 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf cluster show
id | name | role | status | upstream | location | priority | timeline | connection string
---- ------- --------- ----------- ---------- ---------- ---------- ---------- ---------------------------------------------------------------------------
1 | node1 | primary | - failed | | default | 100 | ? | host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2
2 | node2 | primary | * running | | default | 100 | 16 | host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2
warning: following issues were detected
- unable to connect to node "node1" (id: 1)
--ha2新主库插入一些数据,并执行checkpoint
[postgres@ha2 ~]$ /opt/pgsql/bin/psql -p6000 -upostgres
postgres=# insert into t values(200,'from node2 '||now());
insert 0 1
postgres=# select * from t;
id | info
----- ------------------------------------------
1 | 2020-02-02 21:11:51.66735 08
2 | 2020-02-02 22:05:21.757823 08
3 | 2020-02-02 22:07:03.546043 08
4 | 2020-02-03 11:03:42.050227 08
5 | from node2
6 | from node1
100 | from node1 2020-02-03 12:20:27.444394 08
101 | from node1 2020-02-03 12:33:26.365621 08
200 | from node2 2020-02-03 12:39:04.007665 08
(9 rows)
postgres=# checkpoint;
checkpoint
--ha1手工启动服务
[postgres@ha1 ~]$ /opt/pgsql/bin/pg_ctl -d /opt/data6000/ start &
[postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf cluster show
id | name | role | status | upstream | location | priority | timeline | connection string
---- ------- --------- ---------------------- ---------- ---------- ---------- ---------- ---------------------------------------------------------------------------
1 | node1 | primary | * running | | default | 100 | 15 | host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2
2 | node2 | standby | ! running as primary | | default | 100 | 16 | host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2
warning: following issues were detected
- node "node2" (id: 2) is registered as standby but running as primary
--ha1插入一些数据,并执行checkpoint
[postgres@ha1 ~]$ /opt/pgsql/bin/psql -p6000 -upostgres
postgres=# insert into t values(300,'from node1 '||now());
insert 0 1
postgres=# select * from t;
id | info
----- ------------------------------------------
1 | 2020-02-02 21:11:51.66735 08
2 | 2020-02-02 22:05:21.757823 08
3 | 2020-02-02 22:07:03.546043 08
4 | 2020-02-03 11:03:42.050227 08
5 | from node2
6 | from node1
100 | from node1 2020-02-03 12:20:27.444394 08
101 | from node1 2020-02-03 12:33:26.365621 08
300 | from node1 2020-02-03 12:42:21.438444 08
(9 rows)
--ha1关闭服务
[postgres@ha1 ~]$ /opt/pgsql/bin/pg_ctl -d /opt/data6000/ stop
--ha1执行node rejoin
[postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf node rejoin -d'host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2' --force-rewind
notice: pg_rewind execution required for this node to attach to rejoin target node 2
detail: rejoin target server's timeline 16 forked off current database system timeline 15 before current recovery point 0/27000028
notice: executing pg_rewind
detail: pg_rewind command is "/opt/pgsql/bin/pg_rewind -d '/opt/data6000' --source-server='host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2'"
pg_rewind: servers diverged at wal location 0/260000a0 on timeline 15
pg_rewind: rewinding from last common checkpoint at 0/26000028 on timeline 15
pg_rewind: done!
notice: 0 files copied to /opt/data6000
notice: setting node 1's upstream to node 2
warning: unable to ping "host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2"
detail: pqping() returned "pqping_no_response"
notice: starting server using "/opt/pgsql/bin/pg_ctl -w -d '/opt/data6000' start"
warning: unable to ping "host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2"
detail: pqping() returned "pqping_reject"
notice: node rejoin successful
detail: node 1 is now attached to node 2
--ha1查看集群状态
[postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf cluster show
id | name | role | status | upstream | location | priority | timeline | connection string
---- ------- --------- ----------- ---------- ---------- ---------- ---------- ---------------------------------------------------------------------------
1 | node1 | standby | running | node2 | default | 100 | 15 | host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2
2 | node2 | primary | * running | | default | 100 | 16 | host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2
--ha1执行checkpoint或者服务restart,timeline保持一致
--验证数据
[postgres@ha1 ~]$ /opt/pgsql/bin/psql -p6000 -upostgres
postgres=# select * from t;
id | info
----- ------------------------------------------
1 | 2020-02-02 21:11:51.66735 08
2 | 2020-02-02 22:05:21.757823 08
3 | 2020-02-02 22:07:03.546043 08
4 | 2020-02-03 11:03:42.050227 08
5 | from node2
6 | from node1
100 | from node1 2020-02-03 12:20:27.444394 08
101 | from node1 2020-02-03 12:33:26.365621 08
200 | from node2 2020-02-03 12:39:04.007665 08
(9 rows)
postgres=# delete from t;
error: cannot execute delete in a read-only transaction
[postgres@ha2 ~]$ /opt/pgsql/bin/psql -p6000 -upostgres
postgres=# select * from t;
id | info
----- ------------------------------------------
1 | 2020-02-02 21:11:51.66735 08
2 | 2020-02-02 22:05:21.757823 08
3 | 2020-02-02 22:07:03.546043 08
4 | 2020-02-03 11:03:42.050227 08
5 | from node2
6 | from node1
100 | from node1 2020-02-03 12:20:27.444394 08
101 | from node1 2020-02-03 12:33:26.365621 08
200 | from node2 2020-02-03 12:39:04.007665 08
(9 rows)
postgres=# insert into t values(400,'from node2 '||now());
insert 0 1
从上面测试可以看出加上wal_log_hints后,主备库数据如果不一致时仍然可以使用node rejoin操作成功,新主库上线至旧主库rejoin之间时间段内,旧主库的写数据会丢失。
[postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf node rejoin -d’host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2’
error: this node cannot attach to rejoin target node 2
detail: rejoin target server’s timeline 8 forked off current database system timeline 7 before current recovery point 0/1a000028
hint: use --force-rewind to execute pg_rewind
–分析:主备库时间线不一致
[postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf node rejoin -d’host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2’ --force-rewind
notice: pg_rewind execution required for this node to attach to rejoin target node 2
detail: rejoin target server’s timeline 8 forked off current database system timeline 7 before current recovery point 0/1a000028
error: --force-rewind specified but pg_rewind cannot be used
detail: “wal_log_hints” is set to “off” and data checksums are disabled
–分析:数据库未开启wal_log_hints,数据目录初始化时也未打开checksums
timeline不一致问题,如何同步
1.备库执行checkpoint
2.关闭备库,执行node rejoin