Automatic failover relies on the repmgrd daemon. repmgrd only has an effect on standby nodes; it can also be started on the primary, but there it takes no failover action.
Step 1. Check which node is currently the standby
[pgadm@pgrep1 ~]$ repmgr -f /pgdata/repmgr.conf cluster show
 ID | Name   | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------
 1  | pgrep1 | standby |   running | pgrep2   | default  | 100      | 12       | host=pgrep1 user=repmgr dbname=repmgr connect_timeout=2
 2  | pgrep2 | primary | * running |          | default  | 100      | 12       | host=pgrep2 user=repmgr dbname=repmgr connect_timeout=2
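If this check needs to be scripted rather than read by eye, the Role column can be pulled out of the table with awk. The following is only a sketch: the function name nodes_with_role is made up for illustration, and it assumes the pipe-separated layout printed by cluster show above.

```shell
# Hypothetical helper (not part of repmgr): print the names of all nodes
# whose Role column equals $1, reading `repmgr cluster show` output on stdin.
# Assumes the "ID | Name | Role | ..." column order shown in this article.
nodes_with_role() {
    awk -F'|' -v role="$1" '
        # Strip leading and trailing blanks from a field.
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        # Data rows have at least three pipe-separated fields; the dashed
        # separator line and WARNING lines contain no pipes and are skipped.
        NF >= 3 && trim($3) == role { print trim($2) }'
}

# Example usage on a cluster node:
#   repmgr -f /pgdata/repmgr.conf cluster show | nodes_with_role standby
```

The header row is ignored automatically because its Role field is the literal word Role, which never matches a real role name.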
Step 2. Check whether repmgrd is already running. If it is, kill the repmgrd process and start it again; this does not affect the PostgreSQL service.
[pgadm@pgrep1 ~]$ ps -ef | grep repmgrd | grep -v grep
[pgadm@pgrep1 ~]$ repmgrd -f /pgdata/repmgr.conf -d
[2022-03-28 15:37:00] [NOTICE] redirecting logging output to "/tmp/repmgr.log"
Step 3. Confirm that Paused? is no. If it is yes, automatic failover will not take place.
[pgadm@pgrep1 ~]$ repmgr -f /pgdata/repmgr.conf service status
 ID | Name   | Role    | Status    | Upstream | repmgrd     | PID    | Paused? | Upstream last seen
----+--------+---------+-----------+----------+-------------+--------+---------+--------------------
 1  | pgrep1 | standby |   running | pgrep2   | running     | 327705 | no      | 0 second(s) ago
 2  | pgrep2 | primary | * running |          | not running | 95165  | no      | n/a
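The two conditions that matter for automatic failover (repmgrd running on the standby, and Paused? set to no) can also be checked from a script. This is a minimal sketch under the assumption that service status keeps the column order shown above; failover_ready is a hypothetical name, not a repmgr command.

```shell
# Hypothetical helper: read `repmgr service status` output on stdin and
# return 0 only if every standby has repmgrd running and is not paused.
# Column positions (3 = Role, 6 = repmgrd, 8 = Paused?) are taken from the
# table in this article and are an assumption about the output layout.
failover_ready() {
    awk -F'|' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        NF >= 8 && trim($3) == "standby" {
            seen = 1
            if (trim($6) != "running" || trim($8) != "no") {
                printf "node %s: repmgrd=%s paused=%s\n", trim($2), trim($6), trim($8)
                bad = 1
            }
        }
        # Fail if a standby is not ready, or if no standby row was found.
        END { if (bad || !seen) exit 1 }'
}

# Example usage:
#   repmgr -f /pgdata/repmgr.conf service status | failover_ready \
#       || echo "automatic failover would NOT happen right now"
```

Only standby rows are inspected, which matches the observation above that repmgrd showing as not running on the primary is harmless.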
Step 4. Test automatic failover: kill the postgres process on node2 so that the primary role moves from node2 to node1
[pgadm@pgrep2 ~]$ ps -ef | grep postgres
pgadm 150753 1 0 14:41 ? 00:00:00 /pgbin/pghome_1/bin/postgres -D /pgdata/dbdata
[pgadm@pgrep2 ~]$ kill -9 150753
Step 5. Check repmgr.log on node1
[2022-03-28 15:45:46] [WARNING] unable to ping "host=pgrep2 user=repmgr dbname=repmgr connect_timeout=2"
[2022-03-28 15:45:46] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-28 15:45:46] [WARNING] unable to connect to upstream node "pgrep2" (ID: 2)
[2022-03-28 15:45:46] [INFO] checking state of node "pgrep2" (ID: 2), 1 of 3 attempts
[2022-03-28 15:45:46] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=pgrep2 fallback_application_name=repmgr"
[2022-03-28 15:45:46] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-28 15:45:46] [INFO] sleeping up to 10 seconds until next reconnection attempt
[2022-03-28 15:45:56] [INFO] checking state of node "pgrep2" (ID: 2), 2 of 3 attempts
[2022-03-28 15:45:56] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=pgrep2 fallback_application_name=repmgr"
[2022-03-28 15:45:56] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-28 15:45:56] [INFO] sleeping up to 10 seconds until next reconnection attempt
[2022-03-28 15:46:06] [INFO] checking state of node "pgrep2" (ID: 2), 3 of 3 attempts
[2022-03-28 15:46:06] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=pgrep2 fallback_application_name=repmgr"
[2022-03-28 15:46:06] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-28 15:46:06] [WARNING] unable to reconnect to node "pgrep2" (ID: 2) after 3 attempts
[2022-03-28 15:46:06] [INFO] 0 active sibling nodes registered
[2022-03-28 15:46:06] [INFO] 2 total nodes registered
[2022-03-28 15:46:06] [INFO] primary node "pgrep2" (ID: 2) and this node have the same location ("default")
[2022-03-28 15:46:06] [INFO] no other sibling nodes - we win by default
[2022-03-28 15:46:06] [NOTICE] this node is the only available candidate and will now promote itself
[2022-03-28 15:46:06] [INFO] promote_command is: "/pgdata/repmgr_promote.sh"
[2022-03-28 15:46:06]: remove VIP start.
Pseudo-terminal will not be allocated because stdin is not a terminal.^M
[2022-03-28 15:46:06]: remove VIP finish.
Connection fail, failover start.
[2022-03-28 15:46:06]: promote start
failover Start
[2022-03-28 15:46:06] [NOTICE] redirecting logging output to "/tmp/repmgr.log"
[2022-03-28 15:46:06] [NOTICE] promoting standby to primary
[2022-03-28 15:46:06] [DETAIL] promoting server "pgrep1" (ID: 1) using pg_promote()
[2022-03-28 15:46:06] [NOTICE] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2022-03-28 15:46:07] [NOTICE] STANDBY PROMOTE successful
[2022-03-28 15:46:07] [DETAIL] server "pgrep1" (ID: 1) was successfully promoted to primary
failover Success.
[2022-03-28 15:46:07]: promote finish
[2022-03-28 15:46:07]: remove VIP start.
Pseudo-terminal will not be allocated because stdin is not a terminal.^M
RTNETLINK answers: Cannot assign requested address
[2022-03-28 15:46:07]: remove VIP finish.
[2022-03-28 15:46:07]: add VIP start.
[2022-03-28 15:46:07]: add VIP finish.
[2022-03-28 15:46:07] [INFO] checking state of node 1, 1 of 3 attempts
[2022-03-28 15:46:07] [NOTICE] node 1 has recovered, reconnecting
[2022-03-28 15:46:07] [INFO] connection to node 1 succeeded
[2022-03-28 15:46:07] [INFO] original connection is still available
[2022-03-28 15:46:07] [INFO] 0 followers to notify
[2022-03-28 15:46:07] [INFO] switching to primary monitoring mode
[2022-03-28 15:46:07] [NOTICE] monitoring cluster primary "pgrep1" (ID: 1)
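The timestamps in this log show how long repmgrd waits before promoting: the three connection attempts happen at 15:45:46, 15:45:56 and 15:46:06, so the standby promotes itself about 20 seconds after the first failed ping. The arithmetic can be sketched as below. The function name is hypothetical, and the assumption that the two numbers come from the reconnect_attempts and reconnect_interval settings in repmgr.conf is not confirmed by the article, which does not show that file.

```shell
# Hypothetical helper: approximate the minimum delay, in seconds, between
# the first failed check and the promotion decision.
#   $1 = number of reconnection attempts (3 in the log above)
#   $2 = seconds slept between attempts  (10 in the log above)
# The first attempt happens immediately, so only (attempts - 1) sleeps occur.
# Time spent inside each connection attempt (e.g. connect_timeout) is ignored,
# so treat the result as a lower bound.
min_detection_seconds() {
    echo $(( ($1 - 1) * $2 ))
}

# Example usage with the values seen in the log:
#   min_detection_seconds 3 10
```

With the values from the log this gives 20, matching the gap between 15:45:46 and 15:46:06.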
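Step 6. At this point the original primary, node2, is shut down automatically. If you start the old primary manually, the cluster will report a problem, but the node whose status is marked with " * " is always the current primary.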
[pgadm@pgrep2 ~]$ pg_ctl start -D $PGDATA
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2022-03-28 15:50:35.440 CST [152774] LOG: redirecting log output to logging collector process
2022-03-28 15:50:35.440 CST [152774] HINT: Future log output will appear in directory "/pgdata/pglog".
 done
server started
[pgadm@pgrep1 ~]$ repmgr -f /pgdata/repmgr.conf cluster show
 ID | Name   | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------
 1  | pgrep1 | primary | * running |          | default  | 100      | 13       | host=pgrep1 user=repmgr dbname=repmgr connect_timeout=2
 2  | pgrep2 | primary | ! running |          | default  | 100      | 12       | host=pgrep2 user=repmgr dbname=repmgr connect_timeout=2
WARNING: following issues were detected
  - node "pgrep2" (ID: 2) is running but the repmgr node record is inactive
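A situation like this, where two nodes claim the primary role, can be spotted automatically by looking for primaries whose Status does not start with the " * " marker. The sketch below assumes the cluster show layout printed above; stale_primaries is an illustrative name, not a repmgr feature.

```shell
# Hypothetical helper: read `repmgr cluster show` output on stdin and print
# the names of nodes that report Role = primary but whose Status column does
# not begin with "*", i.e. nodes that are not the current primary.
stale_primaries() {
    awk -F'|' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        NF >= 4 && trim($3) == "primary" && trim($4) !~ /^\*/ { print trim($2) }'
}

# Example usage:
#   repmgr -f /pgdata/repmgr.conf cluster show | stale_primaries
```

For the output above this prints pgrep2, the node that still has to be rejoined.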
Step 7. The old primary, node2, must be rejoined to the cluster. Run the command with --dry-run first; if no problems are reported, drop --dry-run and run it for real.
[pgadm@pgrep2 ~]$ repmgr node rejoin -f /pgdata/repmgr.conf \
    -d 'host=pgrep1 dbname=repmgr user=repmgr' \
    --force-rewind --verbose --dry-run
NOTICE: using provided configuration file "/pgdata/repmgr.conf"
NOTICE: rejoin target is node "pgrep1" (ID: 1)
INFO: replication connection to the rejoin target node was successful
INFO: local and rejoin target system identifiers match
DETAIL: system identifier is 7076346855312920271
NOTICE: pg_rewind execution required for this node to attach to rejoin target node 1
DETAIL: rejoin target server's timeline 13 forked off current database system timeline 12 before current recovery point 0/24000028
INFO: prerequisites for using pg_rewind are met
INFO: temporary archive directory "/tmp/repmgr-config-archive-pgrep2" created
INFO: 0 files would have been copied to "/tmp/repmgr-config-archive-pgrep2"
INFO: temporary archive directory "/tmp/repmgr-config-archive-pgrep2" deleted
INFO: pg_rewind would now be executed
DETAIL: pg_rewind command is:
  /pgbin/pghome_1/bin/pg_rewind -D '/pgdata/dbdata' --source-server='host=pgrep1 user=repmgr dbname=repmgr connect_timeout=2'
INFO: prerequisites for executing NODE REJOIN are met
[pgadm@pgrep2 ~]$ repmgr node rejoin -f /pgdata/repmgr.conf \
    -d 'host=pgrep1 dbname=repmgr user=repmgr' \
    --force-rewind --verbose
NOTICE: using provided configuration file "/pgdata/repmgr.conf"
NOTICE: rejoin target is node "pgrep1" (ID: 1)
NOTICE: pg_rewind execution required for this node to attach to rejoin target node 1
DETAIL: rejoin target server's timeline 13 forked off current database system timeline 12 before current recovery point 0/24000028
INFO: prerequisites for using pg_rewind are met
INFO: 0 files copied to "/tmp/repmgr-config-archive-pgrep2"
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is "/pgbin/pghome_1/bin/pg_rewind -D '/pgdata/dbdata' --source-server='host=pgrep1 user=repmgr dbname=repmgr connect_timeout=2'"
NOTICE: 0 files copied to /pgdata/dbdata
INFO: directory "/tmp/repmgr-config-archive-pgrep2" deleted
NOTICE: setting node 2's upstream to node 1
WARNING: unable to ping "host=pgrep2 user=repmgr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "/pgbin/pghome_1/bin/pg_ctl -w -D '/pgdata/dbdata' start"
INFO: node "pgrep2" (ID: 2) is pingable
INFO: node "pgrep2" (ID: 2) has attached to its upstream node
NOTICE: NODE REJOIN successful
DETAIL: node 2 is now attached to node 1
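The "dry run first, then the real run" routine can be wrapped in a small function so that the real rejoin is never attempted when the dry run fails. This is a sketch only: rejoin_node and the REPMGR_BIN variable are names invented here, while the repmgr options are exactly the ones used in the two commands above.

```shell
# Command used to invoke repmgr; overridable so the function can be exercised
# with a stub. REPMGR_BIN is a hypothetical variable, not a repmgr setting.
REPMGR_BIN=${REPMGR_BIN:-repmgr}

# Hypothetical wrapper: run `repmgr node rejoin` with --dry-run and only
# perform the real rejoin if the dry run exits successfully.
#   $1 = path to repmgr.conf on the node being rejoined
#   $2 = conninfo string of the rejoin target (the current primary)
rejoin_node() {
    conf=$1
    target=$2
    "$REPMGR_BIN" node rejoin -f "$conf" -d "$target" \
        --force-rewind --verbose --dry-run || {
        echo "dry run failed, not rejoining" >&2
        return 1
    }
    "$REPMGR_BIN" node rejoin -f "$conf" -d "$target" \
        --force-rewind --verbose
}

# Example usage on the old primary (pgrep2):
#   rejoin_node /pgdata/repmgr.conf 'host=pgrep1 dbname=repmgr user=repmgr'
```

Because --force-rewind lets pg_rewind overwrite data on the old primary, gating it behind a successful dry run keeps the destructive step from running when the prerequisites are not met.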
Note: To keep repmgrd from failing over automatically, pause it manually with service pause; to re-enable automatic failover, run service unpause.
[pgadm@pgrep1 ~]$ repmgr -f /pgdata/repmgr.conf service pause
NOTICE: node 1 (pgrep1) paused
NOTICE: node 2 (pgrep2) paused
[pgadm@pgrep1 ~]$ repmgr -f /pgdata/repmgr.conf service status
 ID | Name   | Role    | Status    | Upstream | repmgrd     | PID    | Paused? | Upstream last seen
----+--------+---------+-----------+----------+-------------+--------+---------+--------------------
 1  | pgrep1 | standby |   running | pgrep2   | running     | 327705 | yes     | 3 second(s) ago
 2  | pgrep2 | primary | * running |          | not running | 95165  | yes     | n/a
[pgadm@pgrep1 ~]$ repmgr -f /pgdata/repmgr.conf service unpause
NOTICE: node 1 (pgrep1) unpaused
NOTICE: node 2 (pgrep2) unpaused
[pgadm@pgrep1 ~]$ repmgr -f /pgdata/repmgr.conf service status
 ID | Name   | Role    | Status    | Upstream | repmgrd     | PID    | Paused? | Upstream last seen
----+--------+---------+-----------+----------+-------------+--------+---------+--------------------
 1  | pgrep1 | standby |   running | pgrep2   | running     | 327705 | no      | 0 second(s) ago
 2  | pgrep2 | primary | * running |          | not running | 95165  | no      | n/a
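For planned maintenance it is easy to forget the unpause step. One way to guard against that is a wrapper that pauses, runs a command, and always unpauses afterwards. The sketch below is an illustration only; with_failover_paused and REPMGR_BIN are hypothetical names, and only the service pause and service unpause subcommands come from the session above.

```shell
# Command used to invoke repmgr; overridable for testing with a stub.
# REPMGR_BIN is a hypothetical variable, not a repmgr setting.
REPMGR_BIN=${REPMGR_BIN:-repmgr}

# Hypothetical wrapper: pause automatic failover, run the given command,
# then unpause again even if the command failed.
#   $1      = path to repmgr.conf
#   $2 ...  = command (and arguments) to run while failover is paused
# Returns the exit status of the wrapped command.
with_failover_paused() {
    conf=$1
    shift
    "$REPMGR_BIN" -f "$conf" service pause || return 1
    rc=0
    "$@" || rc=$?
    "$REPMGR_BIN" -f "$conf" service unpause || return 1
    return "$rc"
}

# Example usage (the restart command is only an illustration):
#   with_failover_paused /pgdata/repmgr.conf pg_ctl restart -D "$PGDATA"
```

The wrapper does not cover the shell itself being killed mid-run, so checking service status afterwards is still worthwhile.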
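Note: After each automatic failover, repmgrd is not running on the node that ends up as the standby (here pgrep2, the rejoined old primary), so it has to be started there manually. Once it is started, service status shows repmgrd as running.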
[pgadm@pgrep2 ~]$ repmgrd -f /pgdata/repmgr.conf -d
[2022-03-28 16:06:06] [NOTICE] redirecting logging output to "/tmp/repmgr.log"
[pgadm@pgrep1 ~]$ repmgr -f /pgdata/repmgr.conf service status
 ID | Name   | Role    | Status    | Upstream | repmgrd | PID    | Paused? | Upstream last seen
----+--------+---------+-----------+----------+---------+--------+---------+--------------------
 1  | pgrep1 | primary | * running |          | running | 327705 | no      | n/a
 2  | pgrep2 | standby |   running | pgrep1   | running | 152828 | no      | 4 second(s) ago