View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0000339 | Pgpool-II | Bug | public | 2017-09-13 16:27 | 2018-01-11 21:28 |
| Reporter | tom-sato | Assigned To | Muhammad Usama | ||
| Priority | high | Severity | block | Reproducibility | always |
| Status | closed | Resolution | won't fix | ||
| Platform | Linux | OS | CentOS | OS Version | 6.9 |
| Product Version | 3.3.17 | ||||
| Summary | 0000339: Failover does not occurr if master Pgpool-II and primary PostgreSQL servers are stopped at the same time | ||||
| Description | There are 3 machines running PostgreSQL and Pgpool-II servers. Pgpool-II servers monitor each other using watchdog. Failover does not occurr if master Pgpool-II and primary PostgreSQL servers are stopped at the same time. The same problem also occurs with Pgpool-II 3.6.6. But the messages are different. If there are 2 Pgpool-II servers, the problem does not occurr. Attach pgpool.conf and /var/log/messages on each server. | ||||
| Steps To Reproduce | 1. Start up the 1st Pgpool-II server. $ pgpool Sep 13 15:50:47 alice pgpool[2992]: wd_chk_setuid all commands have setuid bit Sep 13 15:50:47 alice pgpool[2992]: watchdog might call network commands which using setuid bit. Sep 13 15:50:47 alice pgpool[2993]: Backend status file /tmp/pgpool_status does not exist Sep 13 15:50:47 alice pgpool[2993]: wd_create_send_socket: connect() reports failure (Connection refused). You can safely ignore this while starting up. Sep 13 15:50:47 alice pgpool[2993]: send_packet_4_nodes: packet for bob:9000 is canceled Sep 13 15:50:47 alice pgpool[2993]: wd_create_send_socket: connect() reports failure (Connection refused). You can safely ignore this while starting up. Sep 13 15:50:47 alice pgpool[2993]: send_packet_4_nodes: packet for carol:9000 is canceled Sep 13 15:50:50 alice pgpool[2993]: wd_escalation: escalating to master pgpool Sep 13 15:50:54 alice pgpool[2993]: wd_IP_up: ifconfig up succeeded Sep 13 15:50:54 alice pgpool[2993]: wd_escalation: escalated to master pgpool successfully Sep 13 15:50:54 alice pgpool[2993]: wd_init: start watchdog Sep 13 15:50:54 alice pgpool[2993]: pgpool-II successfully started. version 3.3.17 (tokakiboshi) Sep 13 15:50:54 alice pgpool[2993]: find_primary_node: primary node id is 0 Sep 13 15:50:55 alice pgpool[3003]: wd_create_hb_send_socket: set SO_REUSEPORT Sep 13 15:50:55 alice pgpool[3003]: wd_create_hb_send_socket: set SO_REUSEPORT Sep 13 15:50:55 alice pgpool[3004]: wd_create_hb_send_socket: set SO_REUSEPORT Sep 13 15:50:55 alice pgpool[3004]: wd_create_hb_recv_socket: set SO_REUSEPORT Sep 13 15:50:55 alice pgpool[3005]: wd_create_hb_send_socket: set SO_REUSEPORT Sep 13 15:50:55 alice pgpool[3005]: wd_create_hb_send_socket: set SO_REUSEPORT Sep 13 15:50:55 alice pgpool[3002]: wd_create_hb_send_socket: set SO_REUSEPORT Sep 13 15:50:55 alice pgpool[3002]: wd_create_hb_recv_socket: set SO_REUSEPORT 2. Start up the 2nd Pgpool-II server. $ pgpool Sep 13 15:51:54 bob pgpool[3706]: wd_chk_setuid all commands have setuid bit Sep 13 15:51:54 bob pgpool[3706]: watchdog might call network commands which using setuid bit. Sep 13 15:51:54 bob pgpool[3707]: Backend status file /tmp/pgpool_status does not exist Sep 13 15:51:54 bob pgpool[3707]: wd_create_send_socket: connect() reports failure (Connection refused). You can safely ignore this while starting up. Sep 13 15:51:54 bob pgpool[3707]: send_packet_4_nodes: packet for carol:9000 is canceled Sep 13 15:51:54 bob pgpool[3707]: wd_init: start watchdog Sep 13 15:51:54 bob pgpool[3707]: pgpool-II successfully started. version 3.3.17 (tokakiboshi) Sep 13 15:51:54 bob pgpool[3707]: find_primary_node: primary node id is 0 Sep 13 15:51:55 bob pgpool[3710]: wd_create_hb_send_socket: set SO_REUSEPORT Sep 13 15:51:55 bob pgpool[3710]: wd_create_hb_recv_socket: set SO_REUSEPORT Sep 13 15:51:55 bob pgpool[3711]: wd_create_hb_send_socket: set SO_REUSEPORT Sep 13 15:51:55 bob pgpool[3711]: wd_create_hb_send_socket: set SO_REUSEPORT Sep 13 15:51:55 bob pgpool[3712]: wd_create_hb_send_socket: set SO_REUSEPORT Sep 13 15:51:55 bob pgpool[3712]: wd_create_hb_recv_socket: set SO_REUSEPORT Sep 13 15:51:55 bob pgpool[3713]: wd_create_hb_send_socket: set SO_REUSEPORT Sep 13 15:51:55 bob pgpool[3713]: wd_create_hb_send_socket: set SO_REUSEPORT Sep 13 15:51:54 alice pgpool[3001]: wd_send_response: receive add request from bob:9999 and accept it 3. Start up the 3rd Pgpool-II server. $ pgpool Sep 13 15:52:30 carol pgpool[4254]: wd_chk_setuid all commands have setuid bit Sep 13 15:52:30 carol pgpool[4254]: watchdog might call network commands which using setuid bit. Sep 13 15:52:30 carol pgpool[4255]: Backend status file /tmp/pgpool_status does not exist Sep 13 15:52:30 carol pgpool[4255]: wd_init: start watchdog Sep 13 15:52:30 carol pgpool[4255]: pgpool-II successfully started. version 3.3.17 (tokakiboshi) Sep 13 15:52:30 carol pgpool[4255]: find_primary_node: primary node id is 0 Sep 13 15:52:31 carol pgpool[4259]: wd_create_hb_send_socket: set SO_REUSEPORT Sep 13 15:52:31 carol pgpool[4259]: wd_create_hb_recv_socket: set SO_REUSEPORT Sep 13 15:52:31 carol pgpool[4260]: wd_create_hb_send_socket: set SO_REUSEPORT Sep 13 15:52:31 carol pgpool[4260]: wd_create_hb_send_socket: set SO_REUSEPORT Sep 13 15:52:31 carol pgpool[4261]: wd_create_hb_send_socket: set SO_REUSEPORT Sep 13 15:52:31 carol pgpool[4261]: wd_create_hb_recv_socket: set SO_REUSEPORT Sep 13 15:52:31 carol pgpool[4262]: wd_create_hb_send_socket: set SO_REUSEPORT Sep 13 15:52:31 carol pgpool[4262]: wd_create_hb_send_socket: set SO_REUSEPORT Sep 13 15:52:30 alice pgpool[3001]: wd_send_response: receive add request from carol:9999 and accept it Sep 13 15:52:35 alice pgpool[3006]: watchdog: lifecheck started 4. Run pgbench via virtual IP. $ /usr/pgsql-9.6/bin/pgbench -h 192.168.137.105 -p 9999 -U postgres -T 300 -c 50 pgbench starting vacuum...end. 5. Forcibly stop master Pgpool-II and primary PostgreSQL servers and bring virtual IP address down. $ pgrep 'postmaster|pgpool' | xargs kill -KILL $ ifconfig eth0:0 down 6. I cannot connect to Pgpool-II server. $ psql -h 192.168.137.105 -p 9999 -U postgres -l psql: could not connect to server: No route to host Is the server running on host "192.168.137.105" and accepting TCP/IP connections on port 9999? 7. The following messages continues to be output to the log file. Sep 13 15:56:25 bob pgpool[3766]: connect_inet_domain_socket: getsockopt() detected error: Connection refused Sep 13 15:56:25 bob pgpool[3766]: make_persistent_db_connection: connection to alice(5432) failed Sep 13 15:56:25 bob pgpool[3766]: check_replication_time_lag: could not connect to DB node 0, check sr_check_user and sr_check_password Sep 13 15:56:25 bob pgpool[3707]: connect_inet_domain_socket: getsockopt() detected error: Connection refused Sep 13 15:56:25 bob pgpool[3707]: make_persistent_db_connection: connection to alice(5432) failed Sep 13 15:56:25 bob pgpool[3707]: connect_inet_domain_socket: getsockopt() detected error: Connection refused Sep 13 15:56:25 bob pgpool[3707]: make_persistent_db_connection: connection to alice(5432) failed Sep 13 15:56:25 bob pgpool[3707]: health check failed. 0 th host alice at port 5432 is down Sep 13 15:56:25 bob pgpool[3707]: health check retry sleep time: 5 second(s) (snip) Sep 13 15:56:55 bob pgpool[3714]: check_pgpool_status_by_hb: lifecheck failed. pgpool 1 (alice:9999) seems not to be working Sep 13 15:56:55 bob pgpool[3714]: pgpool_down: alice:9999 is going down Sep 13 15:56:55 bob pgpool[3714]: pgpool_down: I'm oldest so standing for master Sep 13 15:57:00 bob pgpool[3766]: connect_inet_domain_socket: getsockopt() detected error: Connection refused Sep 13 15:57:00 bob pgpool[3766]: make_persistent_db_connection: connection to alice(5432) failed Sep 13 15:57:00 bob pgpool[3766]: check_replication_time_lag: could not connect to DB node 0, check sr_check_user and sr_check_password Sep 13 15:57:00 bob pgpool[3707]: connect_inet_domain_socket: getsockopt() detected error: Connection refused Sep 13 15:57:00 bob pgpool[3707]: make_persistent_db_connection: connection to alice(5432) failed Sep 13 15:57:00 bob pgpool[3707]: health check failed. 0 th host alice at port 5432 is down Sep 13 15:57:00 bob pgpool[3707]: set 0 th backend down status Sep 13 15:57:00 bob pgpool[3707]: wd_start_interlock: start interlocking Sep 13 15:57:00 bob pgpool[3707]: wd_create_send_socket: connect() reports failure (Connection refused). You can safely ignore this while starting up. Sep 13 15:57:00 bob pgpool[3707]: send_packet_4_nodes: packet for alice:9000 is canceled Sep 13 15:57:00 bob pgpool[3707]: wd_create_send_socket: connect() reports failure (Connection refused). You can safely ignore this while starting up. Sep 13 15:57:00 bob pgpool[3707]: send_packet_4_nodes: packet for alice:9000 is canceled Sep 13 15:57:01 bob pgpool[3707]: wd_create_send_socket: connect() reports failure (Connection refused). You can safely ignore this while starting up. Sep 13 15:57:01 bob pgpool[3707]: send_packet_4_nodes: packet for alice:9000 is canceled | ||||
| Tags | failover, pgpool, watchdog | ||||
|
|
|
|
|
Hi, We have fixed similar issue in Pgpool-II 3.5 and 3.6 branches. So can you please try on the latest Pgpool-II releases for these branches to check if the issue is fixed. As for the older versions prior to Pgpool-II 3.5, the watchdog in those versions has some known shortcomings and uses the old binary communication protocol with no dedicated watchdog process, so it is difficult to get that fix in the older versions. |
|
|
|
|
|
Hi, Thank you for your reply. > We have fixed similar issue in Pgpool-II 3.5 and 3.6 branches. So can you please try on the latest Pgpool-II releases for these branches to check if the issue is fixed. I tried it in version 3.6.7, but the same problem occurs. > As for the older versions prior to Pgpool-II 3.5, the watchdog in those versions has some known shortcomings and uses the old binary communication protocol with no dedicated watchdog process, so it is difficult to get that fix in the older versions. OK, It can not be helped. |
|
|
Hi Looking at the logs you shared in the last response, It seems that pgpool-II is successfully initiating and propagating the failover request. But I can see couple of issues which may be contributing to the system not behaving as expected. 1-) health checking is disabled on all pgpool servers, and as you are using the pgbench on a delegate-ip so only the master pgpool will be able to detect the backend failure while other pgpools will remain inactive and because of disabled health check they would not know about the backend PG server becoming unavailable. 2-) Second the bob log "messages-bob-2.txt" suggests that the pgpool is continuously busy looking for primary node after starting up, So can you verify your setup why it is not able to find the primary server. Since the log has no debug information so it is difficult to guess the cause for that. Can you please re-run the test after enabling the health-check on all pgpool nodes and also enabling the DEBUG messages output. If the problem persists, it would be easier to debug with DEBUG enabled. Thanks |
|
|
Hello, The reason why the problem occurred in Pgpool-II 3.6 series is that I did not correctly specify the command option of the failover_command parameter and failover was not executed successfully. If I correctly specified the command option, the problem did not occur in both Pgpool-II 3.6.6 and 3.6.7. In the Pgpool-II 3.3 series, problems occur even if the command option is correctly specified. It seems to be a problem specific to the old version. Thank you. |
|
|
Hi, Please close this ticket with "won't fix" status in the Pgpool-II 3.3 series. Thanks. |
|
|
The issue is fixed in Pgpool-II 3.5, 3.6 and 3.7 branches. But unfortunately for the older versions prior to Pgpool-II 3.5, the watchdog in those versions has some known shortcomings and uses the old binary communication protocol with no dedicated watchdog process and it is not feasible to get that fix in the older versions. |
| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2017-09-13 16:27 | tom-sato | New Issue | |
| 2017-09-13 16:27 | tom-sato | File Added: pgpool-alice.conf | |
| 2017-09-13 16:27 | tom-sato | File Added: pgpool-bob.conf | |
| 2017-09-13 16:27 | tom-sato | File Added: pgpool-carol.conf | |
| 2017-09-13 16:27 | tom-sato | File Added: messages-alice.txt | |
| 2017-09-13 16:27 | tom-sato | File Added: messages-bob.txt | |
| 2017-09-13 16:27 | tom-sato | File Added: messages-carol.txt | |
| 2017-09-13 16:27 | tom-sato | Tag Attached: failover | |
| 2017-09-13 16:27 | tom-sato | Tag Attached: pgpool | |
| 2017-09-13 16:27 | tom-sato | Tag Attached: watchdog | |
| 2017-10-03 15:49 | t-ishii | Assigned To | => Muhammad Usama |
| 2017-10-03 15:49 | t-ishii | Status | new => assigned |
| 2017-12-14 00:23 | Muhammad Usama | Note Added: 0001867 | |
| 2017-12-26 15:42 | tom-sato | File Added: pgpool-alice-2.conf | |
| 2017-12-26 15:42 | tom-sato | File Added: pgpool-bob-2.conf | |
| 2017-12-26 15:42 | tom-sato | File Added: pgpool-carol-2.conf | |
| 2017-12-26 15:42 | tom-sato | File Added: ps-ax-alice.txt | |
| 2017-12-26 15:42 | tom-sato | File Added: ps-ax-bob.txt | |
| 2017-12-26 15:42 | tom-sato | File Added: ps-ax-carol.txt | |
| 2017-12-26 15:42 | tom-sato | File Added: messages-alice-2.txt | |
| 2017-12-26 15:42 | tom-sato | File Added: messages-bob-2.txt | |
| 2017-12-26 15:42 | tom-sato | File Added: messages-carol-2.txt | |
| 2017-12-26 15:42 | tom-sato | Note Added: 0001880 | |
| 2017-12-27 20:48 | Muhammad Usama | Note Added: 0001883 | |
| 2018-01-04 14:28 | tom-sato | Note Added: 0001886 | |
| 2018-01-11 14:30 | tom-sato | Note Added: 0001888 | |
| 2018-01-11 21:28 | Muhammad Usama | Status | assigned => closed |
| 2018-01-11 21:28 | Muhammad Usama | Resolution | open => won't fix |
| 2018-01-11 21:28 | Muhammad Usama | Note Added: 0001889 |