View Issue Details

IDProjectCategoryView StatusLast Update
0000339Pgpool-IIBugpublic2018-01-11 21:28
Reportertom-sato Assigned ToMuhammad Usama  
PriorityhighSeverityblockReproducibilityalways
Status closedResolutionwon't fix 
PlatformLinuxOSCentOSOS Version6.9
Product Version3.3.17 
Summary0000339: Failover does not occurr if master Pgpool-II and primary PostgreSQL servers are stopped at the same time
DescriptionThere are 3 machines running PostgreSQL and Pgpool-II servers. Pgpool-II servers monitor each other using watchdog. Failover does not occurr if master Pgpool-II and primary PostgreSQL servers are stopped at the same time.

The same problem also occurs with Pgpool-II 3.6.6. But the messages are different. If there are 2 Pgpool-II servers, the problem does not occurr.

Attach pgpool.conf and /var/log/messages on each server.
Steps To Reproduce1. Start up the 1st Pgpool-II server.

$ pgpool

Sep 13 15:50:47 alice pgpool[2992]: wd_chk_setuid all commands have setuid bit
Sep 13 15:50:47 alice pgpool[2992]: watchdog might call network commands which using setuid bit.
Sep 13 15:50:47 alice pgpool[2993]: Backend status file /tmp/pgpool_status does not exist
Sep 13 15:50:47 alice pgpool[2993]: wd_create_send_socket: connect() reports failure (Connection refused). You can safely ignore this while starting up.
Sep 13 15:50:47 alice pgpool[2993]: send_packet_4_nodes: packet for bob:9000 is canceled
Sep 13 15:50:47 alice pgpool[2993]: wd_create_send_socket: connect() reports failure (Connection refused). You can safely ignore this while starting up.
Sep 13 15:50:47 alice pgpool[2993]: send_packet_4_nodes: packet for carol:9000 is canceled
Sep 13 15:50:50 alice pgpool[2993]: wd_escalation: escalating to master pgpool
Sep 13 15:50:54 alice pgpool[2993]: wd_IP_up: ifconfig up succeeded
Sep 13 15:50:54 alice pgpool[2993]: wd_escalation: escalated to master pgpool successfully
Sep 13 15:50:54 alice pgpool[2993]: wd_init: start watchdog
Sep 13 15:50:54 alice pgpool[2993]: pgpool-II successfully started. version 3.3.17 (tokakiboshi)
Sep 13 15:50:54 alice pgpool[2993]: find_primary_node: primary node id is 0
Sep 13 15:50:55 alice pgpool[3003]: wd_create_hb_send_socket: set SO_REUSEPORT
Sep 13 15:50:55 alice pgpool[3003]: wd_create_hb_send_socket: set SO_REUSEPORT
Sep 13 15:50:55 alice pgpool[3004]: wd_create_hb_send_socket: set SO_REUSEPORT
Sep 13 15:50:55 alice pgpool[3004]: wd_create_hb_recv_socket: set SO_REUSEPORT
Sep 13 15:50:55 alice pgpool[3005]: wd_create_hb_send_socket: set SO_REUSEPORT
Sep 13 15:50:55 alice pgpool[3005]: wd_create_hb_send_socket: set SO_REUSEPORT
Sep 13 15:50:55 alice pgpool[3002]: wd_create_hb_send_socket: set SO_REUSEPORT
Sep 13 15:50:55 alice pgpool[3002]: wd_create_hb_recv_socket: set SO_REUSEPORT

2. Start up the 2nd Pgpool-II server.

$ pgpool

Sep 13 15:51:54 bob pgpool[3706]: wd_chk_setuid all commands have setuid bit
Sep 13 15:51:54 bob pgpool[3706]: watchdog might call network commands which using setuid bit.
Sep 13 15:51:54 bob pgpool[3707]: Backend status file /tmp/pgpool_status does not exist
Sep 13 15:51:54 bob pgpool[3707]: wd_create_send_socket: connect() reports failure (Connection refused). You can safely ignore this while starting up.
Sep 13 15:51:54 bob pgpool[3707]: send_packet_4_nodes: packet for carol:9000 is canceled
Sep 13 15:51:54 bob pgpool[3707]: wd_init: start watchdog
Sep 13 15:51:54 bob pgpool[3707]: pgpool-II successfully started. version 3.3.17 (tokakiboshi)
Sep 13 15:51:54 bob pgpool[3707]: find_primary_node: primary node id is 0
Sep 13 15:51:55 bob pgpool[3710]: wd_create_hb_send_socket: set SO_REUSEPORT
Sep 13 15:51:55 bob pgpool[3710]: wd_create_hb_recv_socket: set SO_REUSEPORT
Sep 13 15:51:55 bob pgpool[3711]: wd_create_hb_send_socket: set SO_REUSEPORT
Sep 13 15:51:55 bob pgpool[3711]: wd_create_hb_send_socket: set SO_REUSEPORT
Sep 13 15:51:55 bob pgpool[3712]: wd_create_hb_send_socket: set SO_REUSEPORT
Sep 13 15:51:55 bob pgpool[3712]: wd_create_hb_recv_socket: set SO_REUSEPORT
Sep 13 15:51:55 bob pgpool[3713]: wd_create_hb_send_socket: set SO_REUSEPORT
Sep 13 15:51:55 bob pgpool[3713]: wd_create_hb_send_socket: set SO_REUSEPORT

Sep 13 15:51:54 alice pgpool[3001]: wd_send_response: receive add request from bob:9999 and accept it

3. Start up the 3rd Pgpool-II server.

$ pgpool

Sep 13 15:52:30 carol pgpool[4254]: wd_chk_setuid all commands have setuid bit
Sep 13 15:52:30 carol pgpool[4254]: watchdog might call network commands which using setuid bit.
Sep 13 15:52:30 carol pgpool[4255]: Backend status file /tmp/pgpool_status does not exist
Sep 13 15:52:30 carol pgpool[4255]: wd_init: start watchdog
Sep 13 15:52:30 carol pgpool[4255]: pgpool-II successfully started. version 3.3.17 (tokakiboshi)
Sep 13 15:52:30 carol pgpool[4255]: find_primary_node: primary node id is 0
Sep 13 15:52:31 carol pgpool[4259]: wd_create_hb_send_socket: set SO_REUSEPORT
Sep 13 15:52:31 carol pgpool[4259]: wd_create_hb_recv_socket: set SO_REUSEPORT
Sep 13 15:52:31 carol pgpool[4260]: wd_create_hb_send_socket: set SO_REUSEPORT
Sep 13 15:52:31 carol pgpool[4260]: wd_create_hb_send_socket: set SO_REUSEPORT
Sep 13 15:52:31 carol pgpool[4261]: wd_create_hb_send_socket: set SO_REUSEPORT
Sep 13 15:52:31 carol pgpool[4261]: wd_create_hb_recv_socket: set SO_REUSEPORT
Sep 13 15:52:31 carol pgpool[4262]: wd_create_hb_send_socket: set SO_REUSEPORT
Sep 13 15:52:31 carol pgpool[4262]: wd_create_hb_send_socket: set SO_REUSEPORT

Sep 13 15:52:30 alice pgpool[3001]: wd_send_response: receive add request from carol:9999 and accept it
Sep 13 15:52:35 alice pgpool[3006]: watchdog: lifecheck started

4. Run pgbench via virtual IP.

$ /usr/pgsql-9.6/bin/pgbench -h 192.168.137.105 -p 9999 -U postgres -T 300 -c 50 pgbench
starting vacuum...end.

5. Forcibly stop master Pgpool-II and primary PostgreSQL servers and bring virtual IP address down.

$ pgrep 'postmaster|pgpool' | xargs kill -KILL
$ ifconfig eth0:0 down

6. I cannot connect to Pgpool-II server.

$ psql -h 192.168.137.105 -p 9999 -U postgres -l
psql: could not connect to server: No route to host
        Is the server running on host "192.168.137.105" and accepting
        TCP/IP connections on port 9999?

7. The following messages continues to be output to the log file.

Sep 13 15:56:25 bob pgpool[3766]: connect_inet_domain_socket: getsockopt() detected error: Connection refused
Sep 13 15:56:25 bob pgpool[3766]: make_persistent_db_connection: connection to alice(5432) failed
Sep 13 15:56:25 bob pgpool[3766]: check_replication_time_lag: could not connect to DB node 0, check sr_check_user and sr_check_password
Sep 13 15:56:25 bob pgpool[3707]: connect_inet_domain_socket: getsockopt() detected error: Connection refused
Sep 13 15:56:25 bob pgpool[3707]: make_persistent_db_connection: connection to alice(5432) failed
Sep 13 15:56:25 bob pgpool[3707]: connect_inet_domain_socket: getsockopt() detected error: Connection refused
Sep 13 15:56:25 bob pgpool[3707]: make_persistent_db_connection: connection to alice(5432) failed
Sep 13 15:56:25 bob pgpool[3707]: health check failed. 0 th host alice at port 5432 is down
Sep 13 15:56:25 bob pgpool[3707]: health check retry sleep time: 5 second(s)
(snip)
Sep 13 15:56:55 bob pgpool[3714]: check_pgpool_status_by_hb: lifecheck failed. pgpool 1 (alice:9999) seems not to be working
Sep 13 15:56:55 bob pgpool[3714]: pgpool_down: alice:9999 is going down
Sep 13 15:56:55 bob pgpool[3714]: pgpool_down: I'm oldest so standing for master
Sep 13 15:57:00 bob pgpool[3766]: connect_inet_domain_socket: getsockopt() detected error: Connection refused
Sep 13 15:57:00 bob pgpool[3766]: make_persistent_db_connection: connection to alice(5432) failed
Sep 13 15:57:00 bob pgpool[3766]: check_replication_time_lag: could not connect to DB node 0, check sr_check_user and sr_check_password
Sep 13 15:57:00 bob pgpool[3707]: connect_inet_domain_socket: getsockopt() detected error: Connection refused
Sep 13 15:57:00 bob pgpool[3707]: make_persistent_db_connection: connection to alice(5432) failed
Sep 13 15:57:00 bob pgpool[3707]: health check failed. 0 th host alice at port 5432 is down
Sep 13 15:57:00 bob pgpool[3707]: set 0 th backend down status
Sep 13 15:57:00 bob pgpool[3707]: wd_start_interlock: start interlocking
Sep 13 15:57:00 bob pgpool[3707]: wd_create_send_socket: connect() reports failure (Connection refused). You can safely ignore this while starting up.
Sep 13 15:57:00 bob pgpool[3707]: send_packet_4_nodes: packet for alice:9000 is canceled
Sep 13 15:57:00 bob pgpool[3707]: wd_create_send_socket: connect() reports failure (Connection refused). You can safely ignore this while starting up.
Sep 13 15:57:00 bob pgpool[3707]: send_packet_4_nodes: packet for alice:9000 is canceled
Sep 13 15:57:01 bob pgpool[3707]: wd_create_send_socket: connect() reports failure (Connection refused). You can safely ignore this while starting up.
Sep 13 15:57:01 bob pgpool[3707]: send_packet_4_nodes: packet for alice:9000 is canceled
Tagsfailover, pgpool, watchdog

Activities

tom-sato

2017-09-13 16:27

reporter  

pgpool-alice.conf (31,176 bytes)
pgpool-bob.conf (31,178 bytes)
pgpool-carol.conf (31,176 bytes)
messages-alice.txt (2,744 bytes)   
messages-alice.txt (2,744 bytes)   
messages-bob.txt (2,669,660 bytes)
messages-carol.txt (153,373 bytes)   
messages-carol.txt (153,373 bytes)   

Muhammad Usama

2017-12-14 00:23

developer   ~0001867

Hi,

We have fixed similar issue in Pgpool-II 3.5 and 3.6 branches. So can you please try on the latest Pgpool-II releases for these branches to check if the issue is fixed.
As for the older versions prior to Pgpool-II 3.5, the watchdog in those versions has some known shortcomings and uses the old binary communication protocol with no dedicated watchdog process, so it is difficult to get that fix in the older versions.

tom-sato

2017-12-26 15:42

reporter  

pgpool-alice-2.conf (34,598 bytes)
pgpool-bob-2.conf (34,600 bytes)
pgpool-carol-2.conf (34,598 bytes)
ps-ax-alice.txt (52 bytes)   
ps-ax-alice.txt (52 bytes)   

tom-sato

2017-12-26 15:42

reporter   ~0001880

Hi,

Thank you for your reply.

> We have fixed similar issue in Pgpool-II 3.5 and 3.6 branches. So can you please try on the latest Pgpool-II releases for these branches to check if the issue is fixed.

I tried it in version 3.6.7, but the same problem occurs.

> As for the older versions prior to Pgpool-II 3.5, the watchdog in those versions has some known shortcomings and uses the old binary communication protocol with no dedicated watchdog process, so it is difficult to get that fix in the older versions.

OK, It can not be helped.
ps-ax-bob.txt (1,408 bytes)   
ps-ax-bob.txt (1,408 bytes)   
ps-ax-carol.txt (1,533 bytes)   
ps-ax-carol.txt (1,533 bytes)   
messages-alice-2.txt (20,122 bytes)   
messages-alice-2.txt (20,122 bytes)   
messages-bob-2.txt (145,532 bytes)   
messages-bob-2.txt (145,532 bytes)   
messages-carol-2.txt (1,505 bytes)   
messages-carol-2.txt (1,505 bytes)   

Muhammad Usama

2017-12-27 20:48

developer   ~0001883

Hi

Looking at the logs you shared in the last response, It seems that pgpool-II is successfully initiating and propagating the failover request. But I can see couple of issues which may be contributing to the system not behaving as expected.
1-) health checking is disabled on all pgpool servers, and as you are using the pgbench on a delegate-ip so only the master pgpool will be able to detect the backend failure while other pgpools will remain inactive and because of disabled health check they would not know about the backend PG server becoming unavailable.
2-) Second the bob log "messages-bob-2.txt" suggests that the pgpool is continuously busy looking for primary node after starting up, So can you verify your setup why it is not able to find the primary server. Since the log has no debug information so it is difficult to guess the cause for that.

Can you please re-run the test after enabling the health-check on all pgpool nodes and also enabling the DEBUG messages output.
If the problem persists, it would be easier to debug with DEBUG enabled.

Thanks

tom-sato

2018-01-04 14:28

reporter   ~0001886

Hello,

The reason why the problem occurred in Pgpool-II 3.6 series is that I did not correctly specify the command option of the failover_command parameter and failover was not executed successfully.

If I correctly specified the command option, the problem did not occur in both Pgpool-II 3.6.6 and 3.6.7.

In the Pgpool-II 3.3 series, problems occur even if the command option is correctly specified. It seems to be a problem specific to the old version.

Thank you.

tom-sato

2018-01-11 14:30

reporter   ~0001888

Hi,

Please close this ticket with "won't fix" status in the Pgpool-II 3.3 series.

Thanks.

Muhammad Usama

2018-01-11 21:28

developer   ~0001889

The issue is fixed in Pgpool-II 3.5, 3.6 and 3.7 branches. But unfortunately for the older versions prior to Pgpool-II 3.5, the watchdog in those versions has some known shortcomings and uses the old binary communication protocol with no dedicated watchdog process and it is not feasible to get that fix in the older versions.

Issue History

Date Modified Username Field Change
2017-09-13 16:27 tom-sato New Issue
2017-09-13 16:27 tom-sato File Added: pgpool-alice.conf
2017-09-13 16:27 tom-sato File Added: pgpool-bob.conf
2017-09-13 16:27 tom-sato File Added: pgpool-carol.conf
2017-09-13 16:27 tom-sato File Added: messages-alice.txt
2017-09-13 16:27 tom-sato File Added: messages-bob.txt
2017-09-13 16:27 tom-sato File Added: messages-carol.txt
2017-09-13 16:27 tom-sato Tag Attached: failover
2017-09-13 16:27 tom-sato Tag Attached: pgpool
2017-09-13 16:27 tom-sato Tag Attached: watchdog
2017-10-03 15:49 t-ishii Assigned To => Muhammad Usama
2017-10-03 15:49 t-ishii Status new => assigned
2017-12-14 00:23 Muhammad Usama Note Added: 0001867
2017-12-26 15:42 tom-sato File Added: pgpool-alice-2.conf
2017-12-26 15:42 tom-sato File Added: pgpool-bob-2.conf
2017-12-26 15:42 tom-sato File Added: pgpool-carol-2.conf
2017-12-26 15:42 tom-sato File Added: ps-ax-alice.txt
2017-12-26 15:42 tom-sato File Added: ps-ax-bob.txt
2017-12-26 15:42 tom-sato File Added: ps-ax-carol.txt
2017-12-26 15:42 tom-sato File Added: messages-alice-2.txt
2017-12-26 15:42 tom-sato File Added: messages-bob-2.txt
2017-12-26 15:42 tom-sato File Added: messages-carol-2.txt
2017-12-26 15:42 tom-sato Note Added: 0001880
2017-12-27 20:48 Muhammad Usama Note Added: 0001883
2018-01-04 14:28 tom-sato Note Added: 0001886
2018-01-11 14:30 tom-sato Note Added: 0001888
2018-01-11 21:28 Muhammad Usama Status assigned => closed
2018-01-11 21:28 Muhammad Usama Resolution open => won't fix
2018-01-11 21:28 Muhammad Usama Note Added: 0001889