View Issue Details

ID: 0000135
Project: Pgpool-II
Category: Bug
View Status: public
Last Update: 2015-12-16 20:38
Reporter: januszb
Assigned To: Muhammad Usama
Priority: normal
Severity: major
Reproducibility: sometimes
Status: closed
Resolution: fixed
Platform: linux
OS: centos
OS Version: 7
Summary: 0000135: Delegate IP does not get up on Standby upon Active gets disconnected
Description:
[root@ib-wawa-189 ~]# pgpool --version
pgpool-II version 3.4.2 (tataraboshi)

When the Active watchdog node gets disconnected (cable unplugged), the Standby does not react, and as a result the service becomes unavailable.
Sometimes the Standby reacts correctly and brings up the delegate IP, but often it does not. In my experience it reacts correctly only about once in every 4 attempts.
Steps To Reproduce: Unplug the cable from the Active and watch the Standby.
Additional Information: After the Active gets disconnected, the Standby enters an infinite loop, logging the following every few seconds:
May 20 13:22:30 localhost pgpool: 2015-05-20 13:22:30: pid 1229: DEBUG: watchdog heartbeat: send heartbeat signal to 192.168.10.189:9694
May 20 13:22:32 localhost pgpool[902]: [5430-1] 2015-05-20 13:22:32: pid 902: LOG: failed to create watchdog sending socket
May 20 13:22:32 localhost pgpool[902]: [5430-2] 2015-05-20 13:22:32: pid 902: DETAIL: connect() reports failure "No route to host"
May 20 13:22:32 localhost pgpool[902]: [5430-3] 2015-05-20 13:22:32: pid 902: HINT: You can safely ignore this while starting up.
May 20 13:22:32 localhost pgpool[902]: [5431-1] 2015-05-20 13:22:32: pid 902: LOG: watchdog sending packet for nodes
May 20 13:22:32 localhost pgpool[902]: [5431-2] 2015-05-20 13:22:32: pid 902: DETAIL: packet for "192.168.10.189:9000" is canceled
May 20 13:22:32 localhost pgpool: 2015-05-20 13:22:32: pid 902: LOG: failed to create watchdog sending socket
May 20 13:22:32 localhost pgpool: 2015-05-20 13:22:32: pid 902: DETAIL: connect() reports failure "No route to host"
May 20 13:22:32 localhost pgpool: 2015-05-20 13:22:32: pid 902: HINT: You can safely ignore this while starting up.
May 20 13:22:32 localhost pgpool: 2015-05-20 13:22:32: pid 902: LOG: watchdog sending packet for nodes
May 20 13:22:32 localhost pgpool: 2015-05-20 13:22:32: pid 902: DETAIL: packet for "192.168.10.189:9000" is canceled
Tags: No tags attached.

Activities

januszb

2015-05-20 23:41

reporter  

messages (253,587 bytes)   

januszb

2015-05-20 23:46

reporter   ~0000535

The attached log in the "messages" file shows the situation where the Standby does not bring up the delegate IP. At May 20 16:31:49 it notices that the Active (.189) is not reachable, but it takes no action to make the .188 watchdog active.

januszb

2015-05-21 00:50

reporter   ~0000536

Interesting: in the described scenario, the Standby node is supposed to
1. promote PG
2. bring up the delegate IP
It does neither. However, as soon as I plug the Active back into the network a few minutes later, the Standby performs the promotion!

januszb

2015-05-21 21:40

reporter   ~0000538

I believe this problem is an installer bug. In installer2-pg92-3.4.0, ./lib/pgpool.sh does the following:
for host 0
          _writePgpoolParam heartbeat_destination0 "'${PGPOOL_HOST_ARR[0]}'"
for host 1
       _writePgpoolParam heartbeat_destination0 "'${PGPOOL_HOST_ARR[0]}'"

Muhammad Usama

2015-12-16 16:43

developer   ~0000615

Hi

In heartbeat mode, the watchdog monitors the health of the other watchdog nodes by sending out periodic UDP packets. Because UDP is a connectionless protocol, the heartbeat can detect the failure of a node only by noticing the absence of heartbeat signals from it. However, lifecheck starts monitoring for missing heartbeat signals from a watchdog node only after it has received at least one heartbeat message from that node. Until the first heartbeat signal arrives, it considers the watchdog node as not started yet and therefore stays silent when it receives no heartbeat from it.

Apparently, in the cases where lifecheck fails to detect the cable unplug on the remote node, the cable was unplugged during pgpool-II startup, before the node sent its first heartbeat. The node was therefore not registered as an alive node when the cable was unplugged, and consequently the other watchdog node never reacts.
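
The behavior described above can be illustrated with a minimal sketch. This is not pgpool-II's actual code: the `PeerState` record, the `peer_status` function, and the status strings are hypothetical names chosen for illustration, and the `deadtime` argument is only loosely modeled on a heartbeat timeout. The sketch assumes a peer is reported as down only if at least one heartbeat has been seen from it before.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PeerState:
    """Hypothetical per-peer record; not pgpool-II's real data structure."""
    last_heartbeat: Optional[float] = None  # None until the first heartbeat arrives


def peer_status(peer: PeerState, now: float, deadtime: float) -> str:
    """Classify a peer the way the note describes the heartbeat lifecheck."""
    if peer.last_heartbeat is None:
        # No heartbeat seen yet: the peer counts as "not started",
        # so its silence is ignored rather than reported as a failure.
        return "not_started"
    if now - peer.last_heartbeat > deadtime:
        # Heartbeats were seen earlier and have now stopped.
        return "down"
    return "alive"


# A peer that never sent a heartbeat is never reported as down.
never_seen = PeerState()
print(peer_status(never_seen, now=1000.0, deadtime=30.0))

# A peer that sent a heartbeat and then went silent is reported as down.
seen_once = PeerState(last_heartbeat=900.0)
print(peer_status(seen_once, now=1000.0, deadtime=30.0))
```

In this sketch, a node that never receives a heartbeat from its peer stays in the "not started" branch indefinitely, which matches the symptom of a Standby that never reacts.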

Also, if you can provide the logs of both the standby and the active watchdog from when the situation happens, that would be helpful in analyzing the cause of the problem.
As a side note, we are also overhauling the watchdog and lifecheck process in pgpool-II 3.5, which is currently in beta, and hopefully that will provide a better experience and more features.

januszb

2015-12-16 18:19

reporter   ~0000616

Hi!

We have been done with this problem for a long time already. It appeared after installing pgpool using installer2-pg92-3.4.0. In ./lib/pgpool.sh the installer had:
for host 0
          _writePgpoolParam heartbeat_destination0 "'${PGPOOL_HOST_ARR[0]}'"

while I believe it should have
for host 0
          _writePgpoolParam heartbeat_destination0 "'${PGPOOL_HOST_ARR[1]}'"

After we wrote the config files ourselves, it works smoothly. Thanks!
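
The correction proposed above amounts to pointing each node's heartbeat_destination0 at its peer instead of at itself. The following is a minimal sketch of that idea for a two-node pair, not the installer's code: the helper names `heartbeat_destinations` and `render_param` are hypothetical, and the full address 192.168.10.188 is an assumption inferred from the ".188" mentioned earlier in this report.

```python
from typing import Dict, List


def heartbeat_destinations(hosts: List[str]) -> Dict[str, str]:
    """Map each host of a two-node watchdog pair to the peer it should send heartbeats to."""
    if len(hosts) != 2:
        raise ValueError("this sketch only covers a two-node watchdog pair")
    if hosts[0] == hosts[1]:
        raise ValueError("the two hosts must be different")
    # Each node targets the other node, never itself.
    return {hosts[0]: hosts[1], hosts[1]: hosts[0]}


def render_param(destination: str) -> str:
    """Render the pgpool.conf line for the first heartbeat destination."""
    return f"heartbeat_destination0 = '{destination}'"


# Assumed addresses for the two pgpool hosts (hypothetical for .188).
pair = ["192.168.10.188", "192.168.10.189"]
for host, peer in heartbeat_destinations(pair).items():
    print(f"# pgpool.conf on {host}")
    print(render_param(peer))
```

A simple sanity check on a generated configuration is that no node's heartbeat destination equals its own address.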

Muhammad Usama

2015-12-16 20:36

developer   ~0000617

Glad to hear your problem was solved. Many thanks for updating the status.

Muhammad Usama

2015-12-16 20:38

developer   ~0000618

The problem was fixed by the configuration changes.

Issue History

Date Modified Username Field Change
2015-05-20 22:02 januszb New Issue
2015-05-20 23:41 januszb File Added: messages
2015-05-20 23:46 januszb Note Added: 0000535
2015-05-21 00:50 januszb Note Added: 0000536
2015-05-21 21:40 januszb Note Added: 0000538
2015-08-04 10:19 t-ishii Assigned To => Muhammad Usama
2015-08-04 10:19 t-ishii Status new => assigned
2015-12-16 16:43 Muhammad Usama Note Added: 0000615
2015-12-16 18:19 januszb Note Added: 0000616
2015-12-16 20:36 Muhammad Usama Note Added: 0000617
2015-12-16 20:38 Muhammad Usama Note Added: 0000618
2015-12-16 20:38 Muhammad Usama Status assigned => closed
2015-12-16 20:38 Muhammad Usama Resolution open => fixed