0000536: Recovery of standby node in streaming replication mode causes all connections to be dropped - Pgpool-II Bug Tracker

ID	Project	Category	View Status	Date Submitted	Last Update

0000536	Pgpool-II	Bug	public	2019-08-09 20:11	2020-04-14 15:45

Reporter	gmartinez	Assigned To	hoshiai
Priority	normal	Severity	minor	Reproducibility	always
Status	resolved	Resolution	fixed
Platform	Linux	OS	CentOS	OS Version	7
Product Version	3.7.10

Summary	0000536: Recovery of standby node in streaming replication mode causes all connections to be dropped
Description	When we run pcp_recovery_node on a standby node while in streaming replication, it seems that pgpool drops all connections to the database now, causing errors in the application. We have load balance disabled and only use pgpool for the failover ability. Previously, we were on 3.7.7 and this behavior did not occur.
Steps To Reproduce	1) Stop postgresql on standby node 2) wait for pgpool to notice it's down 3) Start postgresql 4) Run pcp_recovery_node on the downed postgresql 5) Note that all children restarted and the frontend application has dropped connections.
Additional Information	Log events: Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 624658: LOG: node recovery, node: 4 restarted Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 624658: LOG: received failback request for node_id: 4 from pid [624658] Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597090: LOG: new IPC connection received Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597090: LOG: watchdog received the failover command from local pgpool-II on IPC interface Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597090: LOG: watchdog is processing the failover command [FAILBACK_REQUEST] received from local pgpool-II on IPC interface Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597090: LOG: The failover request does not need quorum to hold Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597090: DETAIL: proceeding with the failover Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597090: HINT: REQ_DETAIL_CONFIRMED Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597088: LOG: Pgpool-II parent process has received failover request Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 598089: WARNING: failover/failback is in progress Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 598089: DETAIL: executing failover or failback on backend Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 598089: HINT: In a moment you should be able to reconnect to the database Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597318: WARNING: failover/failback is in progress Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597318: DETAIL: executing failover or failback on backend Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597318: HINT: In a moment you should be able to reconnect to the database Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597090: LOG: new IPC connection received Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597971: WARNING: failover/failback is in progress Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597971: DETAIL: executing failover or failback on backend Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597971: HINT: In a moment you should be able to reconnect to the database Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597993: WARNING: failover/failback is in progress Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597993: DETAIL: executing failover or failback on backend Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597993: HINT: In a moment you should be able to reconnect to the database Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 598089: WARNING: failover/failback is in progress Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 598089: DETAIL: executing failover or failback on backend Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 598089: HINT: In a moment you should be able to reconnect to the database Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597090: LOG: received the failover indication from Pgpool-II on IPC interface Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597318: WARNING: failover/failback is in progress Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597318: DETAIL: executing failover or failback on backend Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597318: HINT: In a moment you should be able to reconnect to the database Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597090: LOG: watchdog is informed of failover start by the main process Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597971: WARNING: failover/failback is in progress Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597971: DETAIL: executing failover or failback on backend Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597971: HINT: In a moment you should be able to reconnect to the database Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597993: WARNING: failover/failback is in progress Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597993: DETAIL: executing failover or failback on backend Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597993: HINT: In a moment you should be able to reconnect to the database Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597088: LOG: starting fail back. reconnect host db005(5432) Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597088: LOG: Node 0 is not down (status: 2) Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597088: LOG: Do not restart children because we are failing back node id 4 host: db005 port: 5432 and we are in streaming replication mode and not all backends were down Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597088: LOG: find_primary_node_repeatedly: waiting for finding a primary node Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597088: LOG: find_primary_node: checking backend no 0 Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597088: LOG: find_primary_node: checking backend no 1 Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597088: LOG: find_primary_node: checking backend no 2 Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 598130: WARNING: failover/failback is in progress Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 598130: DETAIL: executing failover or failback on backend Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 598130: HINT: In a moment you should be able to reconnect to the database Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 598130: WARNING: failover/failback is in progress Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 598130: DETAIL: executing failover or failback on backend Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 598130: HINT: In a moment you should be able to reconnect to the database Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597088: LOG: find_primary_node: checking backend no 3 Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597088: LOG: find_primary_node: primary node id is 3 Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597088: LOG: failover: set new primary node: 3 Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597088: LOG: failover: set new master node: 0 Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597090: LOG: new IPC connection received Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597090: LOG: received the failover indication from Pgpool-II on IPC interface Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597090: LOG: watchdog is informed of failover end by the main process Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597088: LOG: failback done. reconnect host db005(5432) Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597088: LOG: start health check process for host db005(5432) Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 597088: LOG: start health check process for host db005(5432) Aug 09 10:43:54 db006 cgexec[597088]: 2019-08-09 10:43:54: pid 624658: LOG: recovery done Aug 09 10:43:55 db006 cgexec[597088]: 2019-08-09 10:43:55: pid 614305: LOG: failback event detected Aug 09 10:43:55 db006 cgexec[597088]: 2019-08-09 10:43:55: pid 614305: DETAIL: restarting myself Aug 09 10:43:55 db006 cgexec[597088]: 2019-08-09 10:43:55: pid 598356: LOG: restart request received in pcp child process Aug 09 10:43:55 db006 cgexec[597088]: 2019-08-09 10:43:55: pid 597088: LOG: PCP child 598356 exits with status 0 in failover() Aug 09 10:43:55 db006 cgexec[597088]: 2019-08-09 10:43:55: pid 597088: LOG: fork a new PCP child pid 624833 in failover() Aug 09 10:43:55 db006 cgexec[597088]: 2019-08-09 10:43:55: pid 617874: LOG: failback event detected Aug 09 10:43:55 db006 cgexec[597088]: 2019-08-09 10:43:55: pid 617874: DETAIL: restarting myself Aug 09 10:43:55 db006 cgexec[597088]: 2019-08-09 10:43:55: pid 597137: LOG: failback event detected (repeat failback many times)
Tags	No tags attached.

gmartinez 2019-08-09 20:11 reporter	pgpool.conf (8,041 bytes)

hoshiai 2019-08-22 14:31 developer ~0002796 Last edited: 2019-08-22 14:33	When pgpool execute failover or failoback, alll backend process of pgpool is restarted. But if target is standby and standby is not loadbalance, pgpool is waiting session finished before restarting process. The following log is shown, when the to restart process waited until that session finished: LOG: failback event detected DETAIL: restarting myself I saw this message in your log, so I don't understand the problem that your connection was dropped happend. Could you share all log, when this problem is happened and log_min_messages set to 'DEBUG5' ?

gmartinez 2019-08-27 07:36 reporter ~0002801	I will set the logging to DEBUG5 in our lab and try to reproduce there.

hoshiai 2019-08-27 14:35 developer ~0002803	Thank you. I wait your trying result.

gmartinez 2019-09-21 08:18 reporter ~0002873	So I was able to reproduce this behavior on my CentOS 6 lab as well and I noticed that shutting down postgresql on a standby node or re-attaching a downed node would also trigger this behavior. I did see errors in the application: Caused by: org.postgresql.util.PSQLException: This connection has been closed. and saw the same behavior I mentioned earlier in the pgpool logs. DEBUG5 logging was extremely verbose (and had query logs!), so I'm not sure how to capture the logs without exposing data while the application is running.

hoshiai 2019-11-27 16:39 developer ~0002979	Sorry, for the very very late for reply. I investigated your ploblem. In conclusion, this problem(pgpool session is reset during failover/failback) is intended behavior as pgpool community. This behavior has safety for not causing more bigger problem(i.g. segment fault) during failover/failback. This design is mainly for failover, so In the future, I would like to fix that failback(online recovery) will be not caused this behavior.

Date Modified	Username	Field	Change
2019-08-09 20:11	gmartinez	New Issue
2019-08-09 20:11	gmartinez	File Added: pgpool.conf
2019-08-15 16:32	t-ishii	Assigned To	=> hoshiai
2019-08-15 16:32	t-ishii	Status	new => assigned
2019-08-15 16:32	t-ishii	Description Updated
2019-08-15 16:32	t-ishii	Steps to Reproduce Updated
2019-08-15 16:32	t-ishii	Additional Information Updated
2019-08-22 14:31	hoshiai	Note Added: 0002796
2019-08-22 14:33	hoshiai	Note Edited: 0002796
2019-08-22 14:33	hoshiai	Status	assigned => feedback
2019-08-27 07:36	gmartinez	Note Added: 0002801
2019-08-27 07:36	gmartinez	Status	feedback => assigned
2019-08-27 14:35	hoshiai	Note Added: 0002803
2019-09-21 08:18	gmartinez	Note Added: 0002873
2019-11-27 16:39	hoshiai	Status	assigned => feedback
2019-11-27 16:39	hoshiai	Note Added: 0002979
2020-04-14 15:45	hoshiai	Status	feedback => resolved
2020-04-14 15:45	hoshiai	Resolution	open => fixed