[pgpool-committers: 1859] pgpool: Fix worker child process keeps failing when there's no primary

Muhammad Usama m.usama at gmail.com
Tue May 6 20:08:10 JST 2014


Fix worker child process repeatedly failing when there's no primary backend.

Problem identified and fix contributed by Junegunn Choi.

From pgpool-hackers: 471
==================================================================
Hi,

I've found that the worker child process keeps failing when
there's currently no primary backend in the cluster.

We're running a streaming replication cluster with 3 nodes. Because of
a sudden H/W problem on the master server, I detached the node from
pgpool using the `pcp_detach_node` command, in order to temporarily make
the cluster read-only. pgpool then worked as expected: it rejected
write requests and still served read requests using the remaining two
standby nodes. The problem is that the worker child process, which checks
for replication lag, keeps failing and being recreated without a
pause. As a result, it generates a massive amount of log messages and
keeps creating backend connections, which in turn leaves a large
number of TIME_WAIT sockets on the system.

2014-03-27 15:33:57 LOG:   pid: 21996 fork a new worker child pid 28453
2014-03-27 15:33:57 ERROR: pid: 28453 do_query: error message from backend: recovery is in progress. Exit this session.
2014-03-27 15:33:57 LOG:   pid: 21996 worker child 28453 exits with status 256
2014-03-27 15:33:57 LOG:   pid: 21996 fork a new worker child pid 28455
2014-03-27 15:33:57 ERROR: pid: 28455 do_query: error message from backend: recovery is in progress. Exit this session.
2014-03-27 15:33:57 LOG:   pid: 21996 worker child 28455 exits with status 256
2014-03-27 15:33:57 LOG:   pid: 21996 fork a new worker child pid 28459
2014-03-27 15:33:57 ERROR: pid: 28459 do_query: error message from backend: recovery is in progress. Exit this session.
2014-03-27 15:33:57 LOG:   pid: 21996 worker child 28459 exits with status 256
...
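
As a side note on the "status 256" in the log lines above: assuming pgpool
logs the raw value returned by waitpid() (an assumption, not checked against
the source), 256 decodes to a normal exit with exit code 1. A minimal sketch
of that decoding, where `exit_code_from_status` is a hypothetical helper and
not a pgpool function:

```c
#include <sys/wait.h>

/*
 * Hypothetical helper (not part of pgpool): decode a raw wait status.
 * Returns the child's exit code, or -1 if it did not exit normally
 * (for example, if it was killed by a signal).
 */
static int exit_code_from_status(int status)
{
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Under that assumption, each worker child in the log terminated through an
ordinary exit with code 1 rather than being killed by a signal.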

Looking at the code, I found that the cause is that it invokes
`pg_current_xlog_location()` on the first remaining standby node, when
it shouldn't. That function cannot be executed on a server that is in
recovery, which is why the standby answers with "recovery is in
progress". In fact, there's no reason to check for replication lag
in such a case.

I've attached a simple patch that makes the worker child skip the
check when there's no primary node.
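
The idea can be sketched as follows. This is only an illustration under
stated assumptions, not the attached patch: the `node_role` type and the
functions `find_primary` and `check_replication_lag` are hypothetical names,
and pgpool's real data structures differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node roles; pgpool's actual backend state is richer. */
typedef enum { NODE_DOWN, NODE_STANDBY, NODE_PRIMARY } node_role;

/* Return the index of the primary node, or -1 if none is attached. */
static int find_primary(const node_role *nodes, size_t n)
{
    assert(nodes != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (nodes[i] == NODE_PRIMARY)
            return (int) i;
    return -1;
}

/*
 * One iteration of the replication lag check.
 * Returns 1 if the check ran, 0 if it was skipped.
 */
static int check_replication_lag(const node_role *nodes, size_t n)
{
    /* Guard: without a primary there is nothing to compare against. */
    if (find_primary(nodes, n) < 0)
        return 0;

    /* The real check would query the WAL location on the primary and
     * the standbys here; that part is omitted from this sketch. */
    return 1;
}
```

With the guard in place, the worker simply returns early on each cycle
instead of sending a query that a standby has to reject.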

Please take a look.
Thanks.
Junegunn Choi.

Branch
------
EXCEPTION_MGR

Details
-------
http://git.postgresql.org/gitweb?p=pgpool2.git;a=commitdiff;h=5bbdfdde6962637021079ba75232a80be95d0f2d
Author: Tatsuo Ishii <ishii at postgresql.org>

Modified Files
--------------
src/streaming_replication/pool_worker_child.c |    6 ++++++
1 file changed, 6 insertions(+)


