[pgpool-hackers: 472] Re: Constant failure of worker child process when no primary node is found

Thu Mar 27 17:14:15 JST 2014

Good catch! Thank you for finding the problem and providing a proper
fix.  I have committed and pushed your patch to master and all
supported branches.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

> Hi,
> 
> I've found that the worker child process keeps failing whenever
> there's no primary backend in the cluster.
> 
> We're running a streaming replication cluster with 3 nodes. Because of
> a sudden H/W problem with the master server, I detached the node from
> pgpool using the `pcp_detach_node` command, in order to temporarily
> make the cluster read-only. pgpool then worked as expected: it
> rejected write requests and still served read requests using the
> remaining two standby nodes. The problem is that the worker child
> process, which checks for replication lag, keeps failing and being
> recreated without a pause. As a result, it generates a massive amount
> of log messages and keeps creating backend connections, which in turn
> leaves a large number of TIME_WAIT sockets on the system.
> 
> 2014-03-27 15:33:57 LOG:   pid: 21996 fork a new worker child pid 28453
> 2014-03-27 15:33:57 ERROR: pid: 28453 do_query: error message from backend: recovery is in progress. Exit this session.
> 2014-03-27 15:33:57 LOG:   pid: 21996 worker child 28453 exits with status 256
> 2014-03-27 15:33:57 LOG:   pid: 21996 fork a new worker child pid 28455
> 2014-03-27 15:33:57 ERROR: pid: 28455 do_query: error message from backend: recovery is in progress. Exit this session.
> 2014-03-27 15:33:57 LOG:   pid: 21996 worker child 28455 exits with status 256
> 2014-03-27 15:33:57 LOG:   pid: 21996 fork a new worker child pid 28459
> 2014-03-27 15:33:57 ERROR: pid: 28459 do_query: error message from backend: recovery is in progress. Exit this session.
> 2014-03-27 15:33:57 LOG:   pid: 21996 worker child 28459 exits with status 256
> ...
> 
> Looking at the code, I found the cause: the worker child invokes
> `pg_current_xlog_location()` on the first remaining standby node,
> which it shouldn't do. In fact, there's no reason to check for
> replication lag in such a case.
> 
> I've attached a simple patch, which makes the worker child skip the
> check when there's no primary node.
> 
> Please take a look.
> 
> Thanks.
> Junegunn Choi.
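
For readers of the archive, the behaviour described in the quoted message can be modelled roughly as below. This is only a minimal sketch, not the committed patch and not pgpool's C code: `Node`, `find_primary`, `check_replication_lag` and the `query_lag` callback are all hypothetical names introduced for illustration. The one point it tries to show is the early return when no attached primary exists, so that no primary-only query is ever sent to a standby.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence


@dataclass
class Node:
    """Hypothetical stand-in for a backend node as seen by the worker."""
    node_id: int
    is_primary: bool
    attached: bool = True


def find_primary(nodes: Sequence[Node]) -> Optional[Node]:
    """Return the attached primary node, or None if there is none."""
    for node in nodes:
        if node.attached and node.is_primary:
            return node
    return None


def check_replication_lag(
    nodes: Sequence[Node],
    query_lag: Callable[[Node, Node], int],
) -> Dict[int, int]:
    """Return the lag per attached standby, keyed by node id.

    When no primary is attached there is nothing to compare against,
    so the check is skipped and no backend is queried at all.
    """
    primary = find_primary(nodes)
    if primary is None:
        return {}  # skip the check instead of failing on a standby
    return {
        node.node_id: query_lag(primary, node)
        for node in nodes
        if node.attached and not node.is_primary
    }
```

With the primary detached, as in the report, the function returns an empty result and the `query_lag` callback is never called, so the worker has no reason to exit and be forked again.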