[pgpool-general: 7515] Re: Possible pgpool 4.1.4 failover auto_failback race condition

Takuma Hoshiai hoshiai.takuma at nttcom.co.jp
Fri Apr 23 14:32:49 JST 2021


Hi,

On 2021/04/21 20:39, Nathan Ward wrote:
> 
> 
>> On 21/04/2021, at 10:49 PM, Nathan Ward <lists+pgpool at daork.net> wrote:
>>
>> Hi,
>>
>>> On 21/04/2021, at 8:38 PM, Takuma Hoshiai <hoshiai.takuma at nttcom.co.jp> wrote:
>>>
>>> 2. Extend auto_failback_interval in order not to happen this problem
>>
>> I have the auto_failback_interval set to 60s - which should mean that there is 1 minute to run through follow primary etc. before it attempts to auto_failback the node - however, it seems to run immediately.
>>
>> I can make this 5 minutes to see if there is any difference, but since the auto_failback runs right away I'm not sure it will help.
> 
> Hi,
> 
> I have done some testing of auto_failback_interval = 300. Below is some debug log from my primary pgpool server - I have removed extra lines but added timestamps so the times can be seen:
> 
> 2021-04-21T22:56:27 pid=5904 level=DETAIL:  starting to select new master node
> 2021-04-21T22:56:27 pid=5904 level=LOG:  starting degeneration. shutdown host 10.0.10.15(5433)
> 2021-04-21T22:56:27 pid=5904 level=LOG:  execute command: /usr/local/libexec/pgpool_failover 0 10.0.10.15 5433 /var/lib/pgsql/13/data 1 10.0.40.15 0 0 5433 /var/lib/pgsql/13/data 10.0.10.15 5433
> 2021-04-21T22:56:28 pid=6359 level=LOG:  execute command: /usr/local/libexec/pgpool_follow_master 0 10.0.10.15 5433 /var/lib/pgsql/13/data 1 10.0.40.15 0 0 5433 /var/lib/pgsql/13/data 10.0.10.15 5433
> 
> 2021-04-21T22:56:28 pid=6173 level=DEBUG:  health check DB node: 2 (status:3) for auto_failback
> 
> 2021-04-21T22:56:28 pid=6173 level=LOG:  request auto failback, node id:2
> 2021-04-21T22:56:28 pid=6173 level=LOG:  received failback request for node_id: 2 from pid [6173]
> 
> 2021-04-21T22:56:31 ssh: connect to host 10.0.10.15 port 22: Connection timed out      << This is output from the pgpool_follow_master command on line 4 above, which started at 2021-04-21T22:56:28 (ssh timeout is 3s)
> 
> 
> 
> My understanding of the code is that auto_failback_interval = 60 prevents auto_failback from running more than once (i.e. recovering a node more than once) within 60s. Even a single run is enough to cause this problem.
> 
> On a recently started pgpool node, the health checks run sequentially - backend 0, then 1, then 2 (the same order in which do_health_check_child is run for each node_id).
> In my test case the failing node is 1; the failover sets CON_DOWN for each backend - but then the health check (with auto_failback) for node 2 happens very shortly afterwards and recovers it from the CON_DOWN state.

Thank you for verifying the behavior. I understand now that
auto_failback_interval can't avoid this case.
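
To restate the problem for the archive: as I read the thread, the interval
is keyed to when auto_failback last ran, not to when the node went down, so
on a freshly started pgpool the very first failback is never delayed. A
minimal sketch of that logic (names here are illustrative only, not the
actual pgpool-II source):

    /* Simplified illustration of the current behaviour as discussed
     * above; function and variable names are assumptions. */
    #include <time.h>

    #define CON_DOWN 3                       /* matches "(status:3)" in the log */

    extern int  auto_failback_interval;      /* from pgpool.conf */
    extern void send_failback_request(int node_id);

    static time_t last_auto_failback = 0;    /* 0 = never ran */

    void
    health_check_auto_failback(int node_id, int backend_status,
                               int replication_ok)
    {
        time_t now = time(NULL);

        if (backend_status != CON_DOWN || !replication_ok)
            return;

        /* On a freshly started pgpool, last_auto_failback is 0, so this
         * test passes immediately - even while the failover and
         * follow_master scripts for another node are still running. */
        if (now >= last_auto_failback + auto_failback_interval)
        {
            send_failback_request(node_id);
            last_auto_failback = now;
        }
    }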


> I think the simplest solution is to make auto_failback a property on the shared BackendInfo struct for each node, and set it to now + auto_failback_interval *every* time the backend state is set to CON_DOWN. This would change the behaviour of auto_failback_interval slightly - but in a way I think users would expect: the backend would not be automatically recovered less than auto_failback_interval after it went down.
> 
> I would be happy to submit a patch for this if you would like :-)

Patches are welcome! If you send one, I will review it.
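
To make the discussion concrete, the direction you describe might look
something like the following (a rough sketch only - the field and helper
names are my assumptions, not an actual patch against pgpool-II):

    /* Rough sketch of the proposed change, not real pgpool-II code. */
    #include <time.h>

    #define CON_UP   2
    #define CON_DOWN 3

    extern int  auto_failback_interval;       /* from pgpool.conf */
    extern void send_failback_request(int node_id);

    typedef struct
    {
        int     backend_status;               /* CON_UP, CON_DOWN, ... */
        time_t  auto_failback_after;          /* earliest time failback may run */
        /* ... other existing BackendInfo members ... */
    } BackendInfo;

    /* Wherever the backend is degenerated: start the per-node clock. */
    void
    set_backend_down(BackendInfo *backend)
    {
        backend->backend_status      = CON_DOWN;
        backend->auto_failback_after = time(NULL) + auto_failback_interval;
    }

    /* In the health check: fail back only once the per-node delay has
     * passed, so a node that just went down is never recovered early. */
    void
    maybe_auto_failback(BackendInfo *backend, int node_id,
                        int replication_ok)
    {
        if (backend->backend_status == CON_DOWN &&
            replication_ok &&
            time(NULL) >= backend->auto_failback_after)
            send_failback_request(node_id);
    }

With the deadline stored per node in the shared BackendInfo, the
follow_master processing for one node would no longer race with the
health check's auto_failback for another.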

> 
> --
> Nathan Ward
> 

Best Regards,

-- 
Takuma Hoshiai <hoshiai.takuma at nttcom.co.jp>


