[pgpool-general: 7512] Re: Possible pgpool 4.1.4 failover auto_failback race condition
hoshiai.takuma at nttcom.co.jp
Wed Apr 21 17:38:09 JST 2021
Thank you for the information; I understand now.
Unfortunately, pgpool itself can cause this situation. The
follow_primary_command is executed asynchronously with the failover
process, so pgpool cannot determine why a node's status is down.
If you want to avoid this on the current pgpool, you can do either of the following:
1. Do not use the auto_failback and follow_primary_command features together
2. Extend auto_failback_interval so that this problem does not occur
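For reference, a minimal pgpool.conf sketch of the two workarounds above (the interval value is illustrative, not a recommendation; pick something comfortably longer than your worst-case follow_primary_command run time, and note that in 4.1.x the parameter is named follow_master_command):

```
# Workaround 2: make the auto_failback interval longer than the
# worst-case follow_primary_command duration, so a detached node is
# not reattached while follow_primary_command is still running.
auto_failback = on
auto_failback_interval = 300   # seconds; illustrative value

# Workaround 1 (alternative): disable auto_failback entirely when
# follow_primary_command / follow_master_command is configured.
# auto_failback = off
```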
In the next version of pgpool, I will try to ensure that auto_failback
does not happen while follow_primary_command is running.
On 2021/04/18 17:00, Nathan Ward wrote:
> follow_master_command for backend 2 is not executed, because auto_failback for backend 2 happens while node 0's follow_master_command is still running.
> Note that if backend 0 follow_master_command runs quickly then this doesn’t happen.
> It happens in my case because backend 0 is not reachable, so ssh times out. I have the ssh login timeout set to 3s (the default is 30s), which leaves enough time for auto_failback to detect that backend 2 is reachable.
> If I test the failure by stopping postgres, or by setting iptables to reject connections rather than dropping packets, then backend 0 follow_master_command returns error immediately and follow_master_command for backend 1 can run before auto_failback brings it back online.
> I think that we probably want a failover to block auto_failback. I recall that auto_failback is scheduled only if in_transition (or something like that) is false - perhaps that flag is not being set, or it is being cleared? I’m not sure; I don’t know the code well enough and I’m working from memory. I almost certainly have the name (in_transition) incorrect, as I say, since I’m working from memory. I will check the code later.
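[Editorial note: the drop-versus-reject difference described above can be reproduced with commands along these lines. Interface, port, and host names are illustrative, the iptables rules require root on the test node, and the ConnectTimeout option is the standard ssh way to bound connection attempts to an unreachable host.]

```
# Simulate a silent network failure: packets are dropped, so ssh from
# the other nodes hangs until its connect timeout fires (~30s default).
iptables -A INPUT -p tcp --dport 22 -j DROP

# Simulate a hard failure instead: connections are refused at once,
# so follow_master_command's ssh fails immediately.
# iptables -A INPUT -p tcp --dport 22 -j REJECT --reject-with tcp-reset

# Lower the ssh connect timeout used inside follow_master_command so a
# dropped-packet failure is detected in 3s rather than 30s.
ssh -o ConnectTimeout=3 postgres@node0 true
```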
>> On 16/04/2021, at 4:18 PM, Takuma Hoshiai <hoshiai.takuma at nttcom.co.jp> wrote:
>> Thank you for your report.
>> Do you mean that failover_command is executed, but postgres
>> node 2 fails back while follow_master_command is running?
>> Or is follow_master_command not executed at all?
>> I will look into it. Could you share pgpool.log?
>> On 2021/04/14 14:54, Nathan Ward wrote:
>>> I believe I’ve found a race condition with auto_failback.
>>> In my test environment I have 3 servers each running both pgpool and postgres.
>>> I simulate a network failure with iptables rules on one node.
>>> I start the test with the following state:
>>> pgpool primary: node 2
>>> postgres primary: node 0
>>> When I fail node 0, in order to trigger failover with follow_master (4.1.x still), I find that most of the time node 2 is reattached before follow_master gets a chance to run for that node. It is of course set to CON_DOWN when the failover is triggered, and I would expect it to stay in that state until follow_master reattaches it.
>>> I believe, though I’m not 100% certain, that sometimes this comes from node 1.
>>> Is this likely a configuration problem, or, is this a bug of some kind? I had a quick look at the code, and don’t see any changes that would impact this since 4.1.4 - but I am of course happy to be wrong about that !
>>> We have the following set:
>>> auto_failback = on
>>> auto_failback_interval = 60
>>> Nathan Ward
>>> pgpool-general mailing list
>>> pgpool-general at pgpool.net
>> Best Regards,
>> Takuma Hoshiai <hoshiai.takuma at nttcom.co.jp>
Takuma Hoshiai <hoshiai.takuma at nttcom.co.jp>