[pgpool-general: 7511] Re: Possible pgpool 4.1.4 failover auto_failback race condition

Sun Apr 18 17:00:32 JST 2021

Hi,

follow_master_command for backend 2 is never executed, because auto_failback reattaches backend 2 while follow_master_command for backend 0 is still running.

Note that if backend 0 follow_master_command runs quickly then this doesn’t happen.

It happens in my case because backend 0 is not reachable, so ssh has to time out. I have the ssh login timeout set to 3s (default is 30s), which is enough time for auto_failback to detect that backend 2 is reachable.
If I instead test the failure by stopping postgres, or by setting iptables to reject connections rather than drop packets, then follow_master_command for backend 0 returns an error immediately, and follow_master_command for backend 1 can run before auto_failback brings it back online.
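
To make the timing concrete, here is a toy Python model of the ordering described above. It is not pgpool code: the function name is made up, and the 1 s auto_failback check time and the 0.1 s "immediate error" time are assumptions chosen only for illustration (the 3 s value is the ssh timeout mentioned above).

```python
# Toy timeline model of the race; this is NOT pgpool code.
# All names and the example numbers below are illustrative assumptions.

def first_to_reach_backend2(backend0_follow_seconds, failback_check_seconds):
    """Return which mechanism touches backend 2 first.

    backend0_follow_seconds: how long follow_master_command for backend 0
        takes to return (e.g. the ssh timeout when the host drops packets).
    failback_check_seconds: when, after the failover starts, auto_failback
        next notices that backend 2 is reachable (assumed value).
    """
    # follow_master_command for backend 2 is assumed to start only after
    # the one for backend 0 has returned.
    if failback_check_seconds < backend0_follow_seconds:
        return "auto_failback"
    return "follow_master_command"


if __name__ == "__main__":
    scenarios = {
        "packets dropped, ssh waits for the 3s timeout": 3.0,
        "connection rejected, ssh fails at once (assumed 0.1s)": 0.1,
    }
    for label, duration in scenarios.items():
        print(label, "->", first_to_reach_backend2(duration, 1.0))
```

The model only captures the ordering: whenever the auto_failback check lands inside the window in which backend 0's command is still waiting on ssh, backend 2 is reattached before its own follow_master_command gets a turn.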

I think we probably want a failover to block auto_failback. I recall that auto_failback is scheduled only if in_transition (or something like that) is false; perhaps that flag is not being set, or it is being cleared? I’m not sure: I don’t know the code well enough, I’m working from memory, and I almost certainly have the name (in_transition) wrong. I will check the code later.
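
The kind of guard meant here could look roughly like the Python sketch below. Everything in it is hypothetical: PoolState, failover_in_progress and try_auto_failback are invented names for illustration and do not correspond to pgpool's actual data structures or to the flag whose name is uncertain above.

```python
# Hypothetical sketch of "a failover blocks auto_failback".
# None of these names come from the pgpool source.
from dataclasses import dataclass, field


@dataclass
class PoolState:
    # Set when a failover starts; cleared only after every
    # follow_master_command has finished (assumed behaviour).
    failover_in_progress: bool = False
    backend_status: dict = field(
        default_factory=lambda: {0: "up", 1: "up", 2: "up"}
    )


def try_auto_failback(state, node_id):
    """Reattach node_id only if no failover is running.

    Returns True if the node was reattached, False otherwise.
    """
    if state.failover_in_progress:
        # Leave the node down so follow_master_command can handle it.
        return False
    if state.backend_status[node_id] == "down":
        state.backend_status[node_id] = "up"
        return True
    return False
```

With a guard like this, a reachable backend 2 would stay down for as long as the flag is set, so the race can only appear if the flag is never set or is cleared before the last follow_master_command completes.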

> On 16/04/2021, at 4:18 PM, Takuma Hoshiai <hoshiai.takuma at nttcom.co.jp> wrote:
> 
> Hi,
> 
> Thank you for your report.
> 
> Do you mean that failover_command is executed, but postgres
> node 2 fails back while follow_master_command is running?
> Or is follow_master_command not executed at all?
> I will look into it. Could you share pgpool.log?
> 
> On 2021/04/14 14:54, Nathan Ward wrote:
>> Hi,
>> I believe I’ve found a race condition with auto_failback.
>> In my test environment I have 3 servers each running both pgpool and postgres.
>> I simulate a network failure with iptables rules on one node.
>> I start the test with the following state:
>> pgpool primary: node 2
>> postgres primary: node 0
>> When I fail node 0, in order to trigger failover with follow_master (4.1.x still), I find that most of the time node 2 is reattached before follow_master gets a chance to run for that node. It is of course set to CON_DOWN when the failover is triggered, and I would expect it to stay in that state until follow_master reattaches it.
>> I believe, though I’m not 100% certain, that sometimes this comes from node 1.
>> Is this likely a configuration problem, or is this a bug of some kind? I had a quick look at the code, and don’t see any changes that would impact this since 4.1.4, but I am of course happy to be wrong about that!
>> We have the following set:
>> auto_failback = on
>> auto_failback_interval = 60
>> --
>> Nathan Ward
> 
> Best Regards,
> 
> -- 
> Takuma Hoshiai <hoshiai.takuma at nttcom.co.jp>
>