[pgpool-general: 7512] Re: Possible pgpool 4.1.4 failover auto_failback race condition
hoshiai.takuma at nttcom.co.jp
Wed Apr 21 17:38:09 JST 2021
Thank you for the information; I understand now.
Unfortunately, pgpool itself can cause this situation. The
follow_primary_command is executed asynchronously with the failover
process, so pgpool cannot determine why a node's status is down.
If you want to avoid this on the current pgpool, you can do either of the following:
1. Do not use the auto_failback and follow_primary_command features together
2. Extend auto_failback_interval so that this problem does not occur
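For reference, a minimal pgpool.conf sketch of the two workarounds above (the interval value is illustrative, not a recommendation; pick something comfortably longer than your worst-case follow_primary_command run time, and note that in 4.1.x the parameter is named follow_master_command):

```
# Workaround 2: make the auto_failback interval longer than the
# worst-case follow_primary_command duration, so a detached node is
# not reattached while follow_primary_command is still running.
auto_failback = on
auto_failback_interval = 300   # seconds; illustrative value

# Workaround 1 (alternative): disable auto_failback entirely when
# follow_primary_command / follow_master_command is configured.
# auto_failback = off
```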
In the next version of pgpool, I will try to ensure that auto_failback
does not happen while follow_primary_command is running.
On 2021/04/18 17:00, Nathan Ward wrote:
> follow_master_command for backend 2 is not executed, because auto_failback for backend 2 happens while node 0's follow_master_command is still running.
> Note that if backend 0 follow_master_command runs quickly then this doesn’t happen.
> It happens in my case because backend 0 is not reachable, so ssh times out. I have the ssh login timeout set to 3s (the default is 30s), which leaves enough time for auto_failback to detect that backend 2 is reachable.
> If I test the failure by stopping postgres, or by setting iptables to reject connections rather than dropping packets, then backend 0 follow_master_command returns error immediately and follow_master_command for backend 1 can run before auto_failback brings it back online.
> I think that we probably want a failover to block auto_failback. I recall that auto_failback is scheduled only if in_transition (or something like that) is false - perhaps that flag is not being set, or it is being cleared? I’m not sure; I don’t know the code well enough and I’m working from memory. I almost certainly have the name (in_transition) incorrect, as I say, since I’m working from memory. I will check the code later.
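[Editorial note: the drop-versus-reject difference described above can be reproduced with commands along these lines. Interface, port, and host names are illustrative, the iptables rules require root on the test node, and the ConnectTimeout option is the standard ssh way to bound connection attempts to an unreachable host.]

```
# Simulate a silent network failure: packets are dropped, so ssh from
# the other nodes hangs until its connect timeout fires (~30s default).
iptables -A INPUT -p tcp --dport 22 -j DROP

# Simulate a hard failure instead: connections are refused at once,
# so follow_master_command's ssh fails immediately.
# iptables -A INPUT -p tcp --dport 22 -j REJECT --reject-with tcp-reset

# Lower the ssh connect timeout used inside follow_master_command so a
# dropped-packet failure is detected in 3s rather than 30s.
ssh -o ConnectTimeout=3 postgres@node0 true
```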
>> On 16/04/2021, at 4:18 PM, Takuma Hoshiai <hoshiai.takuma at nttcom.co.jp> wrote:
>> Thank you for your report.
>> Do you mean that failover_command is executed, but postgres
>> node 2 fails back while follow_master_command is running?
>> Or is follow_master_command not executed at all?
>> I will look into it. Could you share pgpool.log?
>> On 2021/04/14 14:54, Nathan Ward wrote:
>>> I believe I’ve found a race condition with auto_failback.
>>> In my test environment I have 3 servers each running both pgpool and postgres.
>>> I simulate a network failure with iptables rules on one node.
>>> I start the test with the following state:
>>> pgpool primary: node 2
>>> postgres primary: node 0
>>> When I fail node 0, in order to trigger failover with follow_master (4.1.x still), I find that most of the time node 2 is reattached before follow_master gets a chance to run for that node. It is of course set to CON_DOWN when the failover is triggered, and I would expect it to stay in that state until follow_master reattaches it.
>>> I believe, though I’m not 100% certain, that sometimes this comes from node 1.
>>> Is this likely a configuration problem, or, is this a bug of some kind? I had a quick look at the code, and don’t see any changes that would impact this since 4.1.4 - but I am of course happy to be wrong about that !
>>> We have the following set:
>>> auto_failback = on
>>> auto_failback_interval = 60
>>> Nathan Ward
>>> pgpool-general mailing list
>>> pgpool-general at pgpool.net
>> Best Regards,
>> Takuma Hoshiai <hoshiai.takuma at nttcom.co.jp>
Takuma Hoshiai <hoshiai.takuma at nttcom.co.jp>