[pgpool-general: 8725] Re: How does pgpool handle the due-failure problem?

Fri Apr 7 10:55:00 JST 2023

Hi Zhaoxun,

> Hi Tatsuo!
> 
> Thank you for testing.
> 
> In your example, I mean what if now localhost 11002 - the old primary
> postgresql - recovers, noticing standby is down and hence starts to serve
> as the primary with data0.

My answer is: don't do that, because the old primary on 11002 does not
have the most recent data. You should work on recovering the
PostgreSQL on 11003 instead, as it is the only server that has the
latest data.
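
If you are ever unsure which data directory is the newer one, one way
to check is to compare the timeline IDs reported by pg_controldata. A
promoted standby switches to a new timeline, so in your scenario data1
would normally show a higher timeline ID than data0. This is only a
sketch: the helper name timeline_of is made up for illustration, and I
am assuming both servers were shut down cleanly so that the control
file reflects the latest state.

```shell
# Hypothetical helper: print the timeline ID found in pg_controldata
# output read from stdin. The "." in the pattern stands for the
# apostrophe in "checkpoint's"; "PrevTimeLineID" lines do not match.
timeline_of() {
    awk -F: '/^Latest checkpoint.s TimeLineID/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Usage with the data directories from this thread (servers stopped):
#   pg_controldata -D data0 | timeline_of
#   pg_controldata -D data1 | timeline_of
```

The directory reporting the higher timeline ID is the one that was
promoted most recently.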

For this reason I recommend having more than one standby server, so
that there is a good chance that at least one standby is still alive.
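
To keep an eye on this, you could count the standbys that pgpool still
considers usable. The following is a rough sketch, not something
pgpool provides: the helper name alive_standbys is made up, and the
field positions (5th field = pgpool's status name, 7th field = role)
are my reading of the pcp_node_info output quoted further down in this
mail, so please check them against your pgpool version.

```shell
# Hypothetical helper: count nodes whose role is "standby" and whose
# pgpool status name is not "down". Reads pcp_node_info output (one
# node per line) from stdin.
alive_standbys() {
    awk '$5 != "down" && $7 == "standby" { n++ } END { print n + 0 }'
}

# Usage against the pgpool instance in this thread:
#   pcp_node_info -w -p 11001 | alive_standbys
```

If this prints 0, losing the primary would leave no node holding the
latest data.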

> Later, as the old standby recovers, it must
> follow the old primary as standby, therefore loses all the data it updated
> to data1 while the old primary is down.
> 
> Best Regards,
>   Zhaoxun
> 
> On Thu, Apr 6, 2023 at 1:55 PM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> 
>> > Suppose we have two servers, under extreme circumstances two may both fail.
>> > Now that we face 4 possibilities:
>> >
>> > 1) Master fail -> Standby self-promote -> Standby fail -> old Master
>> > recover ?
>> > 2) Master fail -> Standby self-promote -> Standby fail -> Standby and new
>> > Master recover?
>> > 3) Standby fail -> Master fail -> Standby Recover?
>> > 4) Standby fail -> Master fail -> Master recover?
>> >
>> > 1 and 3 are especially hazardous because the only recovered server may view
>> > itself as the current master and hence lose data during its failure time. I
>> > believe when only one server wakes up it should stay and wait for the other
>> > server to recover before negotiating who should be the new master.
>> >
>> > Does pgpool have such a mechanism?
>>
>> For #1 yes.
>>
>> # initial state: primary and standby are up.
>> $ pcp_node_info -w -p 11001
>> localhost 11002 1 0.500000 waiting up primary primary 0 none none 2023-04-06 14:37:42
>> localhost 11003 1 0.500000 waiting up standby standby 0 streaming async 2023-04-06 14:37:42
>>
>> # master fail. stop the primary.
>> $ pg_ctl -D data0 stop
>> waiting for server to shut down.... done
>> server stopped
>>
>> # the primary down and the standby self-promote.
>> $ pcp_node_info -w -p 11001
>> localhost 11002 3 0.500000 down down standby unknown 0 none none 2023-04-06 14:38:27
>> localhost 11003 1 0.500000 waiting up primary primary 0 none none 2023-04-06 14:38:27
>>
>> # the (old) standby fail.
>> $ pg_ctl -D data1 stop
>> waiting for server to shut down.... done
>> server stopped
>> $ pcp_node_info -w -p 11001
>> localhost 11002 3 0.500000 down down standby unknown 0 none none 2023-04-06 14:38:27
>> localhost 11003 3 0.500000 down down standby unknown 0 none none 2023-04-06 14:38:55
>>
>> # now pgpool does not accept any connection from clients.
>> $ psql -p 11000 test
>> psql: error: connection to server on socket "/tmp/.s.PGSQL.11000" failed: ERROR:  pgpool is not accepting any new connections
>> DETAIL:  all backend nodes are down, pgpool requires at least one valid node
>> HINT:  repair the backend nodes and restart pgpool
>>
>> #2 is basically the same, because after both the primary and the standby go
>> down, pgpool won't accept connections from clients.
>>
>> For #3 and #4, I am not sure what you mean. Maybe you mean the case
>> when no failover command is configured (and thus no self-promotion)? If so,
>> the result is the same as for #1 and #2.
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS LLC
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp
>>