[pgpool-general: 8721] Re: How does pgpool handle the due-failure problem?

Tatsuo Ishii ishii at sraoss.co.jp
Thu Apr 6 16:12:30 JST 2023


Hi Zhaoxun,

> Hi Tatsuo!
> 
> Thank you for testing.
> 
> In your example, I mean what if now localhost 11002 - the old primary
> postgresql - recovers, noticing standby is down and hence starts to serve
> as the primary with data0. Later, as the old standby recovers, it must
> follow the old primary as standby, therefore loses all the data it updated
> to data1 while the old primary is down.

Well, I think that's not a technical problem but operational problem.
When all PostgreSQL goes down, the operator should carefully examine
current situation and decide how to recover from this situation.

If both PostgreSQL database are fine (for example, shutdown by
pg_ctl), then he/she just selects the newest db server (in #1 node 1)
and starts it, then run pcp_recovery_node against node 0.

But what if hardware failure case? he/she may have to recover from
existing backup if both node 0 and node 1 database disk are
gone. That's really depends.

Best reagards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp


> Best Regards,
>   Zhaoxun
> 
> On Thu, Apr 6, 2023 at 1:55 PM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> 
>> > Suppose we have two servers, under extreme circumstances two may both
>> fail.
>> > Now that we face 4 possibilities:
>> >
>> > 1) Master fail -> Standby self-promote -> Standby fail -> old Master
>> > recover ?
>> > 2) Master fail -> Standby self-promote -> Standby fail -> Standby and new
>> > Master recover?
>> > 3) Standby fail -> Master fail -> Standby Recover?
>> > 4) Standby fail -> Master fail -> Master recover?
>> >
>> > 1 and 3 are especially hazardous because the only recovered server may
>> view
>> > itself as the current master and hence lose data during its failure
>> time. I
>> > believe when only one server wakes up it should stay and wait for the
>> other
>> > server to recover before negotiating who should be the new master.
>> >
>> > Does pgpool have such a mechanism?
>>
>> For #1 yes.
>>
>> # initial state: primary and standby are up.
>> $ pcp_node_info -w -p 11001
>> localhost 11002 1 0.500000 waiting up primary primary 0 none none
>> 2023-04-06 14:37:42
>> localhost 11003 1 0.500000 waiting up standby standby 0 streaming async
>> 2023-04-06 14:37:42
>>
>> # master fail. stop the primary.
>> $ pg_ctl -D data0 stop
>> waiting for server to shut down.... done
>> server stopped
>>
>> # the primary down and the standby self-promote.
>> $ pcp_node_info -w -p 11001
>> localhost 11002 3 0.500000 down down standby unknown 0 none none
>> 2023-04-06 14:38:27
>> localhost 11003 1 0.500000 waiting up primary primary 0 none none
>> 2023-04-06 14:38:27
>>
>> # the (old) standby fail.
>> $ pg_ctl -D data1 stop
>> waiting for server to shut down.... done
>> server stopped
>> $ pcp_node_info -w -p 11001
>> pcp_node_info -w -p 11001
>> localhost 11002 3 0.500000 down down standby unknown 0 none none
>> 2023-04-06 14:38:27
>> localhost 11003 3 0.500000 down down standby unknown 0 none none
>> 2023-04-06 14:38:55
>>
>> # now pgpool does not accept any connection from clients.
>> $ psql -p 11000 test
>> psql: error: connection to server on socket "/tmp/.s.PGSQL.11000" failed:
>> ERROR:  pgpool is not accepting any new connections
>> DETAIL:  all backend nodes are down, pgpool requires at least one valid
>> node
>> HINT:  repair the backend nodes and restart pgpool
>>
>> #2 is basically same because after both the primary and the stabdby go
>>  down, pgpool won't accept connection from clients.
>>
>> For #3 and #4, I am not sure what you mean. Maybe you mean the case
>> when no failover command is configured (thus no self-promote)? If so,
>> the result is same as #1 and #2.
>>
>> Best reagards,
>> --
>> Tatsuo Ishii
>> SRA OSS LLC
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp
>>


More information about the pgpool-general mailing list