[pgpool-general: 8719] Re: How does pgpool handle the dual-failure problem?

Thu Apr 6 14:55:45 JST 2023

> Suppose we have two servers; under extreme circumstances both may fail.
> Then we face 4 possibilities:
> 
> 1) Master fails -> Standby self-promotes -> Standby fails -> old Master
> recovers?
> 2) Master fails -> Standby self-promotes -> Standby fails -> Standby (new
> Master) recovers?
> 3) Standby fails -> Master fails -> Standby recovers?
> 4) Standby fails -> Master fails -> Master recovers?
> 
> 1 and 3 are especially hazardous because the only recovered server may
> consider itself the current master and hence lose the data written while
> it was down. I believe that when only one server comes back, it should
> wait for the other server to recover before they negotiate who should be
> the new master.
> 
> Does pgpool have such a mechanism?

For #1, yes.

# initial state: primary and standby are up.
$ pcp_node_info -w -p 11001
localhost 11002 1 0.500000 waiting up primary primary 0 none none 2023-04-06 14:37:42
localhost 11003 1 0.500000 waiting up standby standby 0 streaming async 2023-04-06 14:37:42

# master fails: stop the primary.
$ pg_ctl -D data0 stop
waiting for server to shut down.... done
server stopped

# the primary is down and the standby has promoted itself.
$ pcp_node_info -w -p 11001
localhost 11002 3 0.500000 down down standby unknown 0 none none 2023-04-06 14:38:27
localhost 11003 1 0.500000 waiting up primary primary 0 none none 2023-04-06 14:38:27

# the (old) standby, now the primary, fails.
$ pg_ctl -D data1 stop
waiting for server to shut down.... done
server stopped
$ pcp_node_info -w -p 11001
localhost 11002 3 0.500000 down down standby unknown 0 none none 2023-04-06 14:38:27
localhost 11003 3 0.500000 down down standby unknown 0 none none 2023-04-06 14:38:55

# now pgpool does not accept any connections from clients.
$ psql -p 11000 test
psql: error: connection to server on socket "/tmp/.s.PGSQL.11000" failed: ERROR:  pgpool is not accepting any new connections
DETAIL:  all backend nodes are down, pgpool requires at least one valid node
HINT:  repair the backend nodes and restart pgpool
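If you want a monitoring script to detect this "all backends down" state, the following is a minimal sketch, not something pgpool ships. `all_nodes_down` is a hypothetical shell function; it assumes the `pcp_node_info` output layout shown in the transcript above, where the third field is the numeric node status and `3` means down.

```shell
#!/bin/sh
# Hypothetical helper (not part of pgpool): reads pcp_node_info output on
# stdin and succeeds (exit status 0) only if every listed node is down.
# Assumes field 3 is the numeric node status and 3 means "down", as in the
# transcript above.
all_nodes_down() {
    awk '
        NF == 0 { next }              # skip blank lines
        { total++ }                   # count node lines
        $3 != 3 { alive++ }           # any status other than 3 means not down
        END {
            if (total > 0 && alive == 0) exit 0   # every node is down
            exit 1                                # some node is not down, or no input
        }
    '
}
```

For example: `pcp_node_info -w -p 11001 | all_nodes_down && echo "all backends are down"`.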

#2 is basically the same: once both the primary and the standby have gone
down, pgpool won't accept connections from clients.

For #3 and #4, I am not sure what you mean. Maybe you mean the case
when no failover command is configured (and thus no self-promotion)? If so,
the result is the same as for #1 and #2.
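To check whether a setup falls into the "no failover command" case, one can look at the `failover_command` parameter in pgpool.conf. The sketch below is a hypothetical helper, `failover_command_configured`, not a pgpool tool. It assumes the parameter is written on a single line as `failover_command = '...'`, treats commented-out lines as unset, and strips a trailing `#` comment; it succeeds (exit status 0) only when a non-empty command is found.

```shell
#!/bin/sh
# Hypothetical helper (not part of pgpool): reads a pgpool.conf file on stdin
# and succeeds only if failover_command is set to a non-empty value.
# Assumes the one-line form: failover_command = '...'
failover_command_configured() {
    awk '
        BEGIN { q = sprintf("%c", 39) }        # a single-quote character
        /^[ \t]*failover_command[ \t]*=/ {     # uncommented lines only
            val = $0
            sub(/^[^=]*=[ \t]*/, "", val)      # drop the parameter name and "="
            sub(/[ \t]*#.*$/, "", val)         # drop a trailing comment
            sub(/[ \t]*$/, "", val)            # drop trailing whitespace
            gsub(q, "", val)                   # drop the surrounding quotes
            cmd = val                          # a later line overrides an earlier one
        }
        END { exit (cmd == "" ? 1 : 0) }       # 0 = configured, 1 = empty or unset
    '
}
```

For example: `failover_command_configured < pgpool.conf || echo "failover_command is empty"` (the location of pgpool.conf depends on your installation).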

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese: http://www.sraoss.co.jp