[pgpool-general: 8720] Re: How does pgpool handle the due-failure problem?

Thu Apr 6 15:22:04 JST 2023

Hi Tatsuo!

Thank you for testing.

In your example, I mean what if now localhost 11002 - the old primary
postgresql - recovers, noticing standby is down and hence starts to serve
as the primary with data0. Later, as the old standby recovers, it must
follow the old primary as standby, therefore loses all the data it updated
to data1 while the old primary is down.

Best Regards,
  Zhaoxun

On Thu, Apr 6, 2023 at 1:55 PM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:

> > Suppose we have two servers, under extreme circumstances two may both
> fail.
> > Now that we face 4 possibilities:
> >
> > 1) Master fail -> Standby self-promote -> Standby fail -> old Master
> > recover ?
> > 2) Master fail -> Standby self-promote -> Standby fail -> Standby and new
> > Master recover?
> > 3) Standby fail -> Master fail -> Standby Recover?
> > 4) Standby fail -> Master fail -> Master recover?
> >
> > 1 and 3 are especially hazardous because the only recovered server may
> view
> > itself as the current master and hence lose data during its failure
> time. I
> > believe when only one server wakes up it should stay and wait for the
> other
> > server to recover before negotiating who should be the new master.
> >
> > Does pgpool have such a mechanism?
>
> For #1 yes.
>
> # initial state: primary and standby are up.
> $ pcp_node_info -w -p 11001
> localhost 11002 1 0.500000 waiting up primary primary 0 none none
> 2023-04-06 14:37:42
> localhost 11003 1 0.500000 waiting up standby standby 0 streaming async
> 2023-04-06 14:37:42
>
> # master fail. stop the primary.
> $ pg_ctl -D data0 stop
> waiting for server to shut down.... done
> server stopped
>
> # the primary down and the standby self-promote.
> $ pcp_node_info -w -p 11001
> localhost 11002 3 0.500000 down down standby unknown 0 none none
> 2023-04-06 14:38:27
> localhost 11003 1 0.500000 waiting up primary primary 0 none none
> 2023-04-06 14:38:27
>
> # the (old) standby fail.
> $ pg_ctl -D data1 stop
> waiting for server to shut down.... done
> server stopped
> $ pcp_node_info -w -p 11001
> pcp_node_info -w -p 11001
> localhost 11002 3 0.500000 down down standby unknown 0 none none
> 2023-04-06 14:38:27
> localhost 11003 3 0.500000 down down standby unknown 0 none none
> 2023-04-06 14:38:55
>
> # now pgpool does not accept any connection from clients.
> $ psql -p 11000 test
> psql: error: connection to server on socket "/tmp/.s.PGSQL.11000" failed:
> ERROR:  pgpool is not accepting any new connections
> DETAIL:  all backend nodes are down, pgpool requires at least one valid
> node
> HINT:  repair the backend nodes and restart pgpool
>
> #2 is basically same because after both the primary and the stabdby go
>  down, pgpool won't accept connection from clients.
>
> For #3 and #4, I am not sure what you mean. Maybe you mean the case
> when no failover command is configured (thus no self-promote)? If so,
> the result is same as #1 and #2.
>
> Best reagards,
> --
> Tatsuo Ishii
> SRA OSS LLC
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20230406/55eaaa0b/attachment.htm>