[pgpool-general: 9051] Re: Failure to perform failover when primary lost network connectivity

Bo Peng pengbo at sraoss.co.jp
Mon Mar 25 17:32:47 JST 2024


Hi,

> Sorry, for late reply. We managed to setup another cluster and it works 
> fine, so there is some issue with the previous cluster.

I think the issue may be caused by the configuration of "restore_command".
If you are using replication slot, you can remove the setting of "restore_command".

https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=e88bb32f5cebfdb5ff09d382034b1b3df88e92c2

> We did find some inconsistency configurations in the earlier cluster and 
> will investigate further on that. However, this lead to some comments 
> where we find, in order for pgpool + postgresql to work correctly, the 
> configuration in each nodes have to be exact mirror of each other. Is 
> this assessment correct?

Yes. All settings except pgpool_node_id are common to all nodes.
 
> After our successful failover test, we also notice, the VIP and DB 
> primary nodes does not follow each other. Is there a configuration we 
> can set so VIP will stick to DB primary node?

VIP is always assigned to leader pgpool not primary DB node.
Unfortunately, Pgpool-II doesn't provide such feature to assign VIP to primary DB node.

> On 2/27/24 8:30 AM, Bo Peng wrote:
> > Hi,
> >
> > Pgpool can handle failover even in case of network failure.
> >
> > I think this issue may be caused by the setting of follow primary script.
> >
> > Could you share the following information?
> >
> > - pgpool.conf
> > - the scripts specified in failover_command and follow_primary_command
> > - pgpool logs
> >
> > On Mon, 26 Feb 2024 10:57:56 +0800
> > Ang Sei Heng <seiheng at sendquick.com> wrote:
> >
> >> Hello,
> >>
> >> We recently embark into Pgpool-II as  failover and HA solution for our
> >> PostgreSQL database cluster. We use the example here:
> >>
> >> https://www.pgpool.net/docs/latest/en/html/example-configs.html
> >>
> >> The cluster is running smoothly and we are able to perform
> >> pcp_recovery_node, pcp_attach_node, pcp_detach_node and pcp_promote_node
> >> without issue.
> >>
> >> However, when we perform a test scenario where we need to simulate a
> >> lost network connectivity at the primary postgres database node, both
> >> standby systems went into shutdown instead of taking over the role of
> >> primary.
> >>
> >>   From the logs, we found these:
> >>
> >> 2024-02-26 09:56:55.009 +08:  pid 20700: LOG:  database system was
> >> interrupted; last known up at 2024-02-26 09:56:52 +08
> >> Warning: Permanently added 'pgnode2' (ED25519) to the list of known hosts.
> >> scp: /opt/postgres/pgpool/data/archivedir/00000039.history: No such file
> >> or directory
> >> 2024-02-26 09:56:55.445 +08:  pid 20700: LOG:  entering standby mode
> >> Warning: Permanently added 'pgnode2' (ED25519) to the list of known hosts.
> >> 2024-02-26 09:56:55.573 +08:  pid 20700: LOG:  restored log file
> >> "00000038.history" from archive
> >> Warning: Permanently added 'pgnode2' (ED25519) to the list of known hosts.
> >> scp: /opt/postgres/pgpool/data/archivedir/0000003800000000000000F2: No
> >> such file or directory
> >> 2024-02-26 09:56:55.759 +08:  pid 20700: LOG:  redo starts at 0/F2000028
> >> Warning: Permanently added 'pgnode2' (ED25519) to the list of known hosts.
> >> scp: /opt/postgres/pgpool/data/archivedir/0000003800000000000000F3: No
> >> such file or directory
> >> 2024-02-26 09:56:55.912 +08:  pid 20700: LOG:  consistent recovery state
> >> reached at 0/F2000138
> >> 2024-02-26 09:56:55.912 +08:  pid 20696: LOG:  database system is ready
> >> to accept read-only connections
> >> Warning: Permanently added 'pgnode2' (ED25519) to the list of known hosts.
> >> scp: /opt/postgres/pgpool/data/archivedir/0000003800000000000000F3: No
> >> such file or directory
> >> 2024-02-26 09:56:56.165 +08:  pid 20720: LOG:  started streaming WAL
> >> from primary at 0/F3000000 on timeline 56
> >> 2024-02-26 09:58:48.097 +08:  pid 20720: FATAL:  terminating walreceiver
> >> due to timeout
> >> ssh: connect to host pgnode2 port 22: No route to host
> >> scp: Connection closed
> >> 2024-02-26 09:58:50.684 +08:  pid 20700: FATAL:  could not restore file
> >> "00000039.history" from archive: child process exited with exit code 255
> >> 2024-02-26 09:58:50.685 +08:  pid 20696: LOG:  startup process (PID
> >> 20700) exited with exit code 1
> >> 2024-02-26 09:58:50.685 +08:  pid 20696: LOG:  terminating any other
> >> active server processes
> >> 2024-02-26 09:58:50.685 +08:  pid 20696: LOG:  shutting down due to
> >> startup process failure
> >> 2024-02-26 09:58:50.687 +08:  pid 20696: LOG:  database system is shut down
> >>
> >> Here, the pgnode2 was the primary database node, when we disconnected
> >> its network connectivity, both standby nodes showed the logs above. It
> >> seems, when the system was trying to perform failover, it tried to copy
> >> the some files from primary node, pgnode2, which no longer accessible.
> >> This cause the failover to fail and resulting the system not able to start.
> >>
> >> We are not able to run pg_ctl start in both standby nodes.
> >>
> >> Our questions here:
> >>
> >> 1) Does this means, in network failure, Pgpool-II will not be able to
> >> handle it?
> >>
> >> 2) Is it possible to disable recovery process of copying files from dead
> >> primary node in the event of network failure (for example, a sudden
> >> power failure)?
> >>
> >>
> >> Any advice is welcome.
> >>
> >>
> >>
> >> Thanks,
> >> Sei Heng
> >>
> >>
> >> -- 
> >>
> >> SendQuick Pte Ltd
> >> 76 Playfair Road, #08-01,
> >> LHK 2 Building, Singapore 367996
> >> Tel: +65 6280 2881
> >> Fax: +65 6280 6882
> >>
> >> CONFIDENTIALITY NOTE:
> >> This email and any files transmitted with it is intended only for the use
> >> of the person(s) to whom it is addressed,and may contain information that
> >> is privileged, confidential and exempt from disclosure under applicable
> >> law. If you are not the intended recipient, please immediately notify the
> >> sender and delete the email. If you are not the intended recipient please
> >> do not disclose, copy, distribute or take any action in reliance on the
> >> contents of this e-mail. Thank you.
> >>
> >> _______________________________________________
> >> pgpool-general mailing list
> >> pgpool-general at pgpool.net
> >> http://www.pgpool.net/mailman/listinfo/pgpool-general
> >>
> >> _______________________________________________
> >> pgpool-general mailing list
> >> pgpool-general at pgpool.net
> >> http://www.pgpool.net/mailman/listinfo/pgpool-general
> >
> 
> -- 
> 
> SendQuick Pte Ltd
> 76 Playfair Road, #08-01,
> LHK 2 Building, Singapore 367996
> Tel: +65 6280 2881
> Fax: +65 6280 6882
> 
> CONFIDENTIALITY NOTE:
> This email and any files transmitted with it is intended only for the use
> of the person(s) to whom it is addressed,and may contain information that
> is privileged, confidential and exempt from disclosure under applicable
> law. If you are not the intended recipient, please immediately notify the
> sender and delete the email. If you are not the intended recipient please
> do not disclose, copy, distribute or take any action in reliance on the
> contents of this e-mail. Thank you.
> 
> _______________________________________________
> pgpool-general mailing list
> pgpool-general at pgpool.net
> http://www.pgpool.net/mailman/listinfo/pgpool-general


-- 
Bo Peng <pengbo at sraoss.co.jp>
SRA OSS LLC
TEL: 03-5979-2701 FAX: 03-5979-2702
URL: https://www.sraoss.co.jp/


More information about the pgpool-general mailing list