[pgpool-general: 6578] Re: Follow-master problem

Tue Jun 4 09:23:24 JST 2019

Hello,

On Fri, 31 May 2019 08:33:15 +0000
Domen Šetar <domen.setar at izum.si> wrote:

> Hi,
> 
> I'm testing pgpoll installation with three servers according to documentation on the page:  http://www.pgpool.net/docs/latest/en/html/example-cluster.html
> I'm using Centos 7 with PostgreSQL 11.3 and pgpool 4.0.5.
> I have three servers: pgsql1 (primary), pgsql2 (standby), pgsql3 (standby).
> 
> I have following problem: When primary (pgsql1) goes down, first standby (pgsql2) is normaly promoted as primary, but second standby (pgsql3) doesn't follow new primary, it goes to status down instead (postgresql is actualy running normaly on pgsql3):
> 
> -bash-4.2$ psql -h 192.168.56.150 -p 9999 -U pgpool postgres -c "show pool_nodes"
> Password for user pgpool:
> node_id | hostname | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | last_status_change
> ---------+----------+------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
> 0       | pgsql1   | 5432 | down   | 0.333333  | standby | 0          | false             | 0                 | 2019-05-31 09:51:23
> 1       | pgsql2   | 5432 | up     | 0.333333  | primary | 0          | true              | 0                 | 2019-05-31 09:51:23
> 2       | pgsql3   | 5432 | down   | 0.333333  | standby | 0          | false             | 0                 | 2019-05-31 09:51:23
> 
> I found this lines in the pgpool  log file:
> 
> May 31 09:51:23 pgsql3 pgpool[4524]: [207-1] 2019-05-31 09:51:23: pid 4524: LOG:  find_primary_node: primary node is 1
> May 31 09:51:23 pgsql3 pgpool[4524]: [208-1] 2019-05-31 09:51:23: pid 4524: LOG:  find_primary_node: standby node is 2
> May 31 09:51:23 pgsql3 pgpool[4524]: [209-1] 2019-05-31 09:51:23: pid 4524: LOG:  starting follow degeneration. shutdown host pgsql1(5432)
> May 31 09:51:23 pgsql3 pgpool[4524]: [210-1] 2019-05-31 09:51:23: pid 4524: LOG:  starting follow degeneration. shutdown host pgsql3(5432)
> 
> When the follow-master comes, the script checks if postgresql on server pgsql3 is running, and because it doesn't scripte exits:
> 
> May 31 09:51:23 pgsql3 pgpool[5034]: [214-1] 2019-05-31 09:51:23: pid 5034: LOG:  execute command: /etc/pgpool-II/follow_master.sh 2 pgsql3 5432 /var/lib/pgsql/11/data 1 0 pgsql2 0 5432 /var/lib/pgsql/11/data
> May 31 09:51:23 pgsql3 root[5077]: follow_master.sh: start: pg_basebackup for 2
> May 31 09:51:23 pgsql3 root[5078]: + FAILED_NODE_ID=2
> May 31 09:51:23 pgsql3 root[5078]: + FAILED_NODE_HOST=pgsql3
> May 31 09:51:23 pgsql3 root[5078]: + FAILED_NODE_PORT=5432
> May 31 09:51:23 pgsql3 root[5078]: + FAILED_NODE_PGDATA=/var/lib/pgsql/11/data
> May 31 09:51:23 pgsql3 root[5078]: + NEW_MASTER_NODE_ID=1
> May 31 09:51:23 pgsql3 root[5078]: + OLD_MASTER_NODE_ID=0
> May 31 09:51:23 pgsql3 root[5078]: + NEW_MASTER_NODE_HOST=pgsql2
> May 31 09:51:23 pgsql3 root[5078]: + OLD_PRIMARY_NODE_ID=0
> May 31 09:51:23 pgsql3 root[5078]: + NEW_MASTER_NODE_PORT=5432
> May 31 09:51:23 pgsql3 root[5078]: + NEW_MASTER_NODE_PGDATA=/var/lib/pgsql/11/data
> May 31 09:51:23 pgsql3 root[5078]: + PGHOME=/usr/pgsql-11
> May 31 09:51:23 pgsql3 root[5078]: + ARCHIVEDIR=/var/lib/pgsql/archivedir
> May 31 09:51:23 pgsql3 root[5078]: + REPL_USER=repl
> May 31 09:51:23 pgsql3 root[5078]: + PCP_USER=pgpool
> May 31 09:51:23 pgsql3 root[5078]: + PGPOOL_PATH=/usr/bin
> May 31 09:51:23 pgsql3 root[5078]: + PCP_PORT=9898
> May 31 09:51:23 pgsql3 root[5078]: + logger -i -p local1.info follow_master.sh: start: pg_basebackup for 2
> May 31 09:51:23 pgsql3 root[5078]: + ssh -T -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null postgres at pgsql3 /usr/pgsql-11/bin/pg_ctl -w -D /var/lib/pgsql/11/data status
> May 31 09:51:23 pgsql3 root[5078]: + [[ 255 -eq 0 ]]
> May 31 09:51:23 pgsql3 root[5078]: + logger -i -p local1.info follow_master.sh: failed_nod_id=2 is not running. skipping follow master command.
> May 31 09:51:23 pgsql3 root[5082]: follow_master.sh: failed_nod_id=2 is not running. skipping follow master command.
> May 31 09:51:23 pgsql3 root[5078]: + exit 0
> 
> Postgres log file on pgsql3:
> 
> 2019-05-31 09:48:11.142 CEST [4764] LOG:  started streaming WAL from primary at 0/F000000 on timeline 1
> WARNING:  terminating connection because of crash of another server process
> DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> HINT:  In a moment you should be able to reconnect to the database and repeat your command.
> 2019-05-31 09:51:18.023 CEST [4764] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
>                 This probably means the server terminated abnormally
>                 before or while processing the request.
> Host key verification failed.
> 2019-05-31 09:51:18.156 CEST [4747] LOG:  invalid record length at 0/F000060: wanted 24, got 0
> 2019-05-31 09:51:18.160 CEST [5012] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
>                 Is the server running on host "pgsql1" (192.168.56.111) and accepting
>                 TCP/IP connections on port 5432?
> Host key verification failed.

It seems the connection was cloesd due to some problem of backend server pgsql3.
I think you should check the postgres log to see what the problem is.

> What is wrong? Can you please help me out of this situation.
> 
> Best regards!
> [izum]
> 
> Domen Šetar
> Computer Systems Support
> IZUM - Institute of Information Science | Prešernova ulica 17 | 2000 Maribor | Slovenia
> T: +386 2 25 20 339 | M: +386 41 676 342 | www.izum.si<http://www.izum.si/> | domen.setar at izum.si<mailto:domen.setar at izum.si>
> 
> 
> 

-- 
Bo Peng <pengbo at sraoss.co.jp>
SRA OSS, Inc. Japan