View Issue Details

ID: 0000624
Project: Pgpool-II
Category: Bug
View Status: public
Last Update: 2020-08-07 18:45
Reporter: wsz
Assigned To: t-ishii
Priority: normal
Severity: major
Reproducibility: always
Status: assigned
Resolution: open
Platform: Linux x64_86
OS: CentOS
OS Version: 8.2
Product Version: 4.1.2
Target Version: 4.1.3
Summary: 0000624: pcp_recovery_node reports: postmaster on hostname:"/var/run/postgresql" database:"template1" user:"postgres" failed to start ...
Description:
During execution of https://www.pgpool.net/docs/latest/en/html/tutorial-replication.html
$ pgpool_setup -s -r
...
recovery node 1...ERROR: recovery is checking if postmaster is started
DETAIL: postmaster on hostname:"/var/run/postgresql" database:"template1" user:"postgres" failed to start in 90 second

During execution of https://www.pgpool.net/docs/latest/en/html/tutorial-testing-online-recovery.html
...
$ pcp_recovery_node -p 11001 -h 127.0.0.1 -n 0
Password:
ERROR: recovery is checking if postmaster is started
DETAIL: postmaster on hostname:"/var/run/postgresql" database:"template1" user:"postgres" failed to start in 90 second
Remark: node 0 (master) is correctly recreated but not started.
Steps To Reproduce:
Install and update CentOS Linux release 8.2.2004 (Core)
Next:
dnf install https://download.postgresql.org/pub/repos/yum/reporpms/EL-8-x86_64/pgdg-redhat-repo-latest.noarch.rpm
dnf -qy module disable postgresql
dnf install postgresql12-server

Install pgpool-II-pg12-4.1.2-1pgdg.rhel8.x86_64.rpm and pgpool-II-pg12-extensions-4.1.2-1pgdg.rhel8.x86_64.rpm from your site.
Configure the environment of the postgres user by adding:
PATH=/usr/pgpool-12/bin:$PATH
export PGPOOL_INSTALL_DIR=/usr
export PGPOOLDIR=/etc/pgpool-II
export PGBIN=/usr/pgsql-12/bin
export PGLIB=/usr/pgsql-12/lib
so that pgpool_setup can run.
Add the line "wal_keep_segments = 10" in a proper place in the pgpool_setup script to allow proper creation of the secondary data1 database with streaming replication.
Execute:
$ pgpool_setup -s -r
PostgreSQL major version: 123
Starting set up in streaming replication mode
cp: cannot stat '/etc/pgpool-II/pool_hba.conf.sample': No such file or directory
creating startall and shutdownall
creating failover script
creating database cluster /var/lib/pgsql/pgpool-testws/data0...done.
update postgresql.conf
creating pgpool_remote_start
creating basebackup.sh
creating recovery.conf
creating database cluster /var/lib/pgsql/pgpool-testws/data1...done.
update postgresql.conf
creating pgpool_remote_start
creating basebackup.sh
creating recovery.conf
sed: can't read /var/lib/pgsql/pgpool-testws/etc/pool_hba.conf: No such file or directory
sed: can't read /var/lib/pgsql/pgpool-testws/etc/pool_hba.conf: No such file or directory
temporarily start data0 cluster to create extensions
temporarily start pgpool-II to create standby nodes
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | up | 0.500000 | primary | 0 | true | 0 | | | 2020-07-08 15:30:44
 1 | /var/run/postgresql | 11003 | down | 0.500000 | standby | 0 | false | 0 | | | 2020-07-08 15:30:43
(2 rows)

recovery node 1...ERROR: recovery is checking if postmaster is started
DETAIL: postmaster on hostname:"/var/run/postgresql" database:"template1" user:"postgres" failed to start in 90 second

done.
creating follow master script
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | up | 0.500000 | primary | 0 | true | 0 | | | 2020-07-08 15:30:44
 1 | /var/run/postgresql | 11003 | down | 0.500000 | standby | 0 | false | 0 | | | 2020-07-08 15:30:43
(2 rows)

shutdown all

pgpool-II setting for streaming replication mode is done.
To start the whole system, use /var/lib/pgsql/pgpool-testws/startall.
To shutdown the whole system, use /var/lib/pgsql/pgpool-testws/shutdownall.
pcp command user name is "postgres", password is "postgres".
Each PostgreSQL, pgpool-II and pcp port is as follows:
0000001 port is 11002
0000002 port is 11003
pgpool port is 11000
pcp port is 11001
The info above is in README.port.

$ cd pgpool-testws/
$ ./startall
waiting for server to start....6806 2020-07-08 15:32:55 CEST LOG: starting PostgreSQL 12.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5), 64-bit
6806 2020-07-08 15:32:55 CEST LOG: listening on IPv4 address "0.0.0.0", port 11002
6806 2020-07-08 15:32:55 CEST LOG: listening on IPv6 address "::", port 11002
6806 2020-07-08 15:32:55 CEST LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.11002"
6806 2020-07-08 15:32:55 CEST LOG: redirecting log output to logging collector process
6806 2020-07-08 15:32:55 CEST HINT: Future log output will appear in directory "log".
 done
server started
waiting for server to start....6817 2020-07-08 15:32:55 CEST LOG: starting PostgreSQL 12.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5), 64-bit
6817 2020-07-08 15:32:55 CEST LOG: listening on IPv4 address "0.0.0.0", port 11003
6817 2020-07-08 15:32:55 CEST LOG: listening on IPv6 address "::", port 11003
6817 2020-07-08 15:32:55 CEST LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.11003"
6817 2020-07-08 15:32:55 CEST LOG: redirecting log output to logging collector process
6817 2020-07-08 15:32:55 CEST HINT: Future log output will appear in directory "log".
 done
server started
$ psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | up | 0.500000 | primary | 0 | false | 0 | | | 2020-07-08 15:33:07
 1 | /var/run/postgresql | 11003 | up | 0.500000 | standby | 0 | true | 0 | streaming | async | 2020-07-08 15:33:07
(2 rows)

$ /usr/pgsql-12/bin/pg_ctl -D /var/lib/pgsql/pgpool-testws/data0 stop
waiting for server to shut down.... done
server stopped
$ psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | down | 0.500000 | standby | 0 | false | 0 | | | 2020-07-08 15:34:41
 1 | /var/run/postgresql | 11003 | up | 0.500000 | primary | 0 | true | 0 | | | 2020-07-08 15:34:41
(2 rows)

$ pcp_recovery_node -p 11001 -n 0
Password:
ERROR: connection to socket "/tmp/.s.PGSQL.11001" failed with error "No such file or directory"
$ pcp_recovery_node -p 11001 -h 127.0.0.1 -n 0
Password:
ERROR: recovery is checking if postmaster is started
DETAIL: postmaster on hostname:"/var/run/postgresql" database:"template1" user:"postgres" failed to start in 90 second


$ ls data?
data0:
PG_VERSION basebackup.sh global myrecovery.conf pg_dynshmem pg_ident.conf pg_multixact pg_replslot pg_snapshots pg_stat_tmp pg_tblspc pg_wal pgpool_remote_start postgresql.conf
base current_logfiles log pg_commit_ts pg_hba.conf pg_logical pg_notify pg_serial pg_stat pg_subtrans pg_twophase pg_xact postgresql.auto.conf postmaster.opts

data1:
PG_VERSION basebackup.sh log pg_dynshmem pg_logical pg_replslot pg_stat pg_tblspc pg_xact postgresql.auto.conf postmaster.pid
backup_label.old current_logfiles myrecovery.conf pg_hba.conf pg_multixact pg_serial pg_stat_tmp pg_twophase pg_xlog postgresql.conf standby.signal
base global pg_commit_ts pg_ident.conf pg_notify pg_snapshots pg_subtrans pg_wal pgpool_remote_start postmaster.opts
$ /usr/pgsql-12/bin/pg_ctl -D /var/lib/pgsql/pgpool-testws/data0 start
waiting for server to start....7111 2020-07-08 15:41:24 CEST LOG: starting PostgreSQL 12.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5), 64-bit
7111 2020-07-08 15:41:24 CEST LOG: listening on IPv4 address "0.0.0.0", port 11002
7111 2020-07-08 15:41:24 CEST LOG: listening on IPv6 address "::", port 11002
7111 2020-07-08 15:41:24 CEST LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.11002"
7111 2020-07-08 15:41:24 CEST LOG: redirecting log output to logging collector process
7111 2020-07-08 15:41:24 CEST HINT: Future log output will appear in directory "log".
 done
server started
$ psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | down | 0.500000 | standby | 0 | false | 0 | | | 2020-07-08 15:34:41
 1 | /var/run/postgresql | 11003 | up | 0.500000 | primary | 0 | true | 0 | | | 2020-07-08 15:34:41
(2 rows)
$ pcp_attach_node -h 127.0.0.1 -p 11001 0
Password:
pcp_attach_node -- Command Successful
[postgres@CentOS81 pgpool-testws]$ psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | up | 0.500000 | primary | 0 | true | 0 | | | 2020-07-08 15:42:14
 1 | /var/run/postgresql | 11003 | up | 0.500000 | standby | 0 | false | 56 | | | 2020-07-08 15:42:10
(2 rows)

$
Additional Information:
The pgbench command was not executed because it is not included in the rpm. CentOS was installed with LANG=pl_PL.UTF-8 but changed to en_EN.UTF-8 for the postgres user.

It would be nice if you made the introductory replication procedure work as described and adjusted the PostgreSQL file location paths to match the CentOS rpm distribution.

I will not be available from 10 July till 24 July, so I may answer your questions after this time.

Contents of logs:
"failover.log":
Wed Jul 8 15:34:38 CEST 2020
failed_node_id 0 failed_host_name /var/run/postgresql failed_port 11002 failed_db_cluster /var/lib/pgsql/pgpool-testws/data0 new_master_id 1 old_master_id 0 new_master_host_name /var/run/postgresql old_primary_node_id 0 new_master_port_number 11003 new_master_db_cluster /var/lib/pgsql/pgpool-testws/data1
/usr/pgsql-12/bin/pg_ctl -D /var/lib/pgsql/pgpool-testws/data1 promote
waiting for server to promote.... done
server promoted
Wed Jul 8 15:34:41 CEST 2020
failed_node_id 0 failed_host_name /var/run/postgresql failed_port 11002 failed_db_cluster /var/lib/pgsql/pgpool-testws/data0 new_master_id 1 old_master_id 0 new_master_host_name /var/run/postgresql old_primary_node_id 0 new_master_port_number 11003 new_master_db_cluster /var/lib/pgsql/pgpool-testws/data1
/var/lib/pgsql/pgpool-testws/data0 is not running. skipping follow master command.

"pgpool_status":
up
up

"recovery.log":

Wed Jul 8 15:30:44 CEST 2020
pg_rewind starts
pg_rewind: connected to server
pg_rewind: fatal: source and target clusters are from different systems
pg_rewind failed. Fall back to rsync
source: /var/lib/pgsql/pgpool-testws/data0 dest: /var/lib/pgsql/pgpool-testws/data1

Wed Jul 8 15:36:20 CEST 2020
pg_rewind starts
pg_rewind: connected to server
pg_rewind: servers diverged at WAL location 0/30020E0 on timeline 1
pg_rewind: no rewind required

"pgpool.log":
2020-07-08 15:32:55: pid 6825: LOG: Backend status file /var/lib/pgsql/pgpool-testws/log/pgpool_status discarded
2020-07-08 15:32:55: pid 6825: LOG: memory cache initialized
2020-07-08 15:32:55: pid 6825: DETAIL: memcache blocks :64
2020-07-08 15:32:55: pid 6825: LOG: pool_discard_oid_maps: discarded memqcache oid maps
2020-07-08 15:32:55: pid 6825: LOG: Setting up socket for ::1:11000
2020-07-08 15:32:55: pid 6825: LOG: Setting up socket for 127.0.0.1:11000
2020-07-08 15:32:55: pid 6825: LOG: find_primary_node_repeatedly: waiting for finding a primary node
2020-07-08 15:32:55: pid 6825: LOG: find_primary_node: primary node is 0
2020-07-08 15:32:55: pid 6825: LOG: find_primary_node: standby node is 1
2020-07-08 15:32:55: pid 6862: LOG: PCP process: 6862 started
2020-07-08 15:32:55: pid 6825: LOG: pgpool-II successfully started. version 4.1.2 (karasukiboshi)
2020-07-08 15:32:55: pid 6825: LOG: node status[0]: 1
2020-07-08 15:32:55: pid 6825: LOG: node status[1]: 2
2020-07-08 15:33:07: pid 6858: LOG: DB node id: 0 backend pid: 6875 statement: SELECT version()
2020-07-08 15:33:07: pid 6858: LOG: pool_reuse_block: blockid: 0
2020-07-08 15:33:07: pid 6858: CONTEXT: while searching system catalog, When relcache is missed
2020-07-08 15:33:07: pid 6858: LOG: DB node id: 0 backend pid: 6875 statement: DISCARD ALL
2020-07-08 15:33:07: pid 6858: LOG: DB node id: 1 backend pid: 6876 statement: DISCARD ALL
2020-07-08 15:34:35: pid 6864: LOG: failed to connect to PostgreSQL server by unix domain socket
2020-07-08 15:34:35: pid 6864: DETAIL: connect to "/var/run/postgresql/.s.PGSQL.11002" failed with error "No such file or directory"
2020-07-08 15:34:35: pid 6864: ERROR: failed to make persistent db connection
2020-07-08 15:34:35: pid 6864: DETAIL: connection to host:"/var/run/postgresql:11002" failed
2020-07-08 15:34:35: pid 6864: LOG: health check retrying on DB node: 0 (round:1)
2020-07-08 15:34:35: pid 6863: LOG: failed to connect to PostgreSQL server by unix domain socket
2020-07-08 15:34:35: pid 6863: DETAIL: connect to "/var/run/postgresql/.s.PGSQL.11002" failed with error "No such file or directory"
2020-07-08 15:34:35: pid 6863: ERROR: failed to make persistent db connection
2020-07-08 15:34:35: pid 6863: DETAIL: connection to host:"/var/run/postgresql:11002" failed
2020-07-08 15:34:36: pid 6864: LOG: failed to connect to PostgreSQL server by unix domain socket
2020-07-08 15:34:36: pid 6864: DETAIL: connect to "/var/run/postgresql/.s.PGSQL.11002" failed with error "No such file or directory"
2020-07-08 15:34:36: pid 6864: ERROR: failed to make persistent db connection
2020-07-08 15:34:36: pid 6864: DETAIL: connection to host:"/var/run/postgresql:11002" failed
2020-07-08 15:34:36: pid 6864: LOG: health check retrying on DB node: 0 (round:2)
2020-07-08 15:34:37: pid 6864: LOG: failed to connect to PostgreSQL server by unix domain socket
2020-07-08 15:34:37: pid 6864: DETAIL: connect to "/var/run/postgresql/.s.PGSQL.11002" failed with error "No such file or directory"
2020-07-08 15:34:37: pid 6864: ERROR: failed to make persistent db connection
2020-07-08 15:34:37: pid 6864: DETAIL: connection to host:"/var/run/postgresql:11002" failed
2020-07-08 15:34:37: pid 6864: LOG: health check retrying on DB node: 0 (round:3)
2020-07-08 15:34:38: pid 6864: LOG: failed to connect to PostgreSQL server by unix domain socket
2020-07-08 15:34:38: pid 6864: DETAIL: connect to "/var/run/postgresql/.s.PGSQL.11002" failed with error "No such file or directory"
2020-07-08 15:34:38: pid 6864: ERROR: failed to make persistent db connection
2020-07-08 15:34:38: pid 6864: DETAIL: connection to host:"/var/run/postgresql:11002" failed
2020-07-08 15:34:38: pid 6864: LOG: health check failed on node 0 (timeout:0)
2020-07-08 15:34:38: pid 6864: LOG: received degenerate backend request for node_id: 0 from pid [6864]
2020-07-08 15:34:38: pid 6825: LOG: Pgpool-II parent process has received failover request
2020-07-08 15:34:38: pid 6825: LOG: starting degeneration. shutdown host /var/run/postgresql(11002)
2020-07-08 15:34:38: pid 6825: LOG: Restart all children
2020-07-08 15:34:38: pid 6825: LOG: execute command: /var/lib/pgsql/pgpool-testws/etc/failover.sh 0 /var/run/postgresql 11002 /var/lib/pgsql/pgpool-testws/data0 1 0 /var/run/postgresql 0 11003 /var/lib/pgsql/pgpool-testws/data1
2020-07-08 15:34:41: pid 6825: LOG: find_primary_node_repeatedly: waiting for finding a primary node
2020-07-08 15:34:41: pid 6825: LOG: find_primary_node: primary node is 1
2020-07-08 15:34:41: pid 6825: LOG: starting follow degeneration. shutdown host /var/run/postgresql(11002)
2020-07-08 15:34:41: pid 6825: LOG: failover: 1 follow backends have been degenerated
2020-07-08 15:34:41: pid 6825: LOG: failover: set new primary node: 1
2020-07-08 15:34:41: pid 6825: LOG: failover: set new master node: 1
2020-07-08 15:34:41: pid 6926: LOG: start triggering follow command.
2020-07-08 15:34:41: pid 6926: LOG: execute command: /var/lib/pgsql/pgpool-testws/etc/follow_master.sh 0 /var/run/postgresql 11002 /var/lib/pgsql/pgpool-testws/data0 1 0 /var/run/postgresql 0 11003 /var/lib/pgsql/pgpool-testws/data1
failover done. shutdown host /var/run/postgresql(11002)2020-07-08 15:34:41: pid 6863: ERROR: Failed to check replication time lag
2020-07-08 15:34:41: pid 6863: DETAIL: No persistent db connection for the node 0
2020-07-08 15:34:41: pid 6863: HINT: check sr_check_user and sr_check_password
2020-07-08 15:34:41: pid 6863: CONTEXT: while checking replication time lag
2020-07-08 15:34:41: pid 6863: LOG: worker process received restart request
2020-07-08 15:34:41: pid 6825: LOG: failover done. shutdown host /var/run/postgresql(11002)
2020-07-08 15:34:41: pid 6961: LOG: DB node id: 1 backend pid: 6963 statement: SELECT version()
2020-07-08 15:34:41: pid 6961: LOG: DB node id: 1 backend pid: 6963 statement: DISCARD ALL
2020-07-08 15:34:42: pid 6862: LOG: restart request received in pcp child process
2020-07-08 15:34:42: pid 6825: LOG: PCP child 6862 exits with status 0 in failover()
2020-07-08 15:34:42: pid 6825: LOG: fork a new PCP child pid 6964 in failover()
2020-07-08 15:34:42: pid 6825: LOG: child process with pid: 6828 exits with status 256
2020-07-08 15:34:42: pid 6825: LOG: child process with pid: 6828 exited with success and will not be restarted
2020-07-08 15:34:42: pid 6825: LOG: child process with pid: 6829 exits with status 256
2020-07-08 15:34:42: pid 6825: LOG: child process with pid: 6829 exited with success and will not be restarted
2020-07-08 15:34:42: pid 6825: LOG: child process with pid: 6830 exits with status 256
2020-07-08 15:34:42: pid 6825: LOG: child process with pid: 6830 exited with success and will not be restarted
2020-07-08 15:34:42: pid 6964: LOG: PCP process: 6964 started
2020-07-08 15:34:42: pid 6825: LOG: child process with pid: 6831 exits with status 256
2020-07-08 15:34:42: pid 6825: LOG: child process with pid: 6831 exited with success and will not be restarted
...
2020-07-08 15:34:42: pid 6825: LOG: child process with pid: 6857 exits with status 256
2020-07-08 15:34:42: pid 6825: LOG: child process with pid: 6857 exited with success and will not be restarted
2020-07-08 15:34:42: pid 6825: LOG: child process with pid: 6858 exits with status 256
2020-07-08 15:34:42: pid 6825: LOG: child process with pid: 6858 exited with success and will not be restarted
2020-07-08 15:34:42: pid 6825: LOG: child process with pid: 6859 exits with status 256
...
2020-07-08 15:34:42: pid 6825: LOG: worker child process with pid: 6863 exits with status 256
2020-07-08 15:34:42: pid 6825: LOG: fork a new worker child process with pid: 6965
2020-07-08 15:34:42: pid 6825: LOG: child process with pid: 6926 exits with status 0
2020-07-08 15:34:42: pid 6825: LOG: child process with pid: 6926 exited with success and will not be restarted
2020-07-08 15:36:20: pid 6964: LOG: forked new pcp worker, pid=7002 socket=8
2020-07-08 15:36:20: pid 7002: LOG: starting recovering node 0
2020-07-08 15:36:20: pid 7002: LOG: executing recovery
2020-07-08 15:36:20: pid 7002: DETAIL: starting recovery command: "SELECT pgpool_recovery('basebackup.sh', 'localhost', '/var/lib/pgsql/pgpool-testws/data0', '11003', 0, '11002')"
2020-07-08 15:36:20: pid 7002: LOG: executing recovery
2020-07-08 15:36:20: pid 7002: DETAIL: disabling statement_timeout
2020-07-08 15:36:20: pid 7002: LOG: node recovery, 1st stage is done
2020-07-08 15:36:20: pid 7002: LOG: checking if postmaster is started
2020-07-08 15:36:20: pid 7002: DETAIL: trying to connect to postmaster on hostname:/var/run/postgresql database:postgres user:postgres (retry 0 times)
2020-07-08 15:36:20: pid 7002: LOG: checking if postmaster is started
2020-07-08 15:36:20: pid 7002: DETAIL: failed to connect to postmaster on hostname:/var/run/postgresql database:postgres user:postgres
2020-07-08 15:36:23: pid 7002: LOG: checking if postmaster is started
2020-07-08 15:36:23: pid 7002: DETAIL: trying to connect to postmaster on hostname:/var/run/postgresql database:postgres user:postgres (retry 1 times)
2020-07-08 15:36:23: pid 7002: LOG: checking if postmaster is started
2020-07-08 15:36:23: pid 7002: DETAIL: failed to connect to postmaster on hostname:/var/run/postgresql database:postgres user:postgres
2020-07-08 15:36:26: pid 7002: LOG: checking if postmaster is started
2020-07-08 15:36:26: pid 7002: DETAIL: trying to connect to postmaster on hostname:/var/run/postgresql database:postgres user:postgres (retry 2 times)
2020-07-08 15:36:26: pid 7002: LOG: checking if postmaster is started
2020-07-08 15:36:26: pid 7002: DETAIL: failed to connect to postmaster on hostname:/var/run/postgresql database:postgres user:postgres
2020-07-08 15:36:29: pid 7002: LOG: checking if postmaster is started
2020-07-08 15:36:29: pid 7002: DETAIL: trying to connect to postmaster on hostname:/var/run/postgresql database:postgres user:postgres (retry 3 times)
2020-07-08 15:36:29: pid 7002: LOG: checking if postmaster is started
2020-07-08 15:36:29: pid 7002: DETAIL: failed to connect to postmaster on hostname:/var/run/postgresql database:postgres user:postgres
2020-07-08 15:36:32: pid 7002: LOG: checking if postmaster is started
2020-07-08 15:36:32: pid 7002: DETAIL: trying to connect to postmaster on hostname:/var/run/postgresql database:template1 user:postgres (retry 0 times)
2020-07-08 15:36:32: pid 7002: LOG: checking if postmaster is started
2020-07-08 15:36:32: pid 7002: DETAIL: failed to connect to postmaster on hostname:/var/run/postgresql database:template1 user:postgres
2020-07-08 15:36:35: pid 7002: LOG: checking if postmaster is started
2020-07-08 15:36:35: pid 7002: DETAIL: trying to connect to postmaster on hostname:/var/run/postgresql database:template1 user:postgres (retry 1 times)
2020-07-08 15:36:35: pid 7002: LOG: checking if postmaster is started
2020-07-08 15:36:35: pid 7002: DETAIL: failed to connect to postmaster on hostname:/var/run/postgresql database:template1 user:postgres
2020-07-08 15:36:38: pid 7002: LOG: checking if postmaster is started
2020-07-08 15:36:38: pid 7002: DETAIL: trying to connect to postmaster on hostname:/var/run/postgresql database:template1 user:postgres (retry 2 times)
...
2020-07-08 15:38:02: pid 7002: LOG: checking if postmaster is started
2020-07-08 15:38:02: pid 7002: DETAIL: trying to connect to postmaster on hostname:/var/run/postgresql database:template1 user:postgres (retry 30 times)
2020-07-08 15:38:02: pid 7002: LOG: checking if postmaster is started
2020-07-08 15:38:02: pid 7002: DETAIL: failed to connect to postmaster on hostname:/var/run/postgresql database:template1 user:postgres
2020-07-08 15:38:05: pid 7002: ERROR: recovery is checking if postmaster is started
2020-07-08 15:38:05: pid 7002: DETAIL: postmaster on hostname:"/var/run/postgresql" database:"template1" user:"postgres" failed to start in 90 second
2020-07-08 15:38:05: pid 6964: LOG: PCP process with pid: 7002 exit with SUCCESS.
2020-07-08 15:38:05: pid 6964: LOG: PCP process with pid: 7002 exits with status 0
2020-07-08 15:39:41: pid 6825: LOG: child process with pid: 6961 exits with status 256
2020-07-08 15:39:41: pid 6825: LOG: fork a new child process with pid: 7070
2020-07-08 15:41:33: pid 6960: LOG: DB node id: 1 backend pid: 7123 statement: SELECT version()
2020-07-08 15:41:33: pid 6960: LOG: DB node id: 1 backend pid: 7123 statement: DISCARD ALL
2020-07-08 15:42:10: pid 6964: LOG: forked new pcp worker, pid=7137 socket=8
2020-07-08 15:42:10: pid 7137: LOG: received failback request for node_id: 0 from pid [7137]
2020-07-08 15:42:10: pid 6825: LOG: Pgpool-II parent process has received failover request
2020-07-08 15:42:10: pid 6825: LOG: starting fail back. reconnect host /var/run/postgresql(11002)
2020-07-08 15:42:10: pid 6825: LOG: Node 1 is not down (status: 2)
2020-07-08 15:42:10: pid 6964: LOG: PCP process with pid: 7137 exit with SUCCESS.
2020-07-08 15:42:10: pid 6964: LOG: PCP process with pid: 7137 exits with status 0
2020-07-08 15:42:10: pid 6825: LOG: Do not restart children because we are failing back node id 0 host: /var/run/postgresql port: 11002 and we are in streaming replication mode and not all backends were down
2020-07-08 15:42:10: pid 6825: LOG: find_primary_node_repeatedly: waiting for finding a primary node
2020-07-08 15:42:10: pid 6825: LOG: find_primary_node: primary node is 0
2020-07-08 15:42:10: pid 6825: LOG: failover: set new primary node: 0
2020-07-08 15:42:10: pid 6825: LOG: failover: set new master node: 0
2020-07-08 15:42:10: pid 6825: LOG: failback done. reconnect host /var/run/postgresql(11002)
2020-07-08 15:42:10: pid 6965: LOG: worker process received restart request
2020-07-08 15:42:11: pid 6964: LOG: restart request received in pcp child process
2020-07-08 15:42:11: pid 6825: LOG: PCP child 6964 exits with status 0 in failover()
2020-07-08 15:42:11: pid 6825: LOG: fork a new PCP child pid 7140 in failover()
2020-07-08 15:42:11: pid 6825: LOG: worker child process with pid: 6965 exits with status 256
2020-07-08 15:42:11: pid 7140: LOG: PCP process: 7140 started
2020-07-08 15:42:11: pid 6825: LOG: fork a new worker child process with pid: 7141
2020-07-08 15:42:11: pid 7141: LOG: get_query_result falied: status: -2
2020-07-08 15:42:11: pid 7141: CONTEXT: while checking replication time lag
2020-07-08 15:42:14: pid 6960: LOG: failover or failback event detected
2020-07-08 15:42:14: pid 6960: DETAIL: restarting myself
2020-07-08 15:42:14: pid 6959: LOG: failover or failback event detected
2020-07-08 15:42:14: pid 6959: DETAIL: restarting myself
2020-07-08 15:42:14: pid 7070: LOG: selecting backend connection
2020-07-08 15:42:14: pid 7070: DETAIL: failover or failback event detected, discarding existing connections
2020-07-08 15:42:14: pid 6825: LOG: child process with pid: 6959 exits with status 256
2020-07-08 15:42:14: pid 6825: LOG: fork a new child process with pid: 7147
2020-07-08 15:42:14: pid 6825: LOG: child process with pid: 6960 exits with status 256
2020-07-08 15:42:14: pid 6825: LOG: fork a new child process with pid: 7148
2020-07-08 15:42:14: pid 7070: LOG: DB node id: 0 backend pid: 7145 statement: SELECT version()
2020-07-08 15:42:14: pid 7070: LOG: DB node id: 0 backend pid: 7145 statement: DISCARD ALL
2020-07-08 15:42:21: pid 7141: LOG: get_query_result falied: status: -2
2020-07-08 15:42:21: pid 7141: CONTEXT: while checking replication time lag
2020-07-08 15:42:31: pid 7141: LOG: get_query_result falied: status: -2
2020-07-08 15:42:31: pid 7141: CONTEXT: while checking replication time lag
2020-07-08 15:42:42: pid 7141: LOG: get_query_result falied: status: -2
...
Tags: failover, master slave, pcp commands, pcp_recovery_node

Activities

t-ishii

2020-07-09 11:47

developer   ~0003429

First of all, we recommend installing the Pgpool-II rpms from here:
https://pgpool.net/mediawiki/index.php/Yum_Repository
because they are maintained by active developers of Pgpool-II.

Anyway,

> Add the line "wal_keep_segments = 10" in a proper place in the pgpool_setup script to allow proper creation of the secondary data1 database with streaming replication.
How exactly did you modify pgpool_setup? Please share the diff.
Also, please share the PostgreSQL log under data0. During recovery, Pgpool-II accesses the primary PostgreSQL, so it is important to analyze that log.

wsz

2020-07-09 17:28

reporter   ~0003430

I added "wal_keep_segments = 10" so that your pgpool_setup script, run with the "-s -r" options, could successfully complete establishing the standby database. Without this parameter the first WAL segment was removed from the master's pg_wal and the standby server could not start. I added this line as a temporary solution WITHOUT any detailed analysis of how it will behave with other options, so please keep this in mind:
$ diff /usr/bin/pgpool_setup pgpool_setup-ws
321a322
> echo "wal_keep_segments = 10" >> $PGCONF
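For reference, the one-line diff above amounts to appending the setting to the freshly created postgresql.conf; as a sketch (the wrapper function name is mine for illustration, and $PGCONF is the variable pgpool_setup uses internally for the conf path):

```shell
# Sketch of the workaround: append wal_keep_segments to a newly created
# postgresql.conf so early WAL segments are retained for the standby.
# (The helper name is illustrative; pgpool_setup itself just runs the echo.)
add_wal_keep_segments() {
  conf="$1"
  echo "wal_keep_segments = 10" >> "$conf"
}

# Example: add_wal_keep_segments "$PGCONF"
```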

I downloaded the 4.1.2-1 rpms using the page "https://www.pgpool.net/mediawiki/index.php/Downloads" and the link for CentOS/RHEL 8, navigating from the pgpool.net homepage. Please note that "https://pgpool.net/mediawiki/index.php/Yum_Repository" does not offer rpms for CentOS 8, and the older version 4.1-1 is listed for CentOS 7.

Please make your web pages more consistent.

Before using the rpms from www.pgpool.net I got the same results using the rpms from https://download.postgresql.org/pub/repos/yum/12/redhat/rhel-8.2-x86_64/ . I got the same errors as described. To verify that this is not a PostgreSQL distribution problem, I removed their rpms and installed yours from pgpool.net.

I will not be available from tomorrow until 24 July, which is why I described a scenario that can easily be reproduced with fresh CentOS and Pgpool-II installations, and attached the logs from the data0 and data1 databases.
data0-Wednesday.log (131,200 bytes)
data1-Wednesday.log (90,150 bytes)

t-ishii

2020-07-10 09:59

developer   ~0003431

I misunderstood and thought you were not using our Pgpool-II RPMs. Sorry for this. Regarding our yum repository page, I have forwarded your request to Pengbo, who is the maintainer of that web page.

I have looked into the log closely and found that the postmaster of data0 actually had started as a standby server. However, pgpool did not recognize it because the recovery process failed to connect to the postmaster. Interestingly, pgpool first tried to connect to the postgres database and failed, then tried the template1 database but could not connect either. This is pretty strange, and I suspect there is something specific to your environment (CentOS 8.2). I will try to reproduce the issue when I have spare time.

t-ishii

2020-07-19 20:21

developer   ~0003432

I have successfully run pgpool_setup on CentOS 8.2. I installed Pgpool-II from source code. The one thing I had to do was:
export PGSOCKET_DIR=/var/run/postgresql
before running pgpool_setup.
Since the default Unix socket directory is /tmp, you need to adjust it to the default of the PostgreSQL package.
I didn't have time to try with the Pgpool-II rpm, though.
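The fix above can be sketched as a pre-step to pgpool_setup (paths as reported in this thread; whether the override is needed depends on how your PostgreSQL package was built):

```shell
# PGDG packages build PostgreSQL with /var/run/postgresql as the Unix
# socket directory, while pgpool_setup defaults to /tmp; export the
# override before generating the test clusters.
export PGSOCKET_DIR=/var/run/postgresql
# then run: pgpool_setup -s -r
```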

wsz

2020-07-29 18:41

reporter   ~0003444

Please note that the problem was reported against the "pcp_recovery_node" command. I previously used the PGSOCKET_DIR variable, but the final results were the same as with this variable unset.
I repeated the test with the following variables set in ~/.bash_profile:
#export PGPOOL_INSTALL_DIR=/usr/pgpool-12
export PGPOOL_INSTALL_DIR=/usr # PGPools original download
#export PGPOOLDIR=/etc/pgpool-II-12
export PGPOOLDIR=/etc/pgpool-II # PGPools original download
export PGBIN=/usr/pgsql-12/bin
export PGLIB=/usr/pgsql-12/lib
export PGSOCKET_DIR=/var/run/postgresql

and the following variables are set and visible in a session:
$ set |grep PG
PGBIN=/usr/pgsql-12/bin
PGDATA=/var/lib/pgsql/12/data
PGLIB=/usr/pgsql-12/lib
PGPOOLDIR=/etc/pgpool-II
PGPOOL_INSTALL_DIR=/usr
PGSOCKET_DIR=/var/run/postgresql

Now I execute commands:
$ pcp_recovery_node -p 11001 -n 0
Password:
ERROR: connection to socket "/tmp/.s.PGSQL.11001" failed with error "No such file or directory"
IT FAILS immediately. Why does it look for the socket in the /tmp directory?

NEXT I execute with -h option and it works till the moment of checking connection to 11001:
$ pcp_recovery_node -p 11001 -h 127.0.0.1 -n 0
Password:
ERROR: recovery is checking if postmaster is started
DETAIL: postmaster on hostname:"/var/run/postgresql" database:"template1" user:"postgres" failed to start in 90 second

This command does its job but FAILS to START the data0 Postgres (data1 is running). This is the primary problem.
Another one: why does it report a socket problem instead of a TCP timeout when the -h option is used?

Parts of pgpool.log:
 ...
2020-07-29 11:15:52: pid 2263: LOG: starting recovering node 0
2020-07-29 11:15:52: pid 2263: LOG: executing recovery
2020-07-29 11:15:52: pid 2263: DETAIL: starting recovery command: "SELECT pgpool_recovery('basebackup.sh', 'localhost', '/var/lib/pgsql/pgpool-testws/data0', '11003', 0, '11002')"
2020-07-29 11:15:52: pid 2263: LOG: executing recovery
2020-07-29 11:15:52: pid 2263: DETAIL: disabling statement_timeout
2020-07-29 11:15:52: pid 2263: LOG: node recovery, 1st stage is done
2020-07-29 11:15:52: pid 2263: LOG: checking if postmaster is started
2020-07-29 11:15:52: pid 2263: DETAIL: trying to connect to postmaster on hostname:/var/run/postgresql database:postgres user:postgres (retry 0 times)
2020-07-29 11:15:52: pid 2263: LOG: checking if postmaster is started
2020-07-29 11:15:52: pid 2263: DETAIL: failed to connect to postmaster on hostname:/var/run/postgresql database:postgres user:postgres
2020-07-29 11:15:55: pid 2263: LOG: checking if postmaster is started
2020-07-29 11:15:55: pid 2263: DETAIL: trying to connect to postmaster on hostname:/var/run/postgresql database:postgres user:postgres (retry 1 times)
2020-07-29 11:15:55: pid 2263: LOG: checking if postmaster is started
2020-07-29 11:15:55: pid 2263: DETAIL: failed to connect to postmaster on hostname:/var/run/postgresql database:postgres user:postgres
2020-07-29 11:15:58: pid 2263: LOG: checking if postmaster is started
2020-07-29 11:15:58: pid 2263: DETAIL: trying to connect to postmaster on hostname:/var/run/postgresql database:postgres user:postgres (retry 2 times)
...
2020-07-29 11:17:34: pid 2263: DETAIL: trying to connect to postmaster on hostname:/var/run/postgresql database:template1 user:postgres (retry 30 times)
2020-07-29 11:17:34: pid 2263: LOG: checking if postmaster is started
2020-07-29 11:17:34: pid 2263: DETAIL: failed to connect to postmaster on hostname:/var/run/postgresql database:template1 user:postgres
2020-07-29 11:17:37: pid 2263: ERROR: recovery is checking if postmaster is started
2020-07-29 11:17:37: pid 2263: DETAIL: postmaster on hostname:"/var/run/postgresql" database:"template1" user:"postgres" failed to start in 90 second
...

wsz

2020-07-29 18:54

reporter   ~0003445

The line:
"NEXT I execute with -h option and it works till the moment of checking connection to 11001:"
is incorrect. It should be:
"NEXT I execute with -h option and it works till the moment of checking connection to database template1 (data0 probably):"

t-ishii

2020-07-30 17:40

developer   ~0003446

> ERROR: connection to socket "/tmp/.s.PGSQL.11001" failed with error "No such file or directory"
> IT FAILS immediately. Why does it look for the socket in the /tmp directory?
Yeah, currently the pcp clients unconditionally try to connect to the socket in /tmp. Probably the RPM should patch this.

> ERROR: recovery is checking if postmaster is started
> DETAIL: postmaster on hostname:"/var/run/postgresql" database:"template1" user:"postgres" failed to start in 90 second
I suspect the postmaster did not start. After online recovery succeeds, Pgpool-II tries to start the postmaster using the pgpool_remote_start script in data0. The script uses ssh, which requires that the postgres user can ssh to localhost without a password, i.e. you should be able to run something like this without being asked for a password:

ssh localhost ls

Can you do that?
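For reference, one common way to set this up for the postgres user is the following (a generic recipe, not taken from the Pgpool-II docs; the key type and paths are conventional choices):

```shell
# Generate a passwordless key for the postgres user (skip if one exists).
ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519
# Authorize the key for logins to the same account (localhost).
cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys
# sshd refuses keys with loose permissions, so tighten them.
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
# Verify: this should print a directory listing without any password prompt.
ssh localhost ls
```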

wsz

2020-07-31 23:09

reporter   ~0003447

It started to work with default settings after establishing passwordless ssh access to localhost. This is not obvious for a single-server installation and is not described in the installation manual (only in the user manual). Maybe someone could add it to the pgpool_setup section of the installation manual and to its manual page?

wsz

2020-08-01 00:10

reporter   ~0003448

Unfortunately, switching back to a working master data0 / slave data1 configuration does not work correctly when executing these commands:
pgpool_setup -s -r
./startall
/usr/pgsql-12/bin/pg_ctl -D /var/lib/pgsql/pgpool-testws/data0 stop
pcp_recovery_node -p 11001 -h 127.0.0.1 -n 0
data0 and data1 are not in sync afterwards.
Copy of the screen:
[postgres@CentOS81 pgpool-testws]$ pgpool_setup -s -r
PostgreSQL major version: 123
Starting set up in streaming replication mode
cp: cannot stat '/etc/pgpool-II/pool_hba.conf.sample': No such file or directory
creating startall and shutdownall
creating failover script
creating database cluster /var/lib/pgsql/pgpool-testws/data0...done.
update postgresql.conf
creating pgpool_remote_start
creating basebackup.sh
creating recovery.conf
creating database cluster /var/lib/pgsql/pgpool-testws/data1...done.
update postgresql.conf
creating pgpool_remote_start
creating basebackup.sh
creating recovery.conf
sed: can't read /var/lib/pgsql/pgpool-testws/etc/pool_hba.conf: No such file or directory
sed: can't read /var/lib/pgsql/pgpool-testws/etc/pool_hba.conf: No such file or directory
temporarily start data0 cluster to create extensions
temporarily start pgpool-II to create standby nodes
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | up | 0.500000 | primary | 0 | true | 0 | | | 2020-07-31 16:45:04
 1 | /var/run/postgresql | 11003 | down | 0.500000 | standby | 0 | false | 0 | | | 2020-07-31 16:45:03
(2 rows)

recovery node 1...pcp_recovery_node -- Command Successful
done.
creating follow master script
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | up | 0.500000 | primary | 0 | false | 0 | | | 2020-07-31 16:45:04
 1 | /var/run/postgresql | 11003 | up | 0.500000 | standby | 0 | true | 0 | | | 2020-07-31 16:45:08
(2 rows)

shutdown all

pgpool-II setting for streaming replication mode is done.
To start the whole system, use /var/lib/pgsql/pgpool-testws/startall.
To shutdown the whole system, use /var/lib/pgsql/pgpool-testws/shutdownall.
pcp command user name is "postgres", password is "postgres".
Each PostgreSQL, pgpool-II and pcp port is as follows:
0000001 port is 11002
0000002 port is 11003
pgpool port is 11000
pcp port is 11001
The info above is in README.port.
[postgres@CentOS81 pgpool-testws]$ ./startall
waiting for server to start....5825 2020-07-31 16:45:18 CEST LOG: starting PostgreSQL 12.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5), 64-bit
5825 2020-07-31 16:45:18 CEST LOG: listening on IPv4 address "0.0.0.0", port 11002
5825 2020-07-31 16:45:18 CEST LOG: listening on IPv6 address "::", port 11002
5825 2020-07-31 16:45:18 CEST LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.11002"
5825 2020-07-31 16:45:18 CEST LOG: redirecting log output to logging collector process
5825 2020-07-31 16:45:18 CEST HINT: Future log output will appear in directory "log".
 done
server started
waiting for server to start....5836 2020-07-31 16:45:19 CEST LOG: starting PostgreSQL 12.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5), 64-bit
5836 2020-07-31 16:45:19 CEST LOG: listening on IPv4 address "0.0.0.0", port 11003
5836 2020-07-31 16:45:19 CEST LOG: listening on IPv6 address "::", port 11003
5836 2020-07-31 16:45:19 CEST LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.11003"
5836 2020-07-31 16:45:19 CEST LOG: redirecting log output to logging collector process
5836 2020-07-31 16:45:19 CEST HINT: Future log output will appear in directory "log".
 done
server started
[postgres@CentOS81 pgpool-testws]$ psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | up | 0.500000 | primary | 0 | false | 0 | | | 2020-07-31 16:45:52
 1 | /var/run/postgresql | 11003 | up | 0.500000 | standby | 0 | true | 0 | streaming | async | 2020-07-31 16:45:52
(2 rows)

[postgres@CentOS81 pgpool-testws]$ /usr/pgsql-12/bin/pg_ctl -D /var/lib/pgsql/pgpool-testws/data0 stop
waiting for server to shut down.... done
server stopped
[postgres@CentOS81 pgpool-testws]$ psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | down | 0.500000 | standby | 0 | false | 0 | | | 2020-07-31 16:46:14
 1 | /var/run/postgresql | 11003 | up | 0.500000 | primary | 0 | true | 0 | | | 2020-07-31 16:46:14
(2 rows)

[postgres@CentOS81 pgpool-testws]$ pcp_recovery_node -p 11001 -h 127.0.0.1 -n 0
Password:
pcp_recovery_node -- Command Successful
[postgres@CentOS81 pgpool-testws]$ psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | up | 0.500000 | primary | 0 | false | 0 | | | 2020-07-31 16:46:53
 1 | /var/run/postgresql | 11003 | up | 0.500000 | standby | 0 | true | 0 | | | 2020-07-31 16:46:47
(2 rows)

[postgres@CentOS81 pgpool-testws]$ psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | up | 0.500000 | primary | 0 | false | 0 | | | 2020-07-31 16:46:53
 1 | /var/run/postgresql | 11003 | up | 0.500000 | standby | 0 | true | 56 | | | 2020-07-31 16:46:47
(2 rows)

[postgres@CentOS81 pgpool-testws]$ ls data?
data0:
PG_VERSION current_logfiles myrecovery.conf pg_hba.conf pg_multixact pg_serial pg_stat_tmp pg_twophase pgpool_remote_start postmaster.opts
base global pg_commit_ts pg_ident.conf pg_notify pg_snapshots pg_subtrans pg_wal postgresql.auto.conf postmaster.pid
basebackup.sh log pg_dynshmem pg_logical pg_replslot pg_stat pg_tblspc pg_xact postgresql.conf

data1:
PG_VERSION basebackup.sh log pg_dynshmem pg_logical pg_replslot pg_stat pg_tblspc pg_xact postgresql.auto.conf postmaster.pid
backup_label.old current_logfiles myrecovery.conf pg_hba.conf pg_multixact pg_serial pg_stat_tmp pg_twophase pg_xlog postgresql.conf standby.signal
base global pg_commit_ts pg_ident.conf pg_notify pg_snapshots pg_subtrans pg_wal pgpool_remote_start postmaster.opts
[postgres@CentOS81 pgpool-testws]$ vi data1/log/Friday.log
[postgres@CentOS81 pgpool-testws]$ ls -l data?/pg_wal
data0/pg_wal:
total 32768
-rw-------. 1 postgres postgres 16777216 Jul 31 16:47 000000010000000000000003
-rw-------. 1 postgres postgres 16777216 Jul 31 16:45 000000010000000000000004
drwx------. 2 postgres postgres 6 Jul 31 16:45 archive_status

data1/pg_wal:
total 49156
-rw-------. 1 postgres postgres 16777216 Jul 31 16:46 000000010000000000000003
-rw-------. 1 postgres postgres 41 Jul 31 16:46 00000002.history
-rw-------. 1 postgres postgres 16777216 Jul 31 16:46 000000020000000000000003
-rw-------. 1 postgres postgres 16777216 Jul 31 16:45 000000020000000000000004
drwx------. 2 postgres postgres 6 Jul 31 16:45 archive_status
[postgres@CentOS81 pgpool-testws]$

Logs included as attachments
Friday-pgpool.log (46,273 bytes)
data0-Friday.log (37,834 bytes)
data1-Friday.log (29,247 bytes)

t-ishii

2020-08-04 08:21

developer   ~0003451

> Unfortunately, switching back to a working master data0 / slave data1 configuration does not work correctly when executing these commands:
What do you mean by "does not work correctly"? Can you elaborate?

wsz

2020-08-04 17:24

reporter   ~0003452

After executing the command "pcp_recovery_node -p 11001 -h 127.0.0.1 -n 0", data0 is switched to master and data1 to slave, but the process is incorrect because data1 is on timeline 2 and data0 is on timeline 1, so replication is broken. See the data1 log after restarting both databases (nothing was done since my last post):
1637 2020-08-04 10:02:21 CEST LOG: database system was shut down at 2020-07-31 17:12:39 CEST
1637 2020-08-04 10:02:21 CEST LOG: entering standby mode
1637 2020-08-04 10:02:21 CEST LOG: consistent recovery state reached at 0/3002238
1637 2020-08-04 10:02:21 CEST LOG: invalid record length at 0/3002238: wanted 24, got 0
1635 2020-08-04 10:02:21 CEST LOG: database system is ready to accept read only connections
1641 2020-08-04 10:02:21 CEST FATAL: highest timeline 1 of the primary is behind recovery timeline 2
1643 2020-08-04 10:02:21 CEST FATAL: highest timeline 1 of the primary is behind recovery timeline 2
1681 2020-08-04 10:02:21 CEST LOG: statement: SELECT pg_is_in_recovery()
1681 2020-08-04 10:02:21 CEST LOG: statement: SELECT current_setting('server_version_num')
1689 2020-08-04 10:02:21 CEST LOG: statement: SELECT current_setting('server_version_num')
1689 2020-08-04 10:02:21 CEST LOG: statement: SELECT pg_last_wal_replay_lsn()
1689 2020-08-04 10:02:21 CEST LOG: statement: SELECT pg_is_in_recovery()
1690 2020-08-04 10:02:26 CEST FATAL: highest timeline 1 of the primary is behind recovery timeline 2
1693 2020-08-04 10:02:31 CEST FATAL: highest timeline 1 of the primary is behind recovery timeline 2
...

pg_wal directories contents:
$ ls -l data?/pg_wal
data0/pg_wal:
total 32768
-rw-------. 1 postgres postgres 16777216 Aug 4 10:07 000000010000000000000003
-rw-------. 1 postgres postgres 16777216 Jul 31 16:45 000000010000000000000004
drwx------. 2 postgres postgres 6 Jul 31 16:45 archive_status

data1/pg_wal:
total 49156
-rw-------. 1 postgres postgres 16777216 Jul 31 16:46 000000010000000000000003
-rw-------. 1 postgres postgres 41 Jul 31 16:46 00000002.history
-rw-------. 1 postgres postgres 16777216 Jul 31 17:12 000000020000000000000003
-rw-------. 1 postgres postgres 16777216 Jul 31 16:45 000000020000000000000004
drwx------. 2 postgres postgres 6 Jul 31 16:45 archive_status
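
As a side note, the timeline divergence can also be confirmed directly with pg_controldata (assuming the PostgreSQL 12 package path and the pgpool_setup cluster directories used in this report):

```shell
# Print the latest checkpoint timeline of each cluster; diverged clusters
# will show different TimeLineIDs. Paths assume this report's setup.
for d in data0 data1; do
    echo "== $d =="
    /usr/pgsql-12/bin/pg_controldata "/var/lib/pgsql/pgpool-testws/$d" \
        | grep "Latest checkpoint's TimeLineID"
done
```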

Additionally, from my post above, the command psql -p 11000 -c "show pool_nodes" test shows the state of replication as follows.
Status just after executing the "pgpool_setup -s -r" command:
0 | /var/run/postgresql | 11002 | up | 0.500000 | primary | 0 | false | 0 | | | 2020-07-31 16:45:52
1 | /var/run/postgresql | 11003 | up | 0.500000 | standby | 0 | true | 0 | streaming | async | 2020-07-31 16:45:52

and after executing command "pcp_recovery_node -p 11001 -h 127.0.0.1 -n 0":
0 | /var/run/postgresql | 11002 | up | 0.500000 | primary | 0 | false | 0 | | | 2020-07-31 16:46:53
1 | /var/run/postgresql | 11003 | up | 0.500000 | standby | 0 | true | 0 | | | 2020-07-31 16:46:47
There is no "streaming | async" reported.

t-ishii

2020-08-04 21:39

developer   ~0003453

Last edited: 2020-08-04 21:40

I was able to reproduce the problem. It seems the issue happens only with "-r" (use pg_rewind for recovery).
In log/recovery.log:
Tue Aug 4 21:34:40 JST 2020
pg_rewind starts
pg_rewind: connected to server
pg_rewind: fatal: source and target clusters are from different systems
pg_rewind failed. Fall back to rsync
source: /home/t-ishii/work/Pgpool-II/current/c/data0 dest: /home/t-ishii/work/Pgpool-II/current/c/data1

Tue Aug 4 21:35:32 JST 2020
pg_rewind starts
pg_rewind: connected to server
pg_rewind: servers diverged at WAL location 0/50000A0 on timeline 1
pg_rewind: no rewind required

So it seems there is something wrong with the usage of pg_rewind in the data[01]/basebackup.sh script generated by pgpool_setup. I will look into this.

t-ishii

2020-08-05 09:26

developer   ~0003454

It turned out that the problem we are seeing is caused by a bug in pgpool_setup.
When pgpool_setup uses pg_rewind, it did not create the standby.signal file on the target node (node 0); instead it created the file on the primary node (node 1). As a result, the recovered node booted up as a primary node rather than a standby node.
When pg_rewind is not used, the standby.signal file created on node 1 is rsynced to node 0, so the node booted up as a standby node.

I attached the diff file to fix the problem. Can you please try it out?
pgpool_setup.diff (518 bytes)   

wsz

2020-08-05 18:35

reporter   ~0003455

After applying the patch (to pgpool_setup from the rpm distribution) it is better, but still not OK.
See the session log with my comments in capital letters:
$ psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | up | 0.500000 | primary | 0 | false | 0 | | | 2020-08-05 11:15:22
 1 | /var/run/postgresql | 11003 | up | 0.500000 | standby | 0 | true | 0 | streaming | async | 2020-08-05 11:15:22
(2 rows)
# TILL NOW OK.
$ /usr/pgsql-12/bin/pg_ctl -D /var/lib/pgsql/pgpool-testws/data0 stop
waiting for server to shut down.... done
server stopped
$ psql -p 11000 -c "show pool_nodes" test
psql: error: could not connect to server: FATAL: failed to create a backend connection
DETAIL: executing failover on backend
# TILL NOW OK.
$ psql -p 11000 -c "show pool_nodes" test
psql: error: could not connect to server: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
# I DO NOT LIKE SUCH MESSAGES EVEN IF THEY ARE TEMPORARY.
$ psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | down | 0.500000 | standby | 0 | false | 0 | | | 2020-08-05 11:15:38
 1 | /var/run/postgresql | 11003 | up | 0.500000 | primary | 0 | true | 0 | | | 2020-08-05 11:15:38
(2 rows)

# TILL NOW OK.
$ pcp_recovery_node -p 11001 -h 127.0.0.1 -n 0
Password:
pcp_recovery_node -- Command Successful
$ psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | up | 0.500000 | standby | 0 | false | 0 | streaming | async | 2020-08-05 11:16:53
 1 | /var/run/postgresql | 11003 | up | 0.500000 | primary | 0 | true | 0 | | | 2020-08-05 11:15:38
(2 rows)

# TILL NOW LOOKS OK BUT ...

$ psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | up | 0.500000 | standby | 0 | true | 0 | streaming | async | 2020-08-05 11:16:53
 1 | /var/run/postgresql | 11003 | up | 0.500000 | primary | 0 | false | 0 | | | 2020-08-05 11:15:38
(2 rows)

# IS LOAD_BALANCE_MODE CHANGED/UPDATED WITH DELAY? BUT IT SEEMS TO BE OK.
$ ./shutdownall
2020-08-05 11:17:57: pid 30490: LOG: stop request sent to pgpool. waiting for termination...
.done.
waiting for server to shut down.... done
server stopped
waiting for server to shut down.... done
server stopped
#UP TO THIS MOMENT!
$ ./startall
waiting for server to start....30507 2020-08-05 11:18:03 CEST LOG: starting PostgreSQL 12.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5), 64-bit
30507 2020-08-05 11:18:03 CEST LOG: listening on IPv4 address "0.0.0.0", port 11002
30507 2020-08-05 11:18:03 CEST LOG: listening on IPv6 address "::", port 11002
30507 2020-08-05 11:18:03 CEST LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.11002"
30507 2020-08-05 11:18:03 CEST LOG: redirecting log output to logging collector process
30507 2020-08-05 11:18:03 CEST HINT: Future log output will appear in directory "log".
 done
server started
waiting for server to start....30517 2020-08-05 11:18:03 CEST LOG: starting PostgreSQL 12.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5), 64-bit
30517 2020-08-05 11:18:03 CEST LOG: listening on IPv4 address "0.0.0.0", port 11003
30517 2020-08-05 11:18:03 CEST LOG: listening on IPv6 address "::", port 11003
30517 2020-08-05 11:18:03 CEST LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.11003"
30517 2020-08-05 11:18:03 CEST LOG: redirecting log output to logging collector process
30517 2020-08-05 11:18:03 CEST HINT: Future log output will appear in directory "log".
 done
server started
$ psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | up | 0.500000 | standby | 0 | false | 0 | | | 2020-08-05 11:18:07
 1 | /var/run/postgresql | 11003 | up | 0.500000 | standby | 0 | true | 0 | | | 2020-08-05 11:18:07
(2 rows)

]$ ls -l data?/standby.signal
-rw-------. 1 postgres postgres 0 Aug 5 11:16 data0/standby.signal
-rw-------. 1 postgres postgres 0 Aug 5 11:16 data1/standby.signal
$
# WE ENDED WITH TWO STANDBY DATABASES AFTER RESTART. Another patch is welcome.

t-ishii

2020-08-05 19:07

developer   ~0003456

Oops. The previous patch forgot to remove standby.signal after recovery. Attached is a patch to fix that.
It should be applied on top of the previous patch.
pgpool_setup_v2.diff (355 bytes)   

wsz

2020-08-05 21:46

reporter   ~0003457

Now it works. But in the patched "pgpool_setup" there are probably lines with duplicated functionality:
lines 427, 428:
touch standby.signal
touch $DEST_CLUSTER/standby.signal

line 453:
rm standby.signal - why not rm $DEST_CLUSTER/standby.signal ?

t-ishii

2020-08-06 09:30

developer   ~0003458

Last edited: 2020-08-06 09:31

> Now it works. But in patched "pgpool_setup" there are probably lines with duplicated functionality:
> lines 427, 428:
> touch standby.signal
> touch $DEST_CLUSTER/standby.signal
They are both necessary. The former is needed when pg_rewind is not used or fails; in that case rsync will copy standby.signal to $DEST_CLUSTER. The latter is used when pg_rewind succeeds.
Keeping only the latter is not good, because when rsync is used it removes the file.

> line 453:
> rm standby.signal - why not rm $DEST_CLUSTER/standby.signal ?
If you do that, $DEST_CLUSTER will start up as a primary node.

Thinking about it more, I realized that I can reduce some of the code by putting:
touch $DEST_CLUSTER/standby.signal
at the end of the generated script.
Attached is the patch for this. It should be applied on top of pgpool_setup.diff (not pgpool_setup_v2.diff).
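
The standby.signal handling being discussed can be sketched as follows (a hypothetical shell simulation, not the actual pgpool_setup code; pg_rewind and rsync are replaced by stand-ins, and only the signal-file placement logic is modeled):

```shell
#!/bin/sh
# Hypothetical simulation of the standby.signal handling described above.
set -e
base=$(mktemp -d)
SRC_CLUSTER="$base/data1"    # stands in for the current primary
DEST_CLUSTER="$base/data0"   # stands in for the node being recovered
mkdir "$SRC_CLUSTER" "$DEST_CLUSTER"
echo 12 > "$SRC_CLUSTER/PG_VERSION"

pg_rewind_ok=false           # assume pg_rewind failed on this run
if $pg_rewind_ok; then
    : # pg_rewind would have rewound $DEST_CLUSTER in place
else
    # Fall back to a full copy from the primary (stands in for rsync).
    rm -rf "$DEST_CLUSTER"
    cp -r "$SRC_CLUSTER" "$DEST_CLUSTER"
fi

# The v3 simplification: unconditionally mark the recovered node as a
# standby at the end, so it never boots up as a second primary.
touch "$DEST_CLUSTER/standby.signal"
ls "$DEST_CLUSTER"
```

The point of the v3 change is that the touch happens once, at the end, regardless of which recovery path ran.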

pgpool_setup_v3.diff (734 bytes)   

wsz

2020-08-07 00:12

reporter   ~0003459

It works OK with the original pgpool_setup once the pgpool_setup.diff and pgpool_setup_v3.diff patches are applied.

Just for testing, I disabled sockets in the data0 and data1 configurations (useful for installations where the application server is on a separate host from Pgpool-II and Postgres) and adjusted backend_hostname[01] in pgpool.conf to use an IP address instead of sockets. After updating basebackup.sh and myrecovery.conf (the file and the related part of basebackup.sh), adding "host=127.0.0.1" to:

$pg_rewind -P -D $DEST_CLUSTER --source-server="host=127.0.0.1 port=$PORT user=$PGSUPERUSER dbname=postgres" >> $log 2>&1
primary_conninfo = 'host=127.0.0.1 port=$PORT user=$PGSUPERUSER application_name=''server$recovery_node'''

It worked well.
Of course, 127.0.0.1 should be replaced with values taken from variables in a real implementation. What about such an extension of pgpool_setup and the generated basebackup.sh and myrecovery.conf scripts? It would be more universal.

t-ishii

2020-08-07 12:14

developer   ~0003460

Thanks. Patch committed.

> Of course, 127.0.0.1 should be replaced with values taken from variables in a real implementation.
In addition to this, every line including $DEST should be adjusted, e.g.

rm -fr $DEST_CLUSTER/pg_xlog
should be changed to:
ssh hostname rm -fr $DEST_CLUSTER/pg_xlog

> What about such extension of pgpool_setup and created basebackup.sh, myrecovery.conf scripts? It will be more universal.
But we already have more production ready examples:
http://localhost/~t-ishii/pgpool-II/html/example-cluster.html#EXAMPLE-CLUSTER-PGPOOL-CONFIG

What's the benefit of extending pgpool_setup to produce a production-ready example?

wsz

2020-08-07 17:40

reporter   ~0003461

It is probably https://www.pgpool.net/docs/latest/en/html/examples.html (not on your localhost).
I can't see an example of the basebackup.sh script there.

Another question: who is responsible for maintaining the rpm distributions from https://www.pgpool.net/mediawiki/index.php/Downloads or, e.g., from the CentOS distro?
Just to correct bugs like:
$ pcp_recovery_node -p 11001 -n 0
Password:
ERROR: connection to socket "/tmp/.s.PGSQL.11001" failed with error "No such file or directory"

This one could probably be fixed by detecting the OS during rpm creation from sources.

t-ishii

2020-08-07 18:45

developer   ~0003462

Oops. Yes, it is:
https://www.pgpool.net/docs/latest/en/html/examples.html
In the doc, basebackup.sh is called "recovery_1st_stage".

> This one could probably be fixed by detecting the OS during rpm creation from sources.
It's already fixed:
https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=d7c7578148f0026bbe439f8920c62cb809206466
It will be included in the next RPM release (supposed to be released on August 20).

Issue History

Date Modified Username Field Change
2020-07-08 23:50 wsz New Issue
2020-07-08 23:50 wsz Tag Attached: failover
2020-07-08 23:50 wsz Tag Attached: master slave
2020-07-08 23:50 wsz Tag Attached: pcp commands
2020-07-08 23:50 wsz Tag Attached: pcp_recovery_node
2020-07-09 11:47 t-ishii Note Added: 0003429
2020-07-09 11:48 t-ishii Assigned To => t-ishii
2020-07-09 11:48 t-ishii Status new => feedback
2020-07-09 17:28 wsz File Added: data0-Wednesday.log
2020-07-09 17:28 wsz File Added: data1-Wednesday.log
2020-07-09 17:28 wsz Note Added: 0003430
2020-07-09 17:28 wsz Status feedback => assigned
2020-07-10 09:59 t-ishii Note Added: 0003431
2020-07-19 20:21 t-ishii Note Added: 0003432
2020-07-19 20:22 t-ishii Status assigned => feedback
2020-07-29 18:41 wsz Note Added: 0003444
2020-07-29 18:41 wsz Status feedback => assigned
2020-07-29 18:54 wsz Note Added: 0003445
2020-07-30 17:40 t-ishii Note Added: 0003446
2020-07-30 17:41 t-ishii Status assigned => feedback
2020-07-31 23:09 wsz Note Added: 0003447
2020-07-31 23:09 wsz Status feedback => assigned
2020-08-01 00:10 wsz File Added: Friday-pgpool.log
2020-08-01 00:10 wsz File Added: data0-Friday.log
2020-08-01 00:10 wsz File Added: data1-Friday.log
2020-08-01 00:10 wsz Note Added: 0003448
2020-08-04 08:21 t-ishii Note Added: 0003451
2020-08-04 08:21 t-ishii Status assigned => feedback
2020-08-04 17:24 wsz Note Added: 0003452
2020-08-04 17:24 wsz Status feedback => assigned
2020-08-04 21:39 t-ishii Note Added: 0003453
2020-08-04 21:40 t-ishii Note Edited: 0003453
2020-08-05 09:26 t-ishii File Added: pgpool_setup.diff
2020-08-05 09:26 t-ishii Note Added: 0003454
2020-08-05 09:26 t-ishii Status assigned => feedback
2020-08-05 10:53 t-ishii Target Version => 4.1.3
2020-08-05 18:35 wsz Note Added: 0003455
2020-08-05 18:35 wsz Status feedback => assigned
2020-08-05 19:07 t-ishii File Added: pgpool_setup_v2.diff
2020-08-05 19:07 t-ishii Note Added: 0003456
2020-08-05 21:46 wsz Note Added: 0003457
2020-08-06 09:30 t-ishii File Added: pgpool_setup_v3.diff
2020-08-06 09:30 t-ishii Note Added: 0003458
2020-08-06 09:31 t-ishii Note Edited: 0003458
2020-08-07 00:12 wsz Note Added: 0003459
2020-08-07 12:14 t-ishii Note Added: 0003460
2020-08-07 17:40 wsz Note Added: 0003461
2020-08-07 18:45 t-ishii Note Added: 0003462