View Issue Details

ID: 0000624
Project: Pgpool-II
Category: Bug
View Status: public
Last Update: 2020-08-07 18:45
Reporter: wsz
Assigned To: t-ishii
Priority: normal
Severity: major
Reproducibility: always
Status: assigned
Resolution: open
Platform: Linux x64_86
OS: CentOS
OS Version: 8.2
Product Version: 4.1.2
Target Version: 4.1.3
Summary: 0000624: pcp_recovery_node reports: postmaster on hostname:"/var/run/postgresql" database:"template1" user:"postgres" failed to start ...
Description:
During execution of https://www.pgpool.net/docs/latest/en/html/tutorial-replication.html
$ pgpool_setup -s -r
...
recovery node 1...ERROR: recovery is checking if postmaster is started
DETAIL: postmaster on hostname:"/var/run/postgresql" database:"template1" user:"postgres" failed to start in 90 second

During execution of https://www.pgpool.net/docs/latest/en/html/tutorial-testing-online-recovery.html
...
$ pcp_recovery_node -p 11001 -h 127.0.0.1 -n 0
Password:
ERROR: recovery is checking if postmaster is started
DETAIL: postmaster on hostname:"/var/run/postgresql" database:"template1" user:"postgres" failed to start in 90 second
Remark: node 0 (master) is correctly recreated but not started.
Steps To Reproduce:
Install and update CentOS Linux release 8.2.2004 (Core)
Next:
dnf install https://download.postgresql.org/pub/repos/yum/reporpms/EL-8-x86_64/pgdg-redhat-repo-latest.noarch.rpm
dnf -qy module disable postgresql
dnf install postgresql12-server

Install pgpool-II-pg12-4.1.2-1pgdg.rhel8.x86_64.rpm and pgpool-II-pg12-extensions-4.1.2-1pgdg.rhel8.x86_64.rpm from your site.
Configure the environment of the postgres user by adding:
PATH=/usr/pgpool-12/bin:$PATH
export PGPOOL_INSTALL_DIR=/usr
export PGPOOLDIR=/etc/pgpool-II
export PGBIN=/usr/pgsql-12/bin
export PGLIB=/usr/pgsql-12/lib
so that pgpool_setup can run.
Add the line "wal_keep_segments = 10" in a proper place in the pgpool_setup script to allow proper creation of the secondary data1 database with streaming replication.
Execute:
$ pgpool_setup -s -r
PostgreSQL major version: 123
Starting set up in streaming replication mode
cp: cannot stat '/etc/pgpool-II/pool_hba.conf.sample': No such file or directory
creating startall and shutdownall
creating failover script
creating database cluster /var/lib/pgsql/pgpool-testws/data0...done.
update postgresql.conf
creating pgpool_remote_start
creating basebackup.sh
creating recovery.conf
creating database cluster /var/lib/pgsql/pgpool-testws/data1...done.
update postgresql.conf
creating pgpool_remote_start
creating basebackup.sh
creating recovery.conf
sed: can't read /var/lib/pgsql/pgpool-testws/etc/pool_hba.conf: No such file or directory
sed: can't read /var/lib/pgsql/pgpool-testws/etc/pool_hba.conf: No such file or directory
temporarily start data0 cluster to create extensions
temporarily start pgpool-II to create standby nodes
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | up | 0.500000 | primary | 0 | true | 0 | | | 2020-07-08 15:30:44
 1 | /var/run/postgresql | 11003 | down | 0.500000 | standby | 0 | false | 0 | | | 2020-07-08 15:30:43
(2 rows)

recovery node 1...ERROR: recovery is checking if postmaster is started
DETAIL: postmaster on hostname:"/var/run/postgresql" database:"template1" user:"postgres" failed to start in 90 second

done.
creating follow master script
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | up | 0.500000 | primary | 0 | true | 0 | | | 2020-07-08 15:30:44
 1 | /var/run/postgresql | 11003 | down | 0.500000 | standby | 0 | false | 0 | | | 2020-07-08 15:30:43
(2 rows)

shutdown all

pgpool-II setting for streaming replication mode is done.
To start the whole system, use /var/lib/pgsql/pgpool-testws/startall.
To shutdown the whole system, use /var/lib/pgsql/pgpool-testws/shutdownall.
pcp command user name is "postgres", password is "postgres".
Each PostgreSQL, pgpool-II and pcp port is as follows:
0000001 port is 11002
0000002 port is 11003
pgpool port is 11000
pcp port is 11001
The info above is in README.port.

$ cd pgpool-testws/
$ ./startall
waiting for server to start....6806 2020-07-08 15:32:55 CEST LOG: starting PostgreSQL 12.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5), 64-bit
6806 2020-07-08 15:32:55 CEST LOG: listening on IPv4 address "0.0.0.0", port 11002
6806 2020-07-08 15:32:55 CEST LOG: listening on IPv6 address "::", port 11002
6806 2020-07-08 15:32:55 CEST LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.11002"
6806 2020-07-08 15:32:55 CEST LOG: redirecting log output to logging collector process
6806 2020-07-08 15:32:55 CEST HINT: Future log output will appear in directory "log".
 done
server started
waiting for server to start....6817 2020-07-08 15:32:55 CEST LOG: starting PostgreSQL 12.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5), 64-bit
6817 2020-07-08 15:32:55 CEST LOG: listening on IPv4 address "0.0.0.0", port 11003
6817 2020-07-08 15:32:55 CEST LOG: listening on IPv6 address "::", port 11003
6817 2020-07-08 15:32:55 CEST LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.11003"
6817 2020-07-08 15:32:55 CEST LOG: redirecting log output to logging collector process
6817 2020-07-08 15:32:55 CEST HINT: Future log output will appear in directory "log".
 done
server started
$ psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | up | 0.500000 | primary | 0 | false | 0 | | | 2020-07-08 15:33:07
 1 | /var/run/postgresql | 11003 | up | 0.500000 | standby | 0 | true | 0 | streaming | async | 2020-07-08 15:33:07
(2 rows)

$ /usr/pgsql-12/bin/pg_ctl -D /var/lib/pgsql/pgpool-testws/data0 stop
waiting for server to shut down.... done
server stopped
$ psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | down | 0.500000 | standby | 0 | false | 0 | | | 2020-07-08 15:34:41
 1 | /var/run/postgresql | 11003 | up | 0.500000 | primary | 0 | true | 0 | | | 2020-07-08 15:34:41
(2 rows)

$ pcp_recovery_node -p 11001 -n 0
Password:
ERROR: connection to socket "/tmp/.s.PGSQL.11001" failed with error "No such file or directory"
$ pcp_recovery_node -p 11001 -h 127.0.0.1 -n 0
Password:
ERROR: recovery is checking if postmaster is started
DETAIL: postmaster on hostname:"/var/run/postgresql" database:"template1" user:"postgres" failed to start in 90 second


$ ls data?
data0:
PG_VERSION basebackup.sh global myrecovery.conf pg_dynshmem pg_ident.conf pg_multixact pg_replslot pg_snapshots pg_stat_tmp pg_tblspc pg_wal pgpool_remote_start postgresql.conf
base current_logfiles log pg_commit_ts pg_hba.conf pg_logical pg_notify pg_serial pg_stat pg_subtrans pg_twophase pg_xact postgresql.auto.conf postmaster.opts

data1:
PG_VERSION basebackup.sh log pg_dynshmem pg_logical pg_replslot pg_stat pg_tblspc pg_xact postgresql.auto.conf postmaster.pid
backup_label.old current_logfiles myrecovery.conf pg_hba.conf pg_multixact pg_serial pg_stat_tmp pg_twophase pg_xlog postgresql.conf standby.signal
base global pg_commit_ts pg_ident.conf pg_notify pg_snapshots pg_subtrans pg_wal pgpool_remote_start postmaster.opts
$ /usr/pgsql-12/bin/pg_ctl -D /var/lib/pgsql/pgpool-testws/data0 start
waiting for server to start....7111 2020-07-08 15:41:24 CEST LOG: starting PostgreSQL 12.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5), 64-bit
7111 2020-07-08 15:41:24 CEST LOG: listening on IPv4 address "0.0.0.0", port 11002
7111 2020-07-08 15:41:24 CEST LOG: listening on IPv6 address "::", port 11002
7111 2020-07-08 15:41:24 CEST LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.11002"
7111 2020-07-08 15:41:24 CEST LOG: redirecting log output to logging collector process
7111 2020-07-08 15:41:24 CEST HINT: Future log output will appear in directory "log".
 done
server started
$ psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | down | 0.500000 | standby | 0 | false | 0 | | | 2020-07-08 15:34:41
 1 | /var/run/postgresql | 11003 | up | 0.500000 | primary | 0 | true | 0 | | | 2020-07-08 15:34:41
(2 rows)
$ pcp_attach_node -h 127.0.0.1 -p 11001 0
Password:
pcp_attach_node -- Command Successful
[postgres@CentOS81 pgpool-testws]$ psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | up | 0.500000 | primary | 0 | true | 0 | | | 2020-07-08 15:42:14
 1 | /var/run/postgresql | 11003 | up | 0.500000 | standby | 0 | false | 56 | | | 2020-07-08 15:42:10
(2 rows)

$
Additional Information:
The pgbench command was not executed because it is not included in the rpm. CentOS was installed with LANG=pl_PL.UTF-8 but changed to en_EN.UTF-8 for the postgres user.

It would be nice if you made the introductory replication procedure work as described and adjusted the PostgreSQL file location paths to match the CentOS rpm distribution.

I will not be available from 10 July till 24 July, so I may answer your questions after this time.

Contents of logs:
"failover.log":
Wed Jul 8 15:34:38 CEST 2020
failed_node_id 0 failed_host_name /var/run/postgresql failed_port 11002 failed_db_cluster /var/lib/pgsql/pgpool-testws/data0 new_master_id 1 old_master_id 0 new_master_host_name /var/run/postgresql old_primary_node_id 0 new_master_port_number 11003 new_master_db_cluster /var/lib/pgsql/pgpool-testws/data1
/usr/pgsql-12/bin/pg_ctl -D /var/lib/pgsql/pgpool-testws/data1 promote
waiting for server to promote.... done
server promoted
Wed Jul 8 15:34:41 CEST 2020
failed_node_id 0 failed_host_name /var/run/postgresql failed_port 11002 failed_db_cluster /var/lib/pgsql/pgpool-testws/data0 new_master_id 1 old_master_id 0 new_master_host_name /var/run/postgresql old_primary_node_id 0 new_master_port_number 11003 new_master_db_cluster /var/lib/pgsql/pgpool-testws/data1
/var/lib/pgsql/pgpool-testws/data0 is not running. skipping follow master command.

"pgpool_status":
up
up

"recovery.log":

Wed Jul 8 15:30:44 CEST 2020
pg_rewind starts
pg_rewind: connected to server
pg_rewind: fatal: source and target clusters are from different systems
pg_rewind failed. Fall back to rsync
source: /var/lib/pgsql/pgpool-testws/data0 dest: /var/lib/pgsql/pgpool-testws/data1

Wed Jul 8 15:36:20 CEST 2020
pg_rewind starts
pg_rewind: connected to server
pg_rewind: servers diverged at WAL location 0/30020E0 on timeline 1
pg_rewind: no rewind required

"pgpool.log":
2020-07-08 15:32:55: pid 6825: LOG: Backend status file /var/lib/pgsql/pgpool-testws/log/pgpool_status discarded
2020-07-08 15:32:55: pid 6825: LOG: memory cache initialized
2020-07-08 15:32:55: pid 6825: DETAIL: memcache blocks :64
2020-07-08 15:32:55: pid 6825: LOG: pool_discard_oid_maps: discarded memqcache oid maps
2020-07-08 15:32:55: pid 6825: LOG: Setting up socket for ::1:11000
2020-07-08 15:32:55: pid 6825: LOG: Setting up socket for 127.0.0.1:11000
2020-07-08 15:32:55: pid 6825: LOG: find_primary_node_repeatedly: waiting for finding a primary node
2020-07-08 15:32:55: pid 6825: LOG: find_primary_node: primary node is 0
2020-07-08 15:32:55: pid 6825: LOG: find_primary_node: standby node is 1
2020-07-08 15:32:55: pid 6862: LOG: PCP process: 6862 started
2020-07-08 15:32:55: pid 6825: LOG: pgpool-II successfully started. version 4.1.2 (karasukiboshi)
2020-07-08 15:32:55: pid 6825: LOG: node status[0]: 1
2020-07-08 15:32:55: pid 6825: LOG: node status[1]: 2
2020-07-08 15:33:07: pid 6858: LOG: DB node id: 0 backend pid: 6875 statement: SELECT version()
2020-07-08 15:33:07: pid 6858: LOG: pool_reuse_block: blockid: 0
2020-07-08 15:33:07: pid 6858: CONTEXT: while searching system catalog, When relcache is missed
2020-07-08 15:33:07: pid 6858: LOG: DB node id: 0 backend pid: 6875 statement: DISCARD ALL
2020-07-08 15:33:07: pid 6858: LOG: DB node id: 1 backend pid: 6876 statement: DISCARD ALL
2020-07-08 15:34:35: pid 6864: LOG: failed to connect to PostgreSQL server by unix domain socket
2020-07-08 15:34:35: pid 6864: DETAIL: connect to "/var/run/postgresql/.s.PGSQL.11002" failed with error "No such file or directory"
2020-07-08 15:34:35: pid 6864: ERROR: failed to make persistent db connection
2020-07-08 15:34:35: pid 6864: DETAIL: connection to host:"/var/run/postgresql:11002" failed
2020-07-08 15:34:35: pid 6864: LOG: health check retrying on DB node: 0 (round:1)
2020-07-08 15:34:35: pid 6863: LOG: failed to connect to PostgreSQL server by unix domain socket
2020-07-08 15:34:35: pid 6863: DETAIL: connect to "/var/run/postgresql/.s.PGSQL.11002" failed with error "No such file or directory"
2020-07-08 15:34:35: pid 6863: ERROR: failed to make persistent db connection
2020-07-08 15:34:35: pid 6863: DETAIL: connection to host:"/var/run/postgresql:11002" failed
2020-07-08 15:34:36: pid 6864: LOG: failed to connect to PostgreSQL server by unix domain socket
2020-07-08 15:34:36: pid 6864: DETAIL: connect to "/var/run/postgresql/.s.PGSQL.11002" failed with error "No such file or directory"
2020-07-08 15:34:36: pid 6864: ERROR: failed to make persistent db connection
2020-07-08 15:34:36: pid 6864: DETAIL: connection to host:"/var/run/postgresql:11002" failed
2020-07-08 15:34:36: pid 6864: LOG: health check retrying on DB node: 0 (round:2)
2020-07-08 15:34:37: pid 6864: LOG: failed to connect to PostgreSQL server by unix domain socket
2020-07-08 15:34:37: pid 6864: DETAIL: connect to "/var/run/postgresql/.s.PGSQL.11002" failed with error "No such file or directory"
2020-07-08 15:34:37: pid 6864: ERROR: failed to make persistent db connection
2020-07-08 15:34:37: pid 6864: DETAIL: connection to host:"/var/run/postgresql:11002" failed
2020-07-08 15:34:37: pid 6864: LOG: health check retrying on DB node: 0 (round:3)
2020-07-08 15:34:38: pid 6864: LOG: failed to connect to PostgreSQL server by unix domain socket
2020-07-08 15:34:38: pid 6864: DETAIL: connect to "/var/run/postgresql/.s.PGSQL.11002" failed with error "No such file or directory"
2020-07-08 15:34:38: pid 6864: ERROR: failed to make persistent db connection
2020-07-08 15:34:38: pid 6864: DETAIL: connection to host:"/var/run/postgresql:11002" failed
2020-07-08 15:34:38: pid 6864: LOG: health check failed on node 0 (timeout:0)
2020-07-08 15:34:38: pid 6864: LOG: received degenerate backend request for node_id: 0 from pid [6864]
2020-07-08 15:34:38: pid 6825: LOG: Pgpool-II parent process has received failover request
2020-07-08 15:34:38: pid 6825: LOG: starting degeneration. shutdown host /var/run/postgresql(11002)
2020-07-08 15:34:38: pid 6825: LOG: Restart all children
2020-07-08 15:34:38: pid 6825: LOG: execute command: /var/lib/pgsql/pgpool-testws/etc/failover.sh 0 /var/run/postgresql 11002 /var/lib/pgsql/pgpool-testws/data0 1 0 /var/run/postgresql 0 11003 /var/lib/pgsql/pgpool-testws/data1
2020-07-08 15:34:41: pid 6825: LOG: find_primary_node_repeatedly: waiting for finding a primary node
2020-07-08 15:34:41: pid 6825: LOG: find_primary_node: primary node is 1
2020-07-08 15:34:41: pid 6825: LOG: starting follow degeneration. shutdown host /var/run/postgresql(11002)
2020-07-08 15:34:41: pid 6825: LOG: failover: 1 follow backends have been degenerated
2020-07-08 15:34:41: pid 6825: LOG: failover: set new primary node: 1
2020-07-08 15:34:41: pid 6825: LOG: failover: set new master node: 1
2020-07-08 15:34:41: pid 6926: LOG: start triggering follow command.
2020-07-08 15:34:41: pid 6926: LOG: execute command: /var/lib/pgsql/pgpool-testws/etc/follow_master.sh 0 /var/run/postgresql 11002 /var/lib/pgsql/pgpool-testws/data0 1 0 /var/run/postgresql 0 11003 /var/lib/pgsql/pgpool-testws/data1
failover done. shutdown host /var/run/postgresql(11002)2020-07-08 15:34:41: pid 6863: ERROR: Failed to check replication time lag
2020-07-08 15:34:41: pid 6863: DETAIL: No persistent db connection for the node 0
2020-07-08 15:34:41: pid 6863: HINT: check sr_check_user and sr_check_password
2020-07-08 15:34:41: pid 6863: CONTEXT: while checking replication time lag
2020-07-08 15:34:41: pid 6863: LOG: worker process received restart request
2020-07-08 15:34:41: pid 6825: LOG: failover done. shutdown host /var/run/postgresql(11002)
2020-07-08 15:34:41: pid 6961: LOG: DB node id: 1 backend pid: 6963 statement: SELECT version()
2020-07-08 15:34:41: pid 6961: LOG: DB node id: 1 backend pid: 6963 statement: DISCARD ALL
2020-07-08 15:34:42: pid 6862: LOG: restart request received in pcp child process
2020-07-08 15:34:42: pid 6825: LOG: PCP child 6862 exits with status 0 in failover()
2020-07-08 15:34:42: pid 6825: LOG: fork a new PCP child pid 6964 in failover()
2020-07-08 15:34:42: pid 6825: LOG: child process with pid: 6828 exits with status 256
2020-07-08 15:34:42: pid 6825: LOG: child process with pid: 6828 exited with success and will not be restarted
2020-07-08 15:34:42: pid 6825: LOG: child process with pid: 6829 exits with status 256
2020-07-08 15:34:42: pid 6825: LOG: child process with pid: 6829 exited with success and will not be restarted
2020-07-08 15:34:42: pid 6825: LOG: child process with pid: 6830 exits with status 256
2020-07-08 15:34:42: pid 6825: LOG: child process with pid: 6830 exited with success and will not be restarted
2020-07-08 15:34:42: pid 6964: LOG: PCP process: 6964 started
2020-07-08 15:34:42: pid 6825: LOG: child process with pid: 6831 exits with status 256
2020-07-08 15:34:42: pid 6825: LOG: child process with pid: 6831 exited with success and will not be restarted
...
2020-07-08 15:34:42: pid 6825: LOG: child process with pid: 6857 exits with status 256
2020-07-08 15:34:42: pid 6825: LOG: child process with pid: 6857 exited with success and will not be restarted
2020-07-08 15:34:42: pid 6825: LOG: child process with pid: 6858 exits with status 256
2020-07-08 15:34:42: pid 6825: LOG: child process with pid: 6858 exited with success and will not be restarted
2020-07-08 15:34:42: pid 6825: LOG: child process with pid: 6859 exits with status 256
...
2020-07-08 15:34:42: pid 6825: LOG: worker child process with pid: 6863 exits with status 256
2020-07-08 15:34:42: pid 6825: LOG: fork a new worker child process with pid: 6965
2020-07-08 15:34:42: pid 6825: LOG: child process with pid: 6926 exits with status 0
2020-07-08 15:34:42: pid 6825: LOG: child process with pid: 6926 exited with success and will not be restarted
2020-07-08 15:36:20: pid 6964: LOG: forked new pcp worker, pid=7002 socket=8
2020-07-08 15:36:20: pid 7002: LOG: starting recovering node 0
2020-07-08 15:36:20: pid 7002: LOG: executing recovery
2020-07-08 15:36:20: pid 7002: DETAIL: starting recovery command: "SELECT pgpool_recovery('basebackup.sh', 'localhost', '/var/lib/pgsql/pgpool-testws/data0', '11003', 0, '11002')"
2020-07-08 15:36:20: pid 7002: LOG: executing recovery
2020-07-08 15:36:20: pid 7002: DETAIL: disabling statement_timeout
2020-07-08 15:36:20: pid 7002: LOG: node recovery, 1st stage is done
2020-07-08 15:36:20: pid 7002: LOG: checking if postmaster is started
2020-07-08 15:36:20: pid 7002: DETAIL: trying to connect to postmaster on hostname:/var/run/postgresql database:postgres user:postgres (retry 0 times)
2020-07-08 15:36:20: pid 7002: LOG: checking if postmaster is started
2020-07-08 15:36:20: pid 7002: DETAIL: failed to connect to postmaster on hostname:/var/run/postgresql database:postgres user:postgres
2020-07-08 15:36:23: pid 7002: LOG: checking if postmaster is started
2020-07-08 15:36:23: pid 7002: DETAIL: trying to connect to postmaster on hostname:/var/run/postgresql database:postgres user:postgres (retry 1 times)
2020-07-08 15:36:23: pid 7002: LOG: checking if postmaster is started
2020-07-08 15:36:23: pid 7002: DETAIL: failed to connect to postmaster on hostname:/var/run/postgresql database:postgres user:postgres
2020-07-08 15:36:26: pid 7002: LOG: checking if postmaster is started
2020-07-08 15:36:26: pid 7002: DETAIL: trying to connect to postmaster on hostname:/var/run/postgresql database:postgres user:postgres (retry 2 times)
2020-07-08 15:36:26: pid 7002: LOG: checking if postmaster is started
2020-07-08 15:36:26: pid 7002: DETAIL: failed to connect to postmaster on hostname:/var/run/postgresql database:postgres user:postgres
2020-07-08 15:36:29: pid 7002: LOG: checking if postmaster is started
2020-07-08 15:36:29: pid 7002: DETAIL: trying to connect to postmaster on hostname:/var/run/postgresql database:postgres user:postgres (retry 3 times)
2020-07-08 15:36:29: pid 7002: LOG: checking if postmaster is started
2020-07-08 15:36:29: pid 7002: DETAIL: failed to connect to postmaster on hostname:/var/run/postgresql database:postgres user:postgres
2020-07-08 15:36:32: pid 7002: LOG: checking if postmaster is started
2020-07-08 15:36:32: pid 7002: DETAIL: trying to connect to postmaster on hostname:/var/run/postgresql database:template1 user:postgres (retry 0 times)
2020-07-08 15:36:32: pid 7002: LOG: checking if postmaster is started
2020-07-08 15:36:32: pid 7002: DETAIL: failed to connect to postmaster on hostname:/var/run/postgresql database:template1 user:postgres
2020-07-08 15:36:35: pid 7002: LOG: checking if postmaster is started
2020-07-08 15:36:35: pid 7002: DETAIL: trying to connect to postmaster on hostname:/var/run/postgresql database:template1 user:postgres (retry 1 times)
2020-07-08 15:36:35: pid 7002: LOG: checking if postmaster is started
2020-07-08 15:36:35: pid 7002: DETAIL: failed to connect to postmaster on hostname:/var/run/postgresql database:template1 user:postgres
2020-07-08 15:36:38: pid 7002: LOG: checking if postmaster is started
2020-07-08 15:36:38: pid 7002: DETAIL: trying to connect to postmaster on hostname:/var/run/postgresql database:template1 user:postgres (retry 2 times)
...
2020-07-08 15:38:02: pid 7002: LOG: checking if postmaster is started
2020-07-08 15:38:02: pid 7002: DETAIL: trying to connect to postmaster on hostname:/var/run/postgresql database:template1 user:postgres (retry 30 times)
2020-07-08 15:38:02: pid 7002: LOG: checking if postmaster is started
2020-07-08 15:38:02: pid 7002: DETAIL: failed to connect to postmaster on hostname:/var/run/postgresql database:template1 user:postgres
2020-07-08 15:38:05: pid 7002: ERROR: recovery is checking if postmaster is started
2020-07-08 15:38:05: pid 7002: DETAIL: postmaster on hostname:"/var/run/postgresql" database:"template1" user:"postgres" failed to start in 90 second
2020-07-08 15:38:05: pid 6964: LOG: PCP process with pid: 7002 exit with SUCCESS.
2020-07-08 15:38:05: pid 6964: LOG: PCP process with pid: 7002 exits with status 0
2020-07-08 15:39:41: pid 6825: LOG: child process with pid: 6961 exits with status 256
2020-07-08 15:39:41: pid 6825: LOG: fork a new child process with pid: 7070
2020-07-08 15:41:33: pid 6960: LOG: DB node id: 1 backend pid: 7123 statement: SELECT version()
2020-07-08 15:41:33: pid 6960: LOG: DB node id: 1 backend pid: 7123 statement: DISCARD ALL
2020-07-08 15:42:10: pid 6964: LOG: forked new pcp worker, pid=7137 socket=8
2020-07-08 15:42:10: pid 7137: LOG: received failback request for node_id: 0 from pid [7137]
2020-07-08 15:42:10: pid 6825: LOG: Pgpool-II parent process has received failover request
2020-07-08 15:42:10: pid 6825: LOG: starting fail back. reconnect host /var/run/postgresql(11002)
2020-07-08 15:42:10: pid 6825: LOG: Node 1 is not down (status: 2)
2020-07-08 15:42:10: pid 6964: LOG: PCP process with pid: 7137 exit with SUCCESS.
2020-07-08 15:42:10: pid 6964: LOG: PCP process with pid: 7137 exits with status 0
2020-07-08 15:42:10: pid 6825: LOG: Do not restart children because we are failing back node id 0 host: /var/run/postgresql port: 11002 and we are in streaming replication mode and not all backends were down
2020-07-08 15:42:10: pid 6825: LOG: find_primary_node_repeatedly: waiting for finding a primary node
2020-07-08 15:42:10: pid 6825: LOG: find_primary_node: primary node is 0
2020-07-08 15:42:10: pid 6825: LOG: failover: set new primary node: 0
2020-07-08 15:42:10: pid 6825: LOG: failover: set new master node: 0
2020-07-08 15:42:10: pid 6825: LOG: failback done. reconnect host /var/run/postgresql(11002)
2020-07-08 15:42:10: pid 6965: LOG: worker process received restart request
2020-07-08 15:42:11: pid 6964: LOG: restart request received in pcp child process
2020-07-08 15:42:11: pid 6825: LOG: PCP child 6964 exits with status 0 in failover()
2020-07-08 15:42:11: pid 6825: LOG: fork a new PCP child pid 7140 in failover()
2020-07-08 15:42:11: pid 6825: LOG: worker child process with pid: 6965 exits with status 256
2020-07-08 15:42:11: pid 7140: LOG: PCP process: 7140 started
2020-07-08 15:42:11: pid 6825: LOG: fork a new worker child process with pid: 7141
2020-07-08 15:42:11: pid 7141: LOG: get_query_result falied: status: -2
2020-07-08 15:42:11: pid 7141: CONTEXT: while checking replication time lag
2020-07-08 15:42:14: pid 6960: LOG: failover or failback event detected
2020-07-08 15:42:14: pid 6960: DETAIL: restarting myself
2020-07-08 15:42:14: pid 6959: LOG: failover or failback event detected
2020-07-08 15:42:14: pid 6959: DETAIL: restarting myself
2020-07-08 15:42:14: pid 7070: LOG: selecting backend connection
2020-07-08 15:42:14: pid 7070: DETAIL: failover or failback event detected, discarding existing connections
2020-07-08 15:42:14: pid 6825: LOG: child process with pid: 6959 exits with status 256
2020-07-08 15:42:14: pid 6825: LOG: fork a new child process with pid: 7147
2020-07-08 15:42:14: pid 6825: LOG: child process with pid: 6960 exits with status 256
2020-07-08 15:42:14: pid 6825: LOG: fork a new child process with pid: 7148
2020-07-08 15:42:14: pid 7070: LOG: DB node id: 0 backend pid: 7145 statement: SELECT version()
2020-07-08 15:42:14: pid 7070: LOG: DB node id: 0 backend pid: 7145 statement: DISCARD ALL
2020-07-08 15:42:21: pid 7141: LOG: get_query_result falied: status: -2
2020-07-08 15:42:21: pid 7141: CONTEXT: while checking replication time lag
2020-07-08 15:42:31: pid 7141: LOG: get_query_result falied: status: -2
2020-07-08 15:42:31: pid 7141: CONTEXT: while checking replication time lag
2020-07-08 15:42:42: pid 7141: LOG: get_query_result falied: status: -2
...
Tags: failover, master slave, pcp commands, pcp_recovery_node

Activities

t-ishii

2020-07-09 11:47

developer   ~0003429

First of all, we recommend installing the Pgpool-II rpms from here:
https://pgpool.net/mediawiki/index.php/Yum_Repository
because they are maintained by active developers of Pgpool-II.

Anyway,

> Add the line "wal_keep_segments = 10" in a proper place in the pgpool_setup script to allow proper creation of the secondary data1 database with streaming replication.
How exactly did you modify pgpool_setup? Please share the diff.
Also, please share the PostgreSQL log under data0. During recovery, Pgpool-II accesses the primary PostgreSQL, so it is important to analyze that log.

wsz

2020-07-09 17:28

reporter   ~0003430

I added "wal_keep_segments = 10" so that your pgpool_setup script, run with the "-s -r" options, could successfully complete establishing the standby database. Without this parameter the first WAL segment was removed from the master's pg_wal and the standby server could not start. I added this line as a temporary solution WITHOUT any detailed analysis of how it will behave with other options, so please keep this in mind:
$ diff /usr/bin/pgpool_setup pgpool_setup-ws
321a322
> echo "wal_keep_segments = 10" >> $PGCONF
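For reference, the one-line diff above amounts to appending the setting to the freshly created postgresql.conf; as a sketch (the wrapper function name is mine for illustration, and $PGCONF is the variable pgpool_setup uses internally for the conf path):

```shell
# Sketch of the workaround: append wal_keep_segments to a newly created
# postgresql.conf so early WAL segments are retained for the standby.
# (The helper name is illustrative; pgpool_setup itself just runs the echo.)
add_wal_keep_segments() {
  conf="$1"
  echo "wal_keep_segments = 10" >> "$conf"
}

# Example: add_wal_keep_segments "$PGCONF"
```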

I downloaded the 4.1.2-1 rpms using the page "https://www.pgpool.net/mediawiki/index.php/Downloads" and the link for CentOS/RHEL 8, navigating from the pgpool.net homepage. Please note that "https://pgpool.net/mediawiki/index.php/Yum_Repository" does not offer rpms for CentOS 8, and the older version 4.1-1 is listed for CentOS 7.

Please make your web pages more consistent.

Before using the rpms from www.pgpool.net I got the same results using the rpms from https://download.postgresql.org/pub/repos/yum/12/redhat/rhel-8.2-x86_64/ . I got the same errors as described. To verify that this is not a PostgreSQL distribution problem, I removed their rpms and installed yours from pgpool.net.

I will not be available from tomorrow until 24 July, which is why I described a scenario that can easily be reproduced with fresh CentOS and Pgpool-II installations, and attached the logs from the data0 and data1 databases.
data0-Wednesday.log (131,200 bytes)
data1-Wednesday.log (90,150 bytes)

t-ishii

2020-07-10 09:59

developer   ~0003431

I misunderstood and thought you were not using our Pgpool-II RPMs. Sorry for this. Regarding our yum repository page, I have forwarded your request to Pengbo, who is the maintainer of that web page.

I have looked into the log closely and found that the postmaster of data0 actually had started as a standby server. However, pgpool did not recognize it because the recovery process failed to connect to the postmaster. Interestingly, pgpool first tried to connect to the postgres database and failed, then tried the template1 database but could not connect either. This is pretty strange, and I suspect there is something specific to your environment (CentOS 8.2). I will try to reproduce the issue when I have spare time.

t-ishii

2020-07-19 20:21

developer   ~0003432

I have successfully run pgpool_setup on CentOS 8.2. I installed Pgpool-II from source code. The one thing I had to do was:
export PGSOCKET_DIR=/var/run/postgresql
before running pgpool_setup.
Since the default Unix socket directory is /tmp, you need to adjust it to the default of the PostgreSQL package.
I didn't have time to try with the Pgpool-II rpm, though.
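The fix above can be sketched as a pre-step to pgpool_setup (paths as reported in this thread; whether the override is needed depends on how your PostgreSQL package was built):

```shell
# PGDG packages build PostgreSQL with /var/run/postgresql as the Unix
# socket directory, while pgpool_setup defaults to /tmp; export the
# override before generating the test clusters.
export PGSOCKET_DIR=/var/run/postgresql
# then run: pgpool_setup -s -r
```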

wsz

2020-07-29 18:41

reporter   ~0003444

Please note that the problem was reported against the "pcp_recovery_node" command. I previously used the PGSOCKET_DIR variable, but the final results were the same as with this variable unset.
I repeated the test with the following variables set in ~/.bash_profile:
#export PGPOOL_INSTALL_DIR=/usr/pgpool-12
export PGPOOL_INSTALL_DIR=/usr # PGPools original download
#export PGPOOLDIR=/etc/pgpool-II-12
export PGPOOLDIR=/etc/pgpool-II # PGPools original download
export PGBIN=/usr/pgsql-12/bin
export PGLIB=/usr/pgsql-12/lib
export PGSOCKET_DIR=/var/run/postgresql

and the following variables are set and visible in a session:
$ set |grep PG
PGBIN=/usr/pgsql-12/bin
PGDATA=/var/lib/pgsql/12/data
PGLIB=/usr/pgsql-12/lib
PGPOOLDIR=/etc/pgpool-II
PGPOOL_INSTALL_DIR=/usr
PGSOCKET_DIR=/var/run/postgresql

Now I execute commands:
$ pcp_recovery_node -p 11001 -n 0
Password:
ERROR: connection to socket "/tmp/.s.PGSQL.11001" failed with error "No such file or directory"
IT FAILS immediately. Why does it look for the socket in the /tmp directory?

NEXT I execute with -h option and it works till the moment of checking connection to 11001:
$ pcp_recovery_node -p 11001 -h 127.0.0.1 -n 0
Password:
ERROR: recovery is checking if postmaster is started
DETAIL: postmaster on hostname:"/var/run/postgresql" database:"template1" user:"postgres" failed to start in 90 second

This command does its job but FAILS to START the data0 Postgres (data1 is running). This is the primary problem.
Another one: why does it report a socket problem instead of a TCP timeout when the -h option is used?

Parts of pgpool.log:
 ...
2020-07-29 11:15:52: pid 2263: LOG: starting recovering node 0
2020-07-29 11:15:52: pid 2263: LOG: executing recovery
2020-07-29 11:15:52: pid 2263: DETAIL: starting recovery command: "SELECT pgpool_recovery('basebackup.sh', 'localhost', '/var/lib/pgsql/pgpool-testws/data0', '11003', 0, '11002')"
2020-07-29 11:15:52: pid 2263: LOG: executing recovery
2020-07-29 11:15:52: pid 2263: DETAIL: disabling statement_timeout
2020-07-29 11:15:52: pid 2263: LOG: node recovery, 1st stage is done
2020-07-29 11:15:52: pid 2263: LOG: checking if postmaster is started
2020-07-29 11:15:52: pid 2263: DETAIL: trying to connect to postmaster on hostname:/var/run/postgresql database:postgres user:postgres (retry 0 times)
2020-07-29 11:15:52: pid 2263: LOG: checking if postmaster is started
2020-07-29 11:15:52: pid 2263: DETAIL: failed to connect to postmaster on hostname:/var/run/postgresql database:postgres user:postgres
2020-07-29 11:15:55: pid 2263: LOG: checking if postmaster is started
2020-07-29 11:15:55: pid 2263: DETAIL: trying to connect to postmaster on hostname:/var/run/postgresql database:postgres user:postgres (retry 1 times)
2020-07-29 11:15:55: pid 2263: LOG: checking if postmaster is started
2020-07-29 11:15:55: pid 2263: DETAIL: failed to connect to postmaster on hostname:/var/run/postgresql database:postgres user:postgres
2020-07-29 11:15:58: pid 2263: LOG: checking if postmaster is started
2020-07-29 11:15:58: pid 2263: DETAIL: trying to connect to postmaster on hostname:/var/run/postgresql database:postgres user:postgres (retry 2 times)
...
2020-07-29 11:17:34: pid 2263: DETAIL: trying to connect to postmaster on hostname:/var/run/postgresql database:template1 user:postgres (retry 30 times)
2020-07-29 11:17:34: pid 2263: LOG: checking if postmaster is started
2020-07-29 11:17:34: pid 2263: DETAIL: failed to connect to postmaster on hostname:/var/run/postgresql database:template1 user:postgres
2020-07-29 11:17:37: pid 2263: ERROR: recovery is checking if postmaster is started
2020-07-29 11:17:37: pid 2263: DETAIL: postmaster on hostname:"/var/run/postgresql" database:"template1" user:"postgres" failed to start in 90 second
...

wsz

2020-07-29 18:54

reporter   ~0003445

The line:
"NEXT I execute with -h option and it works till the moment of checking connection to 11001:"
is incorrect. It should be:
"NEXT I execute with -h option and it works till the moment of checking connection to database template1 (data0 probably):"

t-ishii

2020-07-30 17:40

developer   ~0003446

> ERROR: connection to socket "/tmp/.s.PGSQL.11001" failed with error "No such file or directory"
> IT FAILS immediately. Why does it look for the socket in the /tmp directory?
Yeah, currently the pcp clients unconditionally try to connect to the socket in /tmp. Probably the RPM should patch this.

> ERROR: recovery is checking if postmaster is started
> DETAIL: postmaster on hostname:"/var/run/postgresql" database:"template1" user:"postgres" failed to start in 90 second
I suspect the postmaster did not start. After online recovery succeeds, Pgpool-II tries to start the postmaster using the pgpool_remote_start script in data0. The script uses ssh, which requires that the postgres user can ssh to localhost without a password, i.e. you should be able to run something like this without being asked for a password:

ssh localhost ls

Can you do that?
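For reference, one common way to set this up for the postgres user is the following (a generic recipe, not taken from the Pgpool-II docs; the key type and paths are conventional choices):

```shell
# Generate a passwordless key for the postgres user (skip if one exists).
ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519
# Authorize the key for logins to the same account (localhost).
cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys
# sshd refuses keys with loose permissions, so tighten them.
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
# Verify: this should print a directory listing without any password prompt.
ssh localhost ls
```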

wsz

2020-07-31 23:09

reporter   ~0003447

It started to work with default settings after establishing passwordless ssh access to localhost. This is not obvious for a single-server installation and is not described in the installation manual (only in the user manual). Maybe someone could add it to the pgpool_setup section of the installation manual and to its manual page?

wsz

2020-08-01 00:10

reporter   ~0003448

Unfortunately, switching back to a working master data0 / slave data1 configuration does not work correctly when executing these commands:
pgpool_setup -s -r
./startall
/usr/pgsql-12/bin/pg_ctl -D /var/lib/pgsql/pgpool-testws/data0 stop
pcp_recovery_node -p 11001 -h 127.0.0.1 -n 0
data0 and data1 are not in sync afterwards.
Copy of the screen:
[postgres@CentOS81 pgpool-testws]$ pgpool_setup -s -r
PostgreSQL major version: 123
Starting set up in streaming replication mode
cp: cannot stat '/etc/pgpool-II/pool_hba.conf.sample': No such file or directory
creating startall and shutdownall
creating failover script
creating database cluster /var/lib/pgsql/pgpool-testws/data0...done.
update postgresql.conf
creating pgpool_remote_start
creating basebackup.sh
creating recovery.conf
creating database cluster /var/lib/pgsql/pgpool-testws/data1...done.
update postgresql.conf
creating pgpool_remote_start
creating basebackup.sh
creating recovery.conf
sed: can't read /var/lib/pgsql/pgpool-testws/etc/pool_hba.conf: No such file or directory
sed: can't read /var/lib/pgsql/pgpool-testws/etc/pool_hba.conf: No such file or directory
temporarily start data0 cluster to create extensions
temporarily start pgpool-II to create standby nodes
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | up | 0.500000 | primary | 0 | true | 0 | | | 2020-07-31 16:45:04
 1 | /var/run/postgresql | 11003 | down | 0.500000 | standby | 0 | false | 0 | | | 2020-07-31 16:45:03
(2 rows)

recovery node 1...pcp_recovery_node -- Command Successful
done.
creating follow master script
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | up | 0.500000 | primary | 0 | false | 0 | | | 2020-07-31 16:45:04
 1 | /var/run/postgresql | 11003 | up | 0.500000 | standby | 0 | true | 0 | | | 2020-07-31 16:45:08
(2 rows)

shutdown all

pgpool-II setting for streaming replication mode is done.
To start the whole system, use /var/lib/pgsql/pgpool-testws/startall.
To shutdown the whole system, use /var/lib/pgsql/pgpool-testws/shutdownall.
pcp command user name is "postgres", password is "postgres".
Each PostgreSQL, pgpool-II and pcp port is as follows:
0000001 port is 11002
0000002 port is 11003
pgpool port is 11000
pcp port is 11001
The info above is in README.port.
[postgres@CentOS81 pgpool-testws]$ ./startall
waiting for server to start....5825 2020-07-31 16:45:18 CEST LOG: starting PostgreSQL 12.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5), 64-bit
5825 2020-07-31 16:45:18 CEST LOG: listening on IPv4 address "0.0.0.0", port 11002
5825 2020-07-31 16:45:18 CEST LOG: listening on IPv6 address "::", port 11002
5825 2020-07-31 16:45:18 CEST LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.11002"
5825 2020-07-31 16:45:18 CEST LOG: redirecting log output to logging collector process
5825 2020-07-31 16:45:18 CEST HINT: Future log output will appear in directory "log".
 done
server started
waiting for server to start....5836 2020-07-31 16:45:19 CEST LOG: starting PostgreSQL 12.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5), 64-bit
5836 2020-07-31 16:45:19 CEST LOG: listening on IPv4 address "0.0.0.0", port 11003
5836 2020-07-31 16:45:19 CEST LOG: listening on IPv6 address "::", port 11003
5836 2020-07-31 16:45:19 CEST LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.11003"
5836 2020-07-31 16:45:19 CEST LOG: redirecting log output to logging collector process
5836 2020-07-31 16:45:19 CEST HINT: Future log output will appear in directory "log".
 done
server started
[postgres@CentOS81 pgpool-testws]$ psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | up | 0.500000 | primary | 0 | false | 0 | | | 2020-07-31 16:45:52
 1 | /var/run/postgresql | 11003 | up | 0.500000 | standby | 0 | true | 0 | streaming | async | 2020-07-31 16:45:52
(2 rows)

[postgres@CentOS81 pgpool-testws]$ /usr/pgsql-12/bin/pg_ctl -D /var/lib/pgsql/pgpool-testws/data0 stop
waiting for server to shut down.... done
server stopped
[postgres@CentOS81 pgpool-testws]$ psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | down | 0.500000 | standby | 0 | false | 0 | | | 2020-07-31 16:46:14
 1 | /var/run/postgresql | 11003 | up | 0.500000 | primary | 0 | true | 0 | | | 2020-07-31 16:46:14
(2 rows)

[postgres@CentOS81 pgpool-testws]$ pcp_recovery_node -p 11001 -h 127.0.0.1 -n 0
Password:
pcp_recovery_node -- Command Successful
[postgres@CentOS81 pgpool-testws]$ psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | up | 0.500000 | primary | 0 | false | 0 | | | 2020-07-31 16:46:53
 1 | /var/run/postgresql | 11003 | up | 0.500000 | standby | 0 | true | 0 | | | 2020-07-31 16:46:47
(2 rows)

[postgres@CentOS81 pgpool-testws]$ psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | up | 0.500000 | primary | 0 | false | 0 | | | 2020-07-31 16:46:53
 1 | /var/run/postgresql | 11003 | up | 0.500000 | standby | 0 | true | 56 | | | 2020-07-31 16:46:47
(2 rows)

[postgres@CentOS81 pgpool-testws]$ ls data?
data0:
PG_VERSION current_logfiles myrecovery.conf pg_hba.conf pg_multixact pg_serial pg_stat_tmp pg_twophase pgpool_remote_start postmaster.opts
base global pg_commit_ts pg_ident.conf pg_notify pg_snapshots pg_subtrans pg_wal postgresql.auto.conf postmaster.pid
basebackup.sh log pg_dynshmem pg_logical pg_replslot pg_stat pg_tblspc pg_xact postgresql.conf

data1:
PG_VERSION basebackup.sh log pg_dynshmem pg_logical pg_replslot pg_stat pg_tblspc pg_xact postgresql.auto.conf postmaster.pid
backup_label.old current_logfiles myrecovery.conf pg_hba.conf pg_multixact pg_serial pg_stat_tmp pg_twophase pg_xlog postgresql.conf standby.signal
base global pg_commit_ts pg_ident.conf pg_notify pg_snapshots pg_subtrans pg_wal pgpool_remote_start postmaster.opts
[postgres@CentOS81 pgpool-testws]$ vi data1/log/Friday.log
[postgres@CentOS81 pgpool-testws]$ ls -l data?/pg_wal
data0/pg_wal:
total 32768
-rw-------. 1 postgres postgres 16777216 Jul 31 16:47 000000010000000000000003
-rw-------. 1 postgres postgres 16777216 Jul 31 16:45 000000010000000000000004
drwx------. 2 postgres postgres 6 Jul 31 16:45 archive_status

data1/pg_wal:
total 49156
-rw-------. 1 postgres postgres 16777216 Jul 31 16:46 000000010000000000000003
-rw-------. 1 postgres postgres 41 Jul 31 16:46 00000002.history
-rw-------. 1 postgres postgres 16777216 Jul 31 16:46 000000020000000000000003
-rw-------. 1 postgres postgres 16777216 Jul 31 16:45 000000020000000000000004
drwx------. 2 postgres postgres 6 Jul 31 16:45 archive_status
[postgres@CentOS81 pgpool-testws]$

Logs included as attachments
Friday-pgpool.log (46,273 bytes)
data0-Friday.log (37,834 bytes)
data1-Friday.log (29,247 bytes)

t-ishii

2020-08-04 08:21

developer   ~0003451

> Unfortunately, switching back to a working master data0 / slave data1 configuration does not work correctly when executing these commands:
What do you mean by "does not work correctly"? Can you elaborate?

wsz

2020-08-04 17:24

reporter   ~0003452

After executing the command "pcp_recovery_node -p 11001 -h 127.0.0.1 -n 0", data0 is switched to master and data1 to slave, but the process is incorrect because data1 is on timeline 2 and data0 is on timeline 1, so replication is broken. See the data1 log after restarting both databases (nothing was done since my last post):
1637 2020-08-04 10:02:21 CEST LOG: database system was shut down at 2020-07-31 17:12:39 CEST
1637 2020-08-04 10:02:21 CEST LOG: entering standby mode
1637 2020-08-04 10:02:21 CEST LOG: consistent recovery state reached at 0/3002238
1637 2020-08-04 10:02:21 CEST LOG: invalid record length at 0/3002238: wanted 24, got 0
1635 2020-08-04 10:02:21 CEST LOG: database system is ready to accept read only connections
1641 2020-08-04 10:02:21 CEST FATAL: highest timeline 1 of the primary is behind recovery timeline 2
1643 2020-08-04 10:02:21 CEST FATAL: highest timeline 1 of the primary is behind recovery timeline 2
1681 2020-08-04 10:02:21 CEST LOG: statement: SELECT pg_is_in_recovery()
1681 2020-08-04 10:02:21 CEST LOG: statement: SELECT current_setting('server_version_num')
1689 2020-08-04 10:02:21 CEST LOG: statement: SELECT current_setting('server_version_num')
1689 2020-08-04 10:02:21 CEST LOG: statement: SELECT pg_last_wal_replay_lsn()
1689 2020-08-04 10:02:21 CEST LOG: statement: SELECT pg_is_in_recovery()
1690 2020-08-04 10:02:26 CEST FATAL: highest timeline 1 of the primary is behind recovery timeline 2
1693 2020-08-04 10:02:31 CEST FATAL: highest timeline 1 of the primary is behind recovery timeline 2
...

pg_wal directories contents:
$ ls -l data?/pg_wal
data0/pg_wal:
total 32768
-rw-------. 1 postgres postgres 16777216 Aug 4 10:07 000000010000000000000003
-rw-------. 1 postgres postgres 16777216 Jul 31 16:45 000000010000000000000004
drwx------. 2 postgres postgres 6 Jul 31 16:45 archive_status

data1/pg_wal:
total 49156
-rw-------. 1 postgres postgres 16777216 Jul 31 16:46 000000010000000000000003
-rw-------. 1 postgres postgres 41 Jul 31 16:46 00000002.history
-rw-------. 1 postgres postgres 16777216 Jul 31 17:12 000000020000000000000003
-rw-------. 1 postgres postgres 16777216 Jul 31 16:45 000000020000000000000004
drwx------. 2 postgres postgres 6 Jul 31 16:45 archive_status
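
As a side note, the timeline divergence can also be confirmed directly with pg_controldata (assuming the PostgreSQL 12 package path and the pgpool_setup cluster directories used in this report):

```shell
# Print the latest checkpoint timeline of each cluster; diverged clusters
# will show different TimeLineIDs. Paths assume this report's setup.
for d in data0 data1; do
    echo "== $d =="
    /usr/pgsql-12/bin/pg_controldata "/var/lib/pgsql/pgpool-testws/$d" \
        | grep "Latest checkpoint's TimeLineID"
done
```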

Additionally, from my post above, the command psql -p 11000 -c "show pool_nodes" test shows the state of replication as follows.
Status just after executing the "pgpool_setup -s -r" command:
0 | /var/run/postgresql | 11002 | up | 0.500000 | primary | 0 | false | 0 | | | 2020-07-31 16:45:52
1 | /var/run/postgresql | 11003 | up | 0.500000 | standby | 0 | true | 0 | streaming | async | 2020-07-31 16:45:52

and after executing command "pcp_recovery_node -p 11001 -h 127.0.0.1 -n 0":
0 | /var/run/postgresql | 11002 | up | 0.500000 | primary | 0 | false | 0 | | | 2020-07-31 16:46:53
1 | /var/run/postgresql | 11003 | up | 0.500000 | standby | 0 | true | 0 | | | 2020-07-31 16:46:47
There is no "streaming | async" reported.

t-ishii

2020-08-04 21:39

developer   ~0003453

Last edited: 2020-08-04 21:40

I was able to reproduce the problem. It seems the issue happens only with "-r" (use pg_rewind for recovery).
In log/recovery.log:
Tue Aug 4 21:34:40 JST 2020
pg_rewind starts
pg_rewind: connected to server
pg_rewind: fatal: source and target clusters are from different systems
pg_rewind failed. Fall back to rsync
source: /home/t-ishii/work/Pgpool-II/current/c/data0 dest: /home/t-ishii/work/Pgpool-II/current/c/data1

Tue Aug 4 21:35:32 JST 2020
pg_rewind starts
pg_rewind: connected to server
pg_rewind: servers diverged at WAL location 0/50000A0 on timeline 1
pg_rewind: no rewind required

So it seems there is something wrong with the usage of pg_rewind in the data[01]/basebackup.sh script generated by pgpool_setup. I will look into this.

t-ishii

2020-08-05 09:26

developer   ~0003454

It turned out that the problem we are seeing is caused by a bug in pgpool_setup.
When pgpool_setup uses pg_rewind, it did not create the standby.signal file on the target node (node 0); instead it created the file on the primary node (node 1). As a result, the recovered node booted up as a primary node rather than a standby node.
When pg_rewind is not used, the standby.signal file created on node 1 is rsynced to node 0, so the node booted up as a standby node.

I attached the diff file to fix the problem. Can you please try it out?
pgpool_setup.diff (518 bytes)   

wsz

2020-08-05 18:35

reporter   ~0003455

After applying the patch (to pgpool_setup from the rpm distribution) it is better, but still not OK.
See the session log with my comments in capital letters:
$ psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | up | 0.500000 | primary | 0 | false | 0 | | | 2020-08-05 11:15:22
 1 | /var/run/postgresql | 11003 | up | 0.500000 | standby | 0 | true | 0 | streaming | async | 2020-08-05 11:15:22
(2 rows)
# TILL NOW OK.
$ /usr/pgsql-12/bin/pg_ctl -D /var/lib/pgsql/pgpool-testws/data0 stop
waiting for server to shut down.... done
server stopped
$ psql -p 11000 -c "show pool_nodes" test
psql: error: could not connect to server: FATAL: failed to create a backend connection
DETAIL: executing failover on backend
# TILL NOW OK.
$ psql -p 11000 -c "show pool_nodes" test
psql: error: could not connect to server: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
# I DO NOT LIKE SUCH MESSAGES EVEN IF THEY ARE TEMPORARY.
$ psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | down | 0.500000 | standby | 0 | false | 0 | | | 2020-08-05 11:15:38
 1 | /var/run/postgresql | 11003 | up | 0.500000 | primary | 0 | true | 0 | | | 2020-08-05 11:15:38
(2 rows)

# TILL NOW OK.
$ pcp_recovery_node -p 11001 -h 127.0.0.1 -n 0
Password:
pcp_recovery_node -- Command Successful
$ psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | up | 0.500000 | standby | 0 | false | 0 | streaming | async | 2020-08-05 11:16:53
 1 | /var/run/postgresql | 11003 | up | 0.500000 | primary | 0 | true | 0 | | | 2020-08-05 11:15:38
(2 rows)

# TILL NOW LOOKS OK BUT ...

$ psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | up | 0.500000 | standby | 0 | true | 0 | streaming | async | 2020-08-05 11:16:53
 1 | /var/run/postgresql | 11003 | up | 0.500000 | primary | 0 | false | 0 | | | 2020-08-05 11:15:38
(2 rows)

# IS LOAD_BALANCE_MODE CHANGED/UPDATED WITH DELAY? BUT IT SEEMS TO BE OK.
$ ./shutdownall
2020-08-05 11:17:57: pid 30490: LOG: stop request sent to pgpool. waiting for termination...
.done.
waiting for server to shut down.... done
server stopped
waiting for server to shut down.... done
server stopped
#UP TO THIS MOMENT!
$ ./startall
waiting for server to start....30507 2020-08-05 11:18:03 CEST LOG: starting PostgreSQL 12.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5), 64-bit
30507 2020-08-05 11:18:03 CEST LOG: listening on IPv4 address "0.0.0.0", port 11002
30507 2020-08-05 11:18:03 CEST LOG: listening on IPv6 address "::", port 11002
30507 2020-08-05 11:18:03 CEST LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.11002"
30507 2020-08-05 11:18:03 CEST LOG: redirecting log output to logging collector process
30507 2020-08-05 11:18:03 CEST HINT: Future log output will appear in directory "log".
 done
server started
waiting for server to start....30517 2020-08-05 11:18:03 CEST LOG: starting PostgreSQL 12.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5), 64-bit
30517 2020-08-05 11:18:03 CEST LOG: listening on IPv4 address "0.0.0.0", port 11003
30517 2020-08-05 11:18:03 CEST LOG: listening on IPv6 address "::", port 11003
30517 2020-08-05 11:18:03 CEST LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.11003"
30517 2020-08-05 11:18:03 CEST LOG: redirecting log output to logging collector process
30517 2020-08-05 11:18:03 CEST HINT: Future log output will appear in directory "log".
 done
server started
$ psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0 | /var/run/postgresql | 11002 | up | 0.500000 | standby | 0 | false | 0 | | | 2020-08-05 11:18:07
 1 | /var/run/postgresql | 11003 | up | 0.500000 | standby | 0 | true | 0 | | | 2020-08-05 11:18:07
(2 rows)

]$ ls -l data?/standby.signal
-rw-------. 1 postgres postgres 0 Aug 5 11:16 data0/standby.signal
-rw-------. 1 postgres postgres 0 Aug 5 11:16 data1/standby.signal
$
# WE ENDED WITH TWO STANDBY DATABASES AFTER RESTART. Another patch is welcome.

t-ishii

2020-08-05 19:07

developer   ~0003456

Oops. The previous patch forgot to remove standby.signal after recovery. Attached is a patch to fix that.
It should be applied on top of the previous patch.
pgpool_setup_v2.diff (355 bytes)   

wsz

2020-08-05 21:46

reporter   ~0003457

Now it works. But in the patched "pgpool_setup" there are probably lines with duplicated functionality:
lines 427, 428:
touch standby.signal
touch $DEST_CLUSTER/standby.signal

line 453:
rm standby.signal - why not rm $DEST_CLUSTER/standby.signal ?

t-ishii

2020-08-06 09:30

developer   ~0003458

Last edited: 2020-08-06 09:31

> Now it works. But in patched "pgpool_setup" there are probably lines with duplicated functionality:
> lines 427, 428:
> touch standby.signal
> touch $DEST_CLUSTER/standby.signal
They are both necessary. The former is needed when pg_rewind is not used or fails; in that case rsync will copy standby.signal to $DEST_CLUSTER. The latter is used when pg_rewind succeeds.
Keeping only the latter is not good, because when rsync is used it removes the file.

> line 453:
> rm standby.signal - why not rm $DEST_CLUSTER/standby.signal ?
If you do that, $DEST_CLUSTER will start up as a primary node.

Thinking about it more, I realized that I can reduce some of the code by putting:
touch $DEST_CLUSTER/standby.signal
at the end of the generated script.
Attached is the patch for this. It should be applied on top of pgpool_setup.diff (not pgpool_setup_v2.diff).
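
The standby.signal handling being discussed can be sketched as follows (a hypothetical shell simulation, not the actual pgpool_setup code; pg_rewind and rsync are replaced by stand-ins, and only the signal-file placement logic is modeled):

```shell
#!/bin/sh
# Hypothetical simulation of the standby.signal handling described above.
set -e
base=$(mktemp -d)
SRC_CLUSTER="$base/data1"    # stands in for the current primary
DEST_CLUSTER="$base/data0"   # stands in for the node being recovered
mkdir "$SRC_CLUSTER" "$DEST_CLUSTER"
echo 12 > "$SRC_CLUSTER/PG_VERSION"

pg_rewind_ok=false           # assume pg_rewind failed on this run
if $pg_rewind_ok; then
    : # pg_rewind would have rewound $DEST_CLUSTER in place
else
    # Fall back to a full copy from the primary (stands in for rsync).
    rm -rf "$DEST_CLUSTER"
    cp -r "$SRC_CLUSTER" "$DEST_CLUSTER"
fi

# The v3 simplification: unconditionally mark the recovered node as a
# standby at the end, so it never boots up as a second primary.
touch "$DEST_CLUSTER/standby.signal"
ls "$DEST_CLUSTER"
```

The point of the v3 change is that the touch happens once, at the end, regardless of which recovery path ran.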

pgpool_setup_v3.diff (734 bytes)   

wsz

2020-08-07 00:12

reporter   ~0003459

It works OK with the original pgpool_setup once the pgpool_setup.diff and pgpool_setup_v3.diff patches are applied.

Just for testing, I disabled sockets in the data0 and data1 configurations (useful for installations where the application server is on a separate host from Pgpool-II and Postgres) and adjusted backend_hostname[01] in pgpool.conf to use an IP address instead of sockets. After updating basebackup.sh and myrecovery.conf (the file and the related part of basebackup.sh), adding "host=127.0.0.1" to:

$pg_rewind -P -D $DEST_CLUSTER --source-server="host=127.0.0.1 port=$PORT user=$PGSUPERUSER dbname=postgres" >> $log 2>&1
primary_conninfo = 'host=127.0.0.1 port=$PORT user=$PGSUPERUSER application_name=''server$recovery_node'''

It worked well.
Of course, 127.0.0.1 should be replaced with values taken from variables in a real implementation. What about such an extension of pgpool_setup and the generated basebackup.sh and myrecovery.conf scripts? It would be more universal.

t-ishii

2020-08-07 12:14

developer   ~0003460

Thanks. Patch committed.

> Of course, 127.0.0.1 should be replaced with values taken from variables in a real implementation.
In addition to this, every line including $DEST should be adjusted, e.g.

rm -fr $DEST_CLUSTER/pg_xlog
should be changed to:
ssh hostname rm -fr $DEST_CLUSTER/pg_xlog

> What about such extension of pgpool_setup and created basebackup.sh, myrecovery.conf scripts? It will be more universal.
But we already have more production ready examples:
http://localhost/~t-ishii/pgpool-II/html/example-cluster.html#EXAMPLE-CLUSTER-PGPOOL-CONFIG

What's the benefit of extending pgpool_setup to produce a production-ready example?

wsz

2020-08-07 17:40

reporter   ~0003461

It is probably https://www.pgpool.net/docs/latest/en/html/examples.html (not on your localhost).
I can't see an example of the basebackup.sh script there.

Another question: who is responsible for maintaining the rpm distributions from https://www.pgpool.net/mediawiki/index.php/Downloads or, e.g., from the CentOS distro?
Just to correct bugs like:
$ pcp_recovery_node -p 11001 -n 0
Password:
ERROR: connection to socket "/tmp/.s.PGSQL.11001" failed with error "No such file or directory"

This one could probably be fixed by detecting the OS during rpm creation from sources.

t-ishii

2020-08-07 18:45

developer   ~0003462

Oops. Yes, it is:
https://www.pgpool.net/docs/latest/en/html/examples.html
In the doc, basebackup.sh is called "recovery_1st_stage".

> This one could probably be fixed by detecting the OS during rpm creation from sources.
It's already fixed:
https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=d7c7578148f0026bbe439f8920c62cb809206466
It will be included in the next RPM release (supposed to be released on August 20).

Issue History

Date Modified Username Field Change
2020-07-08 23:50 wsz New Issue
2020-07-08 23:50 wsz Tag Attached: failover
2020-07-08 23:50 wsz Tag Attached: master slave
2020-07-08 23:50 wsz Tag Attached: pcp commands
2020-07-08 23:50 wsz Tag Attached: pcp_recovery_node
2020-07-09 11:47 t-ishii Note Added: 0003429
2020-07-09 11:48 t-ishii Assigned To => t-ishii
2020-07-09 11:48 t-ishii Status new => feedback
2020-07-09 17:28 wsz File Added: data0-Wednesday.log
2020-07-09 17:28 wsz File Added: data1-Wednesday.log
2020-07-09 17:28 wsz Note Added: 0003430
2020-07-09 17:28 wsz Status feedback => assigned
2020-07-10 09:59 t-ishii Note Added: 0003431
2020-07-19 20:21 t-ishii Note Added: 0003432
2020-07-19 20:22 t-ishii Status assigned => feedback
2020-07-29 18:41 wsz Note Added: 0003444
2020-07-29 18:41 wsz Status feedback => assigned
2020-07-29 18:54 wsz Note Added: 0003445
2020-07-30 17:40 t-ishii Note Added: 0003446
2020-07-30 17:41 t-ishii Status assigned => feedback
2020-07-31 23:09 wsz Note Added: 0003447
2020-07-31 23:09 wsz Status feedback => assigned
2020-08-01 00:10 wsz File Added: Friday-pgpool.log
2020-08-01 00:10 wsz File Added: data0-Friday.log
2020-08-01 00:10 wsz File Added: data1-Friday.log
2020-08-01 00:10 wsz Note Added: 0003448
2020-08-04 08:21 t-ishii Note Added: 0003451
2020-08-04 08:21 t-ishii Status assigned => feedback
2020-08-04 17:24 wsz Note Added: 0003452
2020-08-04 17:24 wsz Status feedback => assigned
2020-08-04 21:39 t-ishii Note Added: 0003453
2020-08-04 21:40 t-ishii Note Edited: 0003453
2020-08-05 09:26 t-ishii File Added: pgpool_setup.diff
2020-08-05 09:26 t-ishii Note Added: 0003454
2020-08-05 09:26 t-ishii Status assigned => feedback
2020-08-05 10:53 t-ishii Target Version => 4.1.3
2020-08-05 18:35 wsz Note Added: 0003455
2020-08-05 18:35 wsz Status feedback => assigned
2020-08-05 19:07 t-ishii File Added: pgpool_setup_v2.diff
2020-08-05 19:07 t-ishii Note Added: 0003456
2020-08-05 21:46 wsz Note Added: 0003457
2020-08-06 09:30 t-ishii File Added: pgpool_setup_v3.diff
2020-08-06 09:30 t-ishii Note Added: 0003458
2020-08-06 09:31 t-ishii Note Edited: 0003458
2020-08-07 00:12 wsz Note Added: 0003459
2020-08-07 12:14 t-ishii Note Added: 0003460
2020-08-07 17:40 wsz Note Added: 0003461
2020-08-07 18:45 t-ishii Note Added: 0003462