[pgpool-general: 254] Pgpool freezes after pcp_attach_node command during database "warm up"

Aleksej Trofimov aleksej.trofimov at ruptela.lt
Wed Feb 29 18:34:25 JST 2012


Hello,
   Sorry if my question would be stupid or not in a common form for this 
mailing list, because I am a new one =)

My problem is that pgpool freezes and not accepting connection after 
some "coincidences" during pcp_attach_node command.

Let me start from server configuration:
We have 2 postgres 9.1 database servers (slave and master mode with hot 
standby/streaming replication) and 1 pgpool node with pgpool-II version 
3.1 (hatsuiboshi).

Pgpool configuration aspects:

backend_flag0 = 'DISALLOW_TO_FAILOVER'   #Testing purposes
backend_weight0 = 0                                      #We are using 
load_balance mode for read/write     separation. Only write queries 
allowed on master

backend_flag1 = 'ALLOW_TO_FAILOVER'
backend_weight1 = 1                                        # All read 
queries are here

child_life_time = 300
child_max_connections = 20
connection_life_time = 0
client_idle_limit = 0

connection_cache = true

load_balance_mode = true
ignore_leading_white_space = on

master_slave_mode = true
master_slave_sub_mode = 'stream'
sr_check_period = 30
delay_threshold = 3145728

health_check_period = 10
health_check_timeout = 20
fail_over_on_backend_error = off

Every night we are doing slave database backup using 2 step rsync copy. 
During the second stage backup we detaching slave database from pool:

#Backup script on slave database, stop section
ssh $POOLIP "pcp_detach_node $PCP $NODEID"
service postgresql-9.1 stop

#Postgres log from slave database after stop command
2012-02-29 03:42:58 EETLOG:  received fast shutdown request
2012-02-29 03:42:58 EETLOG:  aborting any active transactions
2012-02-29 03:42:58 EETFATAL:  terminating walreceiver process due to 
administrator command
2012-02-29 03:43:01 EETLOG:  shutting down
2012-02-29 03:43:01 EETLOG:  database system is shut down

#Pgpool log
2012-02-29 03:42:57 LOG:   pid 21727: starting degeneration. shutdown 
host slave.DB(5432)
2012-02-29 03:42:57 LOG:   pid 21727: Restart all children
2012-02-29 03:42:57 LOG:   pid 21727: find_primary_node_repeatedly: 
waiting for finding a primary node
2012-02-29 03:42:58 LOG:   pid 21727: find_primary_node: primary node id 
is 0
2012-02-29 03:42:58 LOG:   pid 21727: failover: set new primary node: 0
2012-02-29 03:42:58 LOG:   pid 21727: failover: set new master node: 0
2012-02-29 03:42:58 LOG:   pid 3528: do_child: failback event found. 
restart myself.
2012-02-29 03:42:58 LOG:   pid 3552: do_child: failback event found. 
restart myself.
2012-02-29 03:42:58 LOG:   pid 3558: do_child: failback event found. 
restart myself.
2012-02-29 03:42:58 LOG:   pid 3603: do_child: failback event found. 
restart myself.
2012-02-29 03:42:58 LOG:   pid 3635: do_child: failback event found. 
restart myself.
2012-02-29 03:42:58 LOG:   pid 21727: failover done. shutdown host 
slave.DB(5432)
2012-02-29 03:42:59 LOG:   pid 3268: pcp child process received restart 
request
2012-02-29 03:42:59 LOG:   pid 21727: PCP child 3268 exits with status 256
2012-02-29 03:42:59 LOG:   pid 21727: fork a new PCP child pid 3649
2012-02-29 03:43:17 ERROR: pid 3267: connect_inet_domain_socket: 
connect() failed: Connection refused
2012-02-29 03:43:17 ERROR: pid 3267: make_persistent_db_connection: 
connection to slave.DB(5432) failed
2012-02-29 03:43:17 ERROR: pid 3267: check_replication_time_lag: could 
not connect to DB node 1, check health check configuration
2012-02-29 03:43:47 ERROR: pid 3267: connect_inet_domain_socket: 
connect() failed: Connection refused
2012-02-29 03:43:47 ERROR: pid 3267: make_persistent_db_connection: 
connection to slave.DB(5432) failed
2012-02-29 03:43:47 ERROR: pid 3267: check_replication_time_lag: could 
not connect to DB node 1, check health check configuration
#AND A LOT OF connect_inet_domain_socket: connect() failed errors till start

Everything works great till the backup complete.

#Backup script on slave database, start section
service postgresql-9.1 start
ssh $POOLIP "pcp_attach_node $PCP $NODEID"

#Postgresql log
2012-02-29 05:45:10 EETLOG:  database system was shut down in recovery 
at 2012-02-29 03:43:01 EET
2012-02-29 05:45:10 EETLOG:  entering standby mode
2012-02-29 05:45:10 EETLOG:  redo starts at 3C1/BAF74AC0
2012-02-29 05:45:22 EETLOG:  consistent recovery state reached at 
3C1/C22C4A58
2012-02-29 05:45:22 EETLOG:  invalid record length at 3C1/C22C4A58
2012-02-29 05:45:22 EETLOG:  database system is ready to accept read 
only connections
2012-02-29 05:45:23 EETLOG:  streaming replication successfully 
connected to primary

#Pgpool log
2012-02-29 05:45:08 LOG:   pid 3649: send_failback_request: fail back 1 
th node request from pid 3649
2012-02-29 05:45:08 LOG:   pid 21727: starting fail back. reconnect host 
slave.DB(5432)
2012-02-29 05:45:08 LOG:   pid 21727: execute command: 
/usr/local/etc/failback.sh 1 slave.DB 0 0
2012-02-29 05:45:08 LOG:   pid 21727: Do not restart children because we 
are failbacking node id 1 hostslave.DB port:5432 and we are in streaming 
replication mode
2012-02-29 05:45:08 LOG:   pid 21727: find_primary_node_repeatedly: 
waiting for finding a primary node
2012-02-29 05:45:08 LOG:   pid 21727: find_primary_node: primary node id 
is 0
2012-02-29 05:45:08 LOG:   pid 21727: failover: set new primary node: 0
2012-02-29 05:45:08 LOG:   pid 21727: failover: set new master node: 0
2012-02-29 05:45:08 LOG:   pid 21727: failback done. reconnect host 
slave.DB(5432)
2012-02-29 05:45:08 LOG:   pid 3267: worker process received restart request
2012-02-29 05:45:08 LOG:   pid 14068: do_child: failback event found. 
restart myself.
2012-02-29 05:45:08 LOG:   pid 17751: do_child: failback event found. 
restart myself.
2012-02-29 05:45:08 LOG:   pid 23681: do_child: failback event found. 
discard existing connections
2012-02-29 05:45:08 ERROR: pid 23681: connect_inet_domain_socket: 
connect() failed: Connection refused
2012-02-29 05:45:08 ERROR: pid 23681: connection to slave.DB(5432) failed
2012-02-29 05:45:08 ERROR: pid 23681: new_connection: create_cp() failed
2012-02-29 05:45:08 LOG:   pid 23681: degenerate_backend_set: 1 fail 
over request from pid 23681
2012-02-29 05:45:09 LOG:   pid 9514: do_child: failback event found. 
restart myself.
2012-02-29 05:45:09 LOG:   pid 21524: do_child: failback event found. 
discard existing connections
2012-02-29 05:45:09 ERROR: pid 21524: connect_inet_domain_socket: 
connect() failed: Connection refused
2012-02-29 05:45:09 ERROR: pid 21524: connection to slave.DB(5432) failed
2012-02-29 05:45:09 ERROR: pid 21524: new_connection: create_cp() failed
2012-02-29 05:45:09 LOG:   pid 21524: degenerate_backend_set: 1 fail 
over request from pid 21524
2012-02-29 05:45:09 LOG:   pid 21727: starting degeneration. shutdown 
host slave.DB(5432)
2012-02-29 05:45:09 LOG:   pid 3649: pcp child process received restart 
request
2012-02-29 05:45:09 LOG:   pid 21727: Restart all children
2012-02-29 05:45:09 LOG:   pid 21727: find_primary_node_repeatedly: 
waiting for finding a primary node
2012-02-29 05:45:09 LOG:   pid 21727: find_primary_node: primary node id 
is 0
2012-02-29 05:45:09 LOG:   pid 21727: failover: set new primary node: 0
2012-02-29 05:45:09 LOG:   pid 21727: failover: set new master node: 0
2012-02-29 05:45:09 LOG:   pid 23928: do_child: failback event found. 
restart myself.
#Here is 100 other log messages with pid XXXX: do_child: failback event 
found. restart myself.
2012-02-29 05:45:09 LOG:   pid 21727: failover done. shutdown host 
slave.DB(5432)
2012-02-29 05:45:09 LOG:   pid 24033: do_child: failback event found. 
restart myself.
2012-02-29 05:45:09 LOG:   pid 24032: do_child: failback event found. 
restart myself.
2012-02-29 05:45:09 LOG:   pid 24034: do_child: failback event found. 
restart myself.
2012-02-29 05:45:09 LOG:   pid 24028: do_child: failback event found. 
restart myself.
2012-02-29 05:45:10 LOG:   pid 21727: worker child 3267 exits with 
status 256
2012-02-29 05:45:10 LOG:   pid 21727: fork a new worker child pid 24053
2012-02-29 05:45:10 LOG:   pid 21727: PCP child 3649 exits with status 256
2012-02-29 05:45:10 LOG:   pid 21727: fork a new PCP child pid 24054
###After this point pgpool freezes and not accepting connections, there 
is no log lines after this point until I done pcp_attach_node 1 manually
2012-02-29 07:33:52 LOG:   pid 24054: send_failback_request: fail back 1 
th node request from pid 24054
2012-02-29 07:33:52 LOG:   pid 21727: starting fail back. reconnect host 
slave.DB(5432)
###Everything goes normal after this point

As I understood from log files, the main problem is in backup script 
start part, after slave database started it took several second for 
database to "warm up" but I'm trying to attach node earlier(at 05:45:08) 
than database starts(at 05:45:10). But I do not understand why pool 
freezes when it finds such a conflict (when it is not possible to attach 
node)?

This error only persists when trying to attach node during database 
"warm up" process.

P.S two servers are perfectly synchronized in time =)













-- 
Best regards

Aleksej Trofimov



More information about the pgpool-general mailing list