[pgpool-general: 254] Pgpool freezes after pcp_attach_node command during database "warm up"
Aleksej Trofimov
aleksej.trofimov at ruptela.lt
Wed Feb 29 18:34:25 JST 2012
Hello,
Sorry if my question would be stupid or not in a common form for this
mailing list, because I am a new one =)
My problem is that pgpool freezes and not accepting connection after
some "coincidences" during pcp_attach_node command.
Let me start from server configuration:
We have 2 postgres 9.1 database servers (slave and master mode with hot
standby/streaming replication) and 1 pgpool node with pgpool-II version
3.1 (hatsuiboshi).
Pgpool configuration aspects:
backend_flag0 = 'DISALLOW_TO_FAILOVER' #Testing purposes
backend_weight0 = 0 #We are using
load_balance mode for read/write separation. Only write queries
allowed on master
backend_flag1 = 'ALLOW_TO_FAILOVER'
backend_weight1 = 1 # All read
queries are here
child_life_time = 300
child_max_connections = 20
connection_life_time = 0
client_idle_limit = 0
connection_cache = true
load_balance_mode = true
ignore_leading_white_space = on
master_slave_mode = true
master_slave_sub_mode = 'stream'
sr_check_period = 30
delay_threshold = 3145728
health_check_period = 10
health_check_timeout = 20
fail_over_on_backend_error = off
Every night we are doing slave database backup using 2 step rsync copy.
During the second stage backup we detaching slave database from pool:
#Backup script on slave database, stop section
ssh $POOLIP "pcp_detach_node $PCP $NODEID"
service postgresql-9.1 stop
#Postgres log from slave database after stop command
2012-02-29 03:42:58 EETLOG: received fast shutdown request
2012-02-29 03:42:58 EETLOG: aborting any active transactions
2012-02-29 03:42:58 EETFATAL: terminating walreceiver process due to
administrator command
2012-02-29 03:43:01 EETLOG: shutting down
2012-02-29 03:43:01 EETLOG: database system is shut down
#Pgpool log
2012-02-29 03:42:57 LOG: pid 21727: starting degeneration. shutdown
host slave.DB(5432)
2012-02-29 03:42:57 LOG: pid 21727: Restart all children
2012-02-29 03:42:57 LOG: pid 21727: find_primary_node_repeatedly:
waiting for finding a primary node
2012-02-29 03:42:58 LOG: pid 21727: find_primary_node: primary node id
is 0
2012-02-29 03:42:58 LOG: pid 21727: failover: set new primary node: 0
2012-02-29 03:42:58 LOG: pid 21727: failover: set new master node: 0
2012-02-29 03:42:58 LOG: pid 3528: do_child: failback event found.
restart myself.
2012-02-29 03:42:58 LOG: pid 3552: do_child: failback event found.
restart myself.
2012-02-29 03:42:58 LOG: pid 3558: do_child: failback event found.
restart myself.
2012-02-29 03:42:58 LOG: pid 3603: do_child: failback event found.
restart myself.
2012-02-29 03:42:58 LOG: pid 3635: do_child: failback event found.
restart myself.
2012-02-29 03:42:58 LOG: pid 21727: failover done. shutdown host
slave.DB(5432)
2012-02-29 03:42:59 LOG: pid 3268: pcp child process received restart
request
2012-02-29 03:42:59 LOG: pid 21727: PCP child 3268 exits with status 256
2012-02-29 03:42:59 LOG: pid 21727: fork a new PCP child pid 3649
2012-02-29 03:43:17 ERROR: pid 3267: connect_inet_domain_socket:
connect() failed: Connection refused
2012-02-29 03:43:17 ERROR: pid 3267: make_persistent_db_connection:
connection to slave.DB(5432) failed
2012-02-29 03:43:17 ERROR: pid 3267: check_replication_time_lag: could
not connect to DB node 1, check health check configuration
2012-02-29 03:43:47 ERROR: pid 3267: connect_inet_domain_socket:
connect() failed: Connection refused
2012-02-29 03:43:47 ERROR: pid 3267: make_persistent_db_connection:
connection to slave.DB(5432) failed
2012-02-29 03:43:47 ERROR: pid 3267: check_replication_time_lag: could
not connect to DB node 1, check health check configuration
#AND A LOT OF connect_inet_domain_socket: connect() failed errors till start
Everything works great till the backup complete.
#Backup script on slave database, start section
service postgresql-9.1 start
ssh $POOLIP "pcp_attach_node $PCP $NODEID"
#Postgresql log
2012-02-29 05:45:10 EETLOG: database system was shut down in recovery
at 2012-02-29 03:43:01 EET
2012-02-29 05:45:10 EETLOG: entering standby mode
2012-02-29 05:45:10 EETLOG: redo starts at 3C1/BAF74AC0
2012-02-29 05:45:22 EETLOG: consistent recovery state reached at
3C1/C22C4A58
2012-02-29 05:45:22 EETLOG: invalid record length at 3C1/C22C4A58
2012-02-29 05:45:22 EETLOG: database system is ready to accept read
only connections
2012-02-29 05:45:23 EETLOG: streaming replication successfully
connected to primary
#Pgpool log
2012-02-29 05:45:08 LOG: pid 3649: send_failback_request: fail back 1
th node request from pid 3649
2012-02-29 05:45:08 LOG: pid 21727: starting fail back. reconnect host
slave.DB(5432)
2012-02-29 05:45:08 LOG: pid 21727: execute command:
/usr/local/etc/failback.sh 1 slave.DB 0 0
2012-02-29 05:45:08 LOG: pid 21727: Do not restart children because we
are failbacking node id 1 hostslave.DB port:5432 and we are in streaming
replication mode
2012-02-29 05:45:08 LOG: pid 21727: find_primary_node_repeatedly:
waiting for finding a primary node
2012-02-29 05:45:08 LOG: pid 21727: find_primary_node: primary node id
is 0
2012-02-29 05:45:08 LOG: pid 21727: failover: set new primary node: 0
2012-02-29 05:45:08 LOG: pid 21727: failover: set new master node: 0
2012-02-29 05:45:08 LOG: pid 21727: failback done. reconnect host
slave.DB(5432)
2012-02-29 05:45:08 LOG: pid 3267: worker process received restart request
2012-02-29 05:45:08 LOG: pid 14068: do_child: failback event found.
restart myself.
2012-02-29 05:45:08 LOG: pid 17751: do_child: failback event found.
restart myself.
2012-02-29 05:45:08 LOG: pid 23681: do_child: failback event found.
discard existing connections
2012-02-29 05:45:08 ERROR: pid 23681: connect_inet_domain_socket:
connect() failed: Connection refused
2012-02-29 05:45:08 ERROR: pid 23681: connection to slave.DB(5432) failed
2012-02-29 05:45:08 ERROR: pid 23681: new_connection: create_cp() failed
2012-02-29 05:45:08 LOG: pid 23681: degenerate_backend_set: 1 fail
over request from pid 23681
2012-02-29 05:45:09 LOG: pid 9514: do_child: failback event found.
restart myself.
2012-02-29 05:45:09 LOG: pid 21524: do_child: failback event found.
discard existing connections
2012-02-29 05:45:09 ERROR: pid 21524: connect_inet_domain_socket:
connect() failed: Connection refused
2012-02-29 05:45:09 ERROR: pid 21524: connection to slave.DB(5432) failed
2012-02-29 05:45:09 ERROR: pid 21524: new_connection: create_cp() failed
2012-02-29 05:45:09 LOG: pid 21524: degenerate_backend_set: 1 fail
over request from pid 21524
2012-02-29 05:45:09 LOG: pid 21727: starting degeneration. shutdown
host slave.DB(5432)
2012-02-29 05:45:09 LOG: pid 3649: pcp child process received restart
request
2012-02-29 05:45:09 LOG: pid 21727: Restart all children
2012-02-29 05:45:09 LOG: pid 21727: find_primary_node_repeatedly:
waiting for finding a primary node
2012-02-29 05:45:09 LOG: pid 21727: find_primary_node: primary node id
is 0
2012-02-29 05:45:09 LOG: pid 21727: failover: set new primary node: 0
2012-02-29 05:45:09 LOG: pid 21727: failover: set new master node: 0
2012-02-29 05:45:09 LOG: pid 23928: do_child: failback event found.
restart myself.
#Here is 100 other log messages with pid XXXX: do_child: failback event
found. restart myself.
2012-02-29 05:45:09 LOG: pid 21727: failover done. shutdown host
slave.DB(5432)
2012-02-29 05:45:09 LOG: pid 24033: do_child: failback event found.
restart myself.
2012-02-29 05:45:09 LOG: pid 24032: do_child: failback event found.
restart myself.
2012-02-29 05:45:09 LOG: pid 24034: do_child: failback event found.
restart myself.
2012-02-29 05:45:09 LOG: pid 24028: do_child: failback event found.
restart myself.
2012-02-29 05:45:10 LOG: pid 21727: worker child 3267 exits with
status 256
2012-02-29 05:45:10 LOG: pid 21727: fork a new worker child pid 24053
2012-02-29 05:45:10 LOG: pid 21727: PCP child 3649 exits with status 256
2012-02-29 05:45:10 LOG: pid 21727: fork a new PCP child pid 24054
###After this point pgpool freezes and not accepting connections, there
is no log lines after this point until I done pcp_attach_node 1 manually
2012-02-29 07:33:52 LOG: pid 24054: send_failback_request: fail back 1
th node request from pid 24054
2012-02-29 07:33:52 LOG: pid 21727: starting fail back. reconnect host
slave.DB(5432)
###Everything goes normal after this point
As I understood from log files, the main problem is in backup script
start part, after slave database started it took several second for
database to "warm up" but I'm trying to attach node earlier(at 05:45:08)
than database starts(at 05:45:10). But I do not understand why pool
freezes when it finds such a conflict (when it is not possible to attach
node)?
This error only persists when trying to attach node during database
"warm up" process.
P.S two servers are perfectly synchronized in time =)
--
Best regards
Aleksej Trofimov
More information about the pgpool-general
mailing list