[Pgpool-general] Issue with master node selection in pgpool-II-3.0.3-beta1

Mon Feb 21 20:46:26 UTC 2011

Hello,

I am having a problem with selection of master node when I manually
attach slave node (with id #1) which came up after postgresql server reboot.
When I attach the slave, pgpool tries to select master node, and for
some reason reports that previously working master node (with id #0) is no longer
available, marks the master node as unavailable, and selects the newly
attached node (id #1) as master. If after that I manually attach the node with
id #0 back to the pool it becomes a master.

More detailed explanation.

== Setup ==
Two postgresql nodes (running postgresql 9.0.3 on FreeBSD 8.0):
Master node (storage1) and a slave one (storage2).
The system uses streaming replication with the slave node running in hot
standby mode.
pgpool-II-3.0.3-beta1 is installed on storage1 node with the following
configuration file:

======================================================================
zaa:...local/etc# cat pgpool.conf | sort | grep -v '^#' | grep -v '^$'
authentication_timeout = 30
backend_hostname0 = 'storage1'
backend_hostname1 = 'storage2'
backend_port0 = 5432
backend_port1 = 5432
backend_weight0 = 0.01
backend_weight1 = 0.99
black_function_list = 'nextval,setval'
child_idle_limit = 0
child_life_time = 300
child_max_connections = 0
connection_cache = true
connection_life_time = 0
debug_level = 1
enable_pool_hba = false
fail_over_on_backend_error = false
failover_command = ''
health_check_period = 5
health_check_user = 'pgpool'
listen_addresses = 'localhost'
load_balance_mode = true
log_connections = true
log_hostname = false
log_per_node_statement = true
log_standby_delay = 'always'
log_statement = true
master_slave_mode = true
master_slave_sub_mode = 'stream'
max_pool = 5
num_init_children = 2
parallel_mode = false
pid_file_name = '/var/run/pgpool/pgpool.pid'
port = 9999
print_timestamp = true
replication_mode = false
reset_query_list = 'ABORT; RESET ALL; SET SESSION AUTHORIZATION DEFAULT'
socket_dir = '/tmp'
white_function_list = ''
======================================================================

I can successfully connect to local port 9999 and issue DML and select
queries.

== Problem ==

1. I shutdown postgresql on slave node storage2 and see the following
lines in syslog on the node with pgpool:

======================================================================
Feb 21 23:22:33 zaa pgpool: 2011-02-21 23:22:33 DEBUG: pid 53354: detect_stop_postmaster_error: receive admin shutdown error from a node.
Feb 21 23:22:33 zaa pgpool: 2011-02-21 23:22:33 LOG:   pid 53354: postmaster on DB node 1 was shutdown by administrative command
Feb 21 23:22:33 zaa pgpool: 2011-02-21 23:22:33 LOG:   pid 53354: notice_backend_error: 1 fail over request from pid 53354
Feb 21 23:22:33 zaa pgpool: 2011-02-21 23:22:33 DEBUG: pid 52999: failover_handler called
Feb 21 23:22:33 zaa pgpool: 2011-02-21 23:22:33 DEBUG: pid 52999: failover_handler: starting to select new master node
Feb 21 23:22:33 zaa pgpool: 2011-02-21 23:22:33 LOG:   pid 52999: starting degeneration. shutdown host storage2(5432)
Feb 21 23:22:33 zaa pgpool: 2011-02-21 23:22:33 DEBUG: pid 53354: child received shutdown request signal 3
Feb 21 23:22:33 zaa pgpool: 2011-02-21 23:22:33 DEBUG: pid 52999: failover_handler: kill 53354
Feb 21 23:22:33 zaa pgpool: 2011-02-21 23:22:33 DEBUG: pid 52999: failover_handler: kill 53355
Feb 21 23:22:33 zaa pgpool: 2011-02-21 23:22:33 DEBUG: pid 53355: child received shutdown request signal 3
Feb 21 23:22:33 zaa pgpool: 2011-02-21 23:22:33 LOG:   pid 52999: failover_handler: set new master node: 0
...
Feb 21 23:22:34 zaa pgpool: 2011-02-21 23:22:34 DEBUG: pid 52999: starting health checking
Feb 21 23:22:34 zaa pgpool: 2011-02-21 23:22:34 DEBUG: pid 52999: health_check: 0 th DB node status: 2
Feb 21 23:22:34 zaa pgpool: 2011-02-21 23:22:34 DEBUG: pid 52999: health_check: 1 th DB node status: 3
======================================================================

As you can see, pgpool correctly marks storage2 node as being down.

2. After a while I start postgres on node storage2.
3. In order to return the node back into the pool I run the following
command:

pcp_attach_node -d 120 localhost 9898 postgres postgres 1

and see those lines in syslog on storage1:

======================================================================
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 LOG:   pid 53003: send_failback_request: fail back 1 th node request from pid 53003
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: failover_handler called
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: failover_handler: starting to select new master node
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 LOG:   pid 52999: starting fail back. reconnect host storage2(5432)
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: failover_handler: kill 56131
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 56131: child received shutdown request signal 3
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: failover_handler: kill 56132
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 56132: child received shutdown request signal 3
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 LOG:   pid 52999: failover_handler: set new master node: 0
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 LOG:   pid 52999: failback done. reconnect host storage2(5432)
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 56205: I am 56205
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: pool_ssl: SSL requested but SSL support is not available
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 56206: I am 56206
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: s_do_auth: auth kind: 0
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: s_do_auth: parameter status data received
Feb 21 23:28:03 zaa last message repeated 10 times
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: s_do_auth: backend key data received
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: s_do_auth: transaction state: I
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: do_query: kind: T
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: num_fileds: 1
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: do_query: kind: D
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: do_query: kind: C
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: do_query: kind: Z
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: do_query: kind: T
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: num_fileds: 1
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: do_query: kind: D
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: do_query: kind: C
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: do_query: kind: Z
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 LOG:   pid 52999: find_primary_node: primary node id is 0
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: reap_handler called
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: reap_handler: call wait3
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: child 56132 exits with status 0
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: child 56131 exits with status 0
Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: reap_handler: normally exited
Feb 21 23:28:05 zaa pgpool: 2011-02-21 23:28:05 DEBUG: pid 52999: starting health checking
Feb 21 23:28:05 zaa pgpool: 2011-02-21 23:28:05 DEBUG: pid 52999: health_check: 0 th DB node status: 2
Feb 21 23:28:05 zaa pgpool: 2011-02-21 23:28:05 DEBUG: pid 52999: health_check: 1 th DB node status: 1
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56021: do_query: kind: T
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56021: num_fileds: 1
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56021: do_query: kind: D
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56021: do_query: kind: C
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56021: do_query: kind: Z
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56021: do_query: kind: E
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 ERROR: pid 56021: pool_read: read failed (Connection reset by peer)
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 LOG:   pid 56021: notice_backend_error: 0 fail over request from pid 56021
Feb 21 23:28:08 zaa postgres[56090]: [5-1] LOG:  unexpected EOF on client connection
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: failover_handler called
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: failover_handler: starting to select new master node
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 LOG:   pid 52999: starting degeneration. shutdown host storage1(5432)
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: failover_handler: kill 56205
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56205: child received shutdown request signal 3
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: failover_handler: kill 56206
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 LOG:   pid 52999: failover_handler: set new master node: 1
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56206: child received shutdown request signal 3
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 LOG:   pid 52999: failover done. shutdown host storage1(5432)
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56213: I am 56213
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56214: I am 56214
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: pool_ssl: SSL requested but SSL support is not available
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: s_do_auth: auth kind: 0
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: s_do_auth: parameter status data received
Feb 21 23:28:08 zaa last message repeated 10 times
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: s_do_auth: backend key data received
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: s_do_auth: transaction state: I
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: do_query: kind: T
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: num_fileds: 1
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: do_query: kind: D
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: do_query: kind: C
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: do_query: kind: Z
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: do_query: kind: T
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: num_fileds: 1
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: do_query: kind: E
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: do_query: kind: Z
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 LOG:   pid 52999: find_primary_node: do_query returns no rows
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 LOG:   pid 52999: find_primary_node: do_query returns no data
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 LOG:   pid 52999: find_primary_node: primary node id is 1
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: reap_handler called
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: reap_handler: call wait3
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: child 56206 exits with status 0
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: child 56205 exits with status 0
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 ERROR: pid 52999: Child process 56021 was terminated by segmentation fault
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 LOG:   pid 52999: worker child 56021 exits with status 11 by signal 11
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 LOG:   pid 52999: fork a new worker child pid 56216
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: reap_handler: normally exited
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56216: I am 56216
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56216: pool_ssl: SSL requested but SSL support is not available
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56216: s_do_auth: auth kind: 0
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56216: s_do_auth: parameter status data received
Feb 21 23:28:08 zaa last message repeated 10 times
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56216: s_do_auth: backend key data received
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56216: s_do_auth: transaction state: I
Feb 21 23:28:10 zaa pgpool: 2011-02-21 23:28:10 DEBUG: pid 52999: starting health checking
Feb 21 23:28:10 zaa pgpool: 2011-02-21 23:28:10 DEBUG: pid 52999: health_check: 0 th DB node status: 3
Feb 21 23:28:10 zaa pgpool: 2011-02-21 23:28:10 DEBUG: pid 52999: health_check: 1 th DB node status: 1
======================================================================

As you can see, at first pgpool correctly chooses storage1 node as master, but
then for some reason it gets information about the master node (storage1)
unavailability:

======================================================================
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 ERROR: pid 56021: pool_read: read failed (Connection reset by peer)
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 LOG:   pid 56021: notice_backend_error: 0 fail over request from pid 56021
Feb 21 23:28:08 zaa postgres[56090]: [5-1] LOG:  unexpected EOF on client connection
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: failover_handler called
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: failover_handler: starting to select new master node
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 LOG:   pid 52999: starting degeneration. shutdown host storage1(5432)
======================================================================

slave node storage2 becomes the master node and health check shows storage1
node as unavailable.

If I run "pcp_attach_node -d 120 localhost 9898 postgres postgres 0" command,
storage1 node again becomes the master node as it should be. 

Could someone please let me know what might be causing the issue? Is this a bug
or I am doing something wrong?

PS. One more interesting issue is presence of those two lines:
======================================================================
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 ERROR: pid 52999: Child process 56021 was terminated by segmentation fault
Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 LOG:   pid 52999: worker child 56021 exits with status 11 by signal 11
======================================================================

-- 
Alexander Zhuravlev
http://ikato.com/