[Pgpool-general] Issue with master node selection in pgpool-II-3.0.3-beta1

Thu Feb 24 00:43:59 UTC 2011

Zhuravlev,

I found a bug possibly related to your report.  Pgpool worker process
have a persistent connection to backends to check replication
delay. If a backend goes down and then up, the process tries to keep
using the connection without knowing the fact. I think this is the
cause of the problems you have seen.

The fix was included in 3.0.3, released yesterday (though no mention
about this on the NEWS file).

http://lists.pgfoundry.org/pipermail/pgpool-committers/2011-February/001669.html

Now the process creates everytime when replication delay checked. This
is add a little bit resource consumption, but I couldn't think of any
alternative.

Could you please test 3.0.3 and report back to me if your problems go
away?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

> Zhuravlev,
> 
> I couldn't reproduce your problem here.  Since both
> problems(unexpected failover and segfault) were occured in "worker
> process", I would like to take a backtrace of the process.
> The process looks like:
> 
> 23567 ?        S      0:00 pgpool: worker process
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp
> 
>> Thanks for the report. We are looking into this.
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese: http://www.sraoss.co.jp
>> 
>>> Hello,
>>> 
>>> I am having a problem with selection of master node when I manually
>>> attach slave node (with id #1) which came up after postgresql server reboot.
>>> When I attach the slave, pgpool tries to select master node, and for
>>> some reason reports that previously working master node (with id #0) is no longer
>>> available, marks the master node as unavailable, and selects the newly
>>> attached node (id #1) as master. If after that I manually attach the node with
>>> id #0 back to the pool it becomes a master.
>>> 
>>> More detailed explanation.
>>> 
>>> == Setup ==
>>> Two postgresql nodes (running postgresql 9.0.3 on FreeBSD 8.0):
>>> Master node (storage1) and a slave one (storage2).
>>> The system uses streaming replication with the slave node running in hot
>>> standby mode.
>>> pgpool-II-3.0.3-beta1 is installed on storage1 node with the following
>>> configuration file:
>>> 
>>> ======================================================================
>>> zaa:...local/etc# cat pgpool.conf | sort | grep -v '^#' | grep -v '^$'
>>> authentication_timeout = 30
>>> backend_hostname0 = 'storage1'
>>> backend_hostname1 = 'storage2'
>>> backend_port0 = 5432
>>> backend_port1 = 5432
>>> backend_weight0 = 0.01
>>> backend_weight1 = 0.99
>>> black_function_list = 'nextval,setval'
>>> child_idle_limit = 0
>>> child_life_time = 300
>>> child_max_connections = 0
>>> connection_cache = true
>>> connection_life_time = 0
>>> debug_level = 1
>>> enable_pool_hba = false
>>> fail_over_on_backend_error = false
>>> failover_command = ''
>>> health_check_period = 5
>>> health_check_user = 'pgpool'
>>> listen_addresses = 'localhost'
>>> load_balance_mode = true
>>> log_connections = true
>>> log_hostname = false
>>> log_per_node_statement = true
>>> log_standby_delay = 'always'
>>> log_statement = true
>>> master_slave_mode = true
>>> master_slave_sub_mode = 'stream'
>>> max_pool = 5
>>> num_init_children = 2
>>> parallel_mode = false
>>> pid_file_name = '/var/run/pgpool/pgpool.pid'
>>> port = 9999
>>> print_timestamp = true
>>> replication_mode = false
>>> reset_query_list = 'ABORT; RESET ALL; SET SESSION AUTHORIZATION DEFAULT'
>>> socket_dir = '/tmp'
>>> white_function_list = ''
>>> ======================================================================
>>> 
>>> I can successfully connect to local port 9999 and issue DML and select
>>> queries.
>>> 
>>> == Problem ==
>>> 
>>> 1. I shutdown postgresql on slave node storage2 and see the following
>>> lines in syslog on the node with pgpool:
>>> 
>>> ======================================================================
>>> Feb 21 23:22:33 zaa pgpool: 2011-02-21 23:22:33 DEBUG: pid 53354: detect_stop_postmaster_error: receive admin shutdown error from a node.
>>> Feb 21 23:22:33 zaa pgpool: 2011-02-21 23:22:33 LOG:   pid 53354: postmaster on DB node 1 was shutdown by administrative command
>>> Feb 21 23:22:33 zaa pgpool: 2011-02-21 23:22:33 LOG:   pid 53354: notice_backend_error: 1 fail over request from pid 53354
>>> Feb 21 23:22:33 zaa pgpool: 2011-02-21 23:22:33 DEBUG: pid 52999: failover_handler called
>>> Feb 21 23:22:33 zaa pgpool: 2011-02-21 23:22:33 DEBUG: pid 52999: failover_handler: starting to select new master node
>>> Feb 21 23:22:33 zaa pgpool: 2011-02-21 23:22:33 LOG:   pid 52999: starting degeneration. shutdown host storage2(5432)
>>> Feb 21 23:22:33 zaa pgpool: 2011-02-21 23:22:33 DEBUG: pid 53354: child received shutdown request signal 3
>>> Feb 21 23:22:33 zaa pgpool: 2011-02-21 23:22:33 DEBUG: pid 52999: failover_handler: kill 53354
>>> Feb 21 23:22:33 zaa pgpool: 2011-02-21 23:22:33 DEBUG: pid 52999: failover_handler: kill 53355
>>> Feb 21 23:22:33 zaa pgpool: 2011-02-21 23:22:33 DEBUG: pid 53355: child received shutdown request signal 3
>>> Feb 21 23:22:33 zaa pgpool: 2011-02-21 23:22:33 LOG:   pid 52999: failover_handler: set new master node: 0
>>> ...
>>> Feb 21 23:22:34 zaa pgpool: 2011-02-21 23:22:34 DEBUG: pid 52999: starting health checking
>>> Feb 21 23:22:34 zaa pgpool: 2011-02-21 23:22:34 DEBUG: pid 52999: health_check: 0 th DB node status: 2
>>> Feb 21 23:22:34 zaa pgpool: 2011-02-21 23:22:34 DEBUG: pid 52999: health_check: 1 th DB node status: 3
>>> ======================================================================
>>> 
>>> As you can see, pgpool correctly marks storage2 node as being down.
>>> 
>>> 2. After a while I start postgres on node storage2.
>>> 3. In order to return the node back into the pool I run the following
>>> command:
>>> 
>>> pcp_attach_node -d 120 localhost 9898 postgres postgres 1
>>> 
>>> and see those lines in syslog on storage1:
>>> 
>>> ======================================================================
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 LOG:   pid 53003: send_failback_request: fail back 1 th node request from pid 53003
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: failover_handler called
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: failover_handler: starting to select new master node
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 LOG:   pid 52999: starting fail back. reconnect host storage2(5432)
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: failover_handler: kill 56131
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 56131: child received shutdown request signal 3
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: failover_handler: kill 56132
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 56132: child received shutdown request signal 3
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 LOG:   pid 52999: failover_handler: set new master node: 0
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 LOG:   pid 52999: failback done. reconnect host storage2(5432)
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 56205: I am 56205
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: pool_ssl: SSL requested but SSL support is not available
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 56206: I am 56206
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: s_do_auth: auth kind: 0
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: s_do_auth: parameter status data received
>>> Feb 21 23:28:03 zaa last message repeated 10 times
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: s_do_auth: backend key data received
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: s_do_auth: transaction state: I
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: do_query: kind: T
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: num_fileds: 1
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: do_query: kind: D
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: do_query: kind: C
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: do_query: kind: Z
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: do_query: kind: T
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: num_fileds: 1
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: do_query: kind: D
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: do_query: kind: C
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: do_query: kind: Z
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 LOG:   pid 52999: find_primary_node: primary node id is 0
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: reap_handler called
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: reap_handler: call wait3
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: child 56132 exits with status 0
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: child 56131 exits with status 0
>>> Feb 21 23:28:03 zaa pgpool: 2011-02-21 23:28:03 DEBUG: pid 52999: reap_handler: normally exited
>>> Feb 21 23:28:05 zaa pgpool: 2011-02-21 23:28:05 DEBUG: pid 52999: starting health checking
>>> Feb 21 23:28:05 zaa pgpool: 2011-02-21 23:28:05 DEBUG: pid 52999: health_check: 0 th DB node status: 2
>>> Feb 21 23:28:05 zaa pgpool: 2011-02-21 23:28:05 DEBUG: pid 52999: health_check: 1 th DB node status: 1
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56021: do_query: kind: T
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56021: num_fileds: 1
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56021: do_query: kind: D
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56021: do_query: kind: C
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56021: do_query: kind: Z
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56021: do_query: kind: E
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 ERROR: pid 56021: pool_read: read failed (Connection reset by peer)
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 LOG:   pid 56021: notice_backend_error: 0 fail over request from pid 56021
>>> Feb 21 23:28:08 zaa postgres[56090]: [5-1] LOG:  unexpected EOF on client connection
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: failover_handler called
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: failover_handler: starting to select new master node
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 LOG:   pid 52999: starting degeneration. shutdown host storage1(5432)
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: failover_handler: kill 56205
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56205: child received shutdown request signal 3
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: failover_handler: kill 56206
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 LOG:   pid 52999: failover_handler: set new master node: 1
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56206: child received shutdown request signal 3
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 LOG:   pid 52999: failover done. shutdown host storage1(5432)
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56213: I am 56213
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56214: I am 56214
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: pool_ssl: SSL requested but SSL support is not available
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: s_do_auth: auth kind: 0
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: s_do_auth: parameter status data received
>>> Feb 21 23:28:08 zaa last message repeated 10 times
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: s_do_auth: backend key data received
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: s_do_auth: transaction state: I
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: do_query: kind: T
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: num_fileds: 1
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: do_query: kind: D
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: do_query: kind: C
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: do_query: kind: Z
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: do_query: kind: T
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: num_fileds: 1
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: do_query: kind: E
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: do_query: kind: Z
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 LOG:   pid 52999: find_primary_node: do_query returns no rows
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 LOG:   pid 52999: find_primary_node: do_query returns no data
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 LOG:   pid 52999: find_primary_node: primary node id is 1
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: reap_handler called
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: reap_handler: call wait3
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: child 56206 exits with status 0
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: child 56205 exits with status 0
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 ERROR: pid 52999: Child process 56021 was terminated by segmentation fault
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 LOG:   pid 52999: worker child 56021 exits with status 11 by signal 11
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 LOG:   pid 52999: fork a new worker child pid 56216
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: reap_handler: normally exited
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56216: I am 56216
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56216: pool_ssl: SSL requested but SSL support is not available
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56216: s_do_auth: auth kind: 0
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56216: s_do_auth: parameter status data received
>>> Feb 21 23:28:08 zaa last message repeated 10 times
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56216: s_do_auth: backend key data received
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 56216: s_do_auth: transaction state: I
>>> Feb 21 23:28:10 zaa pgpool: 2011-02-21 23:28:10 DEBUG: pid 52999: starting health checking
>>> Feb 21 23:28:10 zaa pgpool: 2011-02-21 23:28:10 DEBUG: pid 52999: health_check: 0 th DB node status: 3
>>> Feb 21 23:28:10 zaa pgpool: 2011-02-21 23:28:10 DEBUG: pid 52999: health_check: 1 th DB node status: 1
>>> ======================================================================
>>> 
>>> As you can see, at first pgpool correctly chooses storage1 node as master, but
>>> then for some reason it gets information about the master node (storage1)
>>> unavailability:
>>> 
>>> ======================================================================
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 ERROR: pid 56021: pool_read: read failed (Connection reset by peer)
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 LOG:   pid 56021: notice_backend_error: 0 fail over request from pid 56021
>>> Feb 21 23:28:08 zaa postgres[56090]: [5-1] LOG:  unexpected EOF on client connection
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: failover_handler called
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 DEBUG: pid 52999: failover_handler: starting to select new master node
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 LOG:   pid 52999: starting degeneration. shutdown host storage1(5432)
>>> ======================================================================
>>> 
>>> slave node storage2 becomes the master node and health check shows storage1
>>> node as unavailable.
>>> 
>>> If I run "pcp_attach_node -d 120 localhost 9898 postgres postgres 0" command,
>>> storage1 node again becomes the master node as it should be. 
>>> 
>>> Could someone please let me know what might be causing the issue? Is this a bug
>>> or I am doing something wrong?
>>> 
>>> PS. One more interesting issue is presence of those two lines:
>>> ======================================================================
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 ERROR: pid 52999: Child process 56021 was terminated by segmentation fault
>>> Feb 21 23:28:08 zaa pgpool: 2011-02-21 23:28:08 LOG:   pid 52999: worker child 56021 exits with status 11 by signal 11
>>> ======================================================================
>>> 
>>> -- 
>>> Alexander Zhuravlev
>>> http://ikato.com/
>>> _______________________________________________
>>> Pgpool-general mailing list
>>> Pgpool-general at pgfoundry.org
>>> http://pgfoundry.org/mailman/listinfo/pgpool-general
>> _______________________________________________
>> Pgpool-general mailing list
>> Pgpool-general at pgfoundry.org
>> http://pgfoundry.org/mailman/listinfo/pgpool-general
> _______________________________________________
> Pgpool-general mailing list
> Pgpool-general at pgfoundry.org
> http://pgfoundry.org/mailman/listinfo/pgpool-general