[pgpool-general: 2944] Re: Pgpool cluster with PostgreSQL 9.3

Alexandru Cardaniuc cardaniuc at gmail.com
Sun Jun 22 15:05:12 JST 2014


Also to add: with this configuration and setup, while testing failover by failing the
master database, pgpool correctly identified the master as failed and detached it from
the cluster. But all writes then failed on the slave, since the trigger file was never
created and the slave was not promoted to master (it was still in read-only state).
For some reason,
failover_command = '/store/pgdata/failover_stream.sh %d %P %H %R postgresql.trigger'
was not executed.
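
A quick way to confirm what state the standby ended up in after pgpool reported the
failover (just a minimal check, assuming the standby is 10.0.90.12 and the postgres
user can connect, as in the configs below):

# pg_is_in_recovery() returns 't' while the node is still a read-only standby
# and 'f' once it has actually been promoted to a read/write master.
psql -h 10.0.90.12 -p 5432 -U postgres -At -c "SELECT pg_is_in_recovery();"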

Normally, I expect to see something like this in the logs:
2014-06-03 05:25:33 ERROR: pid 2970: make_persistent_db_connection: connection to 10.0.90.11(5432) failed
2014-06-03 05:25:33 ERROR: pid 2970: health check failed. 0 th host 10.0.90.11 at port 5432 is down
2014-06-03 05:25:33 LOG:   pid 2970: set 0 th backend down status
2014-06-03 05:25:33 LOG:   pid 2970: wd_start_interlock: start interlocking
2014-06-03 05:25:33 LOG:   pid 2970: wd_assume_lock_holder: become a new lock holder
2014-06-03 05:25:37 LOG:   pid 17705: connection received: host=10.0.97.1 port=57218
2014-06-03 05:25:37 ERROR: pid 17705: connect_inet_domain_socket: getsockopt() detected error: Connection refused
2014-06-03 05:25:37 ERROR: pid 17705: connection to 10.0.90.11(5432) failed
2014-06-03 05:25:37 ERROR: pid 17705: new_connection: create_cp() failed
2014-06-03 05:25:37 LOG:   pid 17705: new_connection: do not failover because fail_over_on_backend_error is off
2014-06-03 05:25:44 ERROR: pid 2970: wd_start_interlock: timed out
2014-06-03 05:25:44 LOG:   pid 2970: starting degeneration. shutdown host 10.0.90.11(5432)
2014-06-03 05:25:44 LOG:   pid 2970: Restart all children
2014-06-03 05:25:44 LOG:   pid 2970: execute command: /store/pgdata/failover_stream.sh 0 0 10.0.90.12 /store/pgdata postgresql.trigger
+ FAILED_NODE=0
+ OLD_MASTER=0
+ NEW_MASTER=10.0.90.12
+ PGDATA=/store/pgdata
+ TRIGGER_FILE=postgresql.trigger
+ '[' 0 = 0 ']'
+ /usr/bin/ssh -T 10.0.90.12 /bin/touch /store/pgdata/postgresql.trigger
+ exit 0
2014-06-03 05:25:44 LOG:   pid 2970: find_primary_node_repeatedly: waiting for finding a primary node
2014-06-03 05:25:47 LOG:   pid 2970: find_primary_node: primary node id is 1
2014-06-03 05:25:47 LOG:   pid 2970: starting follow degeneration. shutdown host 10.0.90.11(5432)
2014-06-03 05:25:47 LOG:   pid 2970: failover: 1 follow backends have been degenerated
2014-06-03 05:25:47 LOG:   pid 26259: start triggering follow command.
2014-06-03 05:25:47 LOG:   pid 26259: execute command: echo $(date) host:10.0.90.11, new master id:1, old master id:0 >> /var/log/pgpool/master_failover.log
2014-06-03 05:25:47 LOG:   pid 2970: wd_end_interlock: end interlocking
2014-06-03 05:25:48 LOG:   pid 2970: failover: set new primary node: 1
2014-06-03 05:25:48 LOG:   pid 2970: failover: set new master node: 1
2014-06-03 05:25:48 LOG:   pid 2970: failover done. shutdown host 10.0.90.11(5432)


But in this case, when I shut down 10.0.90.11 (the master) and expect the
failover script to be run, I don't see it being run at all. See the log
snippet below:
2014-06-05 23:38:57 ERROR: pid 2970: health check failed. 0 th host 10.0.90.11 at port 5432 is down
2014-06-05 23:38:57 LOG:   pid 2970: health check retry sleep time: 1 second(s)
2014-06-05 23:38:58 ERROR: pid 2970: s_do_auth: expecting R got E
2014-06-05 23:38:58 ERROR: pid 2970: make_persistent_db_connection: s_do_auth failed
2014-06-05 23:38:58 ERROR: pid 2970: health check failed. 0 th host 10.0.90.11 at port 5432 is down
2014-06-05 23:38:58 LOG:   pid 2970: health check retry sleep time: 1 second(s)
2014-06-05 23:38:59 ERROR: pid 2970: s_do_auth: expecting R got E
2014-06-05 23:38:59 ERROR: pid 2970: make_persistent_db_connection: s_do_auth failed
2014-06-05 23:38:59 ERROR: pid 2970: health check failed. 0 th host 10.0.90.11 at port 5432 is down
2014-06-05 23:38:59 LOG:   pid 2970: health check retry sleep time: 1 second(s)
2014-06-05 23:39:00 LOG:   pid 2982: wd_send_response: WD_STAND_FOR_LOCK_HOLDER received it
2014-06-05 23:39:00 ERROR: pid 2970: s_do_auth: expecting R got E
2014-06-05 23:39:00 ERROR: pid 2970: make_persistent_db_connection: s_do_auth failed
2014-06-05 23:39:00 ERROR: pid 2970: health check failed. 0 th host 10.0.90.11 at port 5432 is down
2014-06-05 23:39:00 LOG:   pid 2970: set 0 th backend down status
2014-06-05 23:39:00 LOG:   pid 2970: wd_start_interlock: start interlocking
2014-06-05 23:39:01 LOG:   pid 2970: starting degeneration. shutdown host 10.0.90.11(5432)
2014-06-05 23:39:01 LOG:   pid 2970: Restart all children
2014-06-05 23:39:02 LOG:   pid 2970: find_primary_node_repeatedly: waiting for finding a primary node
2014-06-05 23:39:12 LOG:   pid 2970: failover: no follow backends are degenerated
2014-06-05 23:39:12 LOG:   pid 2970: wd_end_interlock: end interlocking
2014-06-05 23:39:12 LOG:   pid 2970: failover: set new primary node: -1
2014-06-05 23:39:12 LOG:   pid 2970: failover: set new master node: 1
2014-06-05 23:39:12 LOG:   pid 30220: connection received: host=10.0.97.1 port=45122
2014-06-05 23:39:13 LOG:   pid 30220: pool_send_and_wait: Error or notice message from backend: : DB node id: 1 backend pid: 23115 statement: UPDATE SCGroupRequestStatus SET Status = $1 WHERE Status = $2 and LastUpdateTime < $3 and Owner is NULL message: cannot execute UPDATE in a read-only transaction
2014-06-05 23:39:13 LOG:   pid 30326: do_child: failback event found. restart myself.
2014-06-05 23:39:13 LOG:   pid 30456: do_child: failback event found. restart myself.
2014-06-05 23:39:13 LOG:   pid 30468: do_child: failback event found. restart myself.
2014-06-05 23:39:13 LOG:   pid 2970: failover done. shutdown host 10.0.90.11(5432)
2014-06-05 23:39:13 LOG:   pid 29758: worker process received restart request
2014-06-05 23:39:14 LOG:   pid 29511: pcp child process received restart request
2014-06-05 23:39:14 LOG:   pid 2970: PCP child 29511 exits with status 256 in failover()
2014-06-05 23:39:14 LOG:   pid 2970: fork a new PCP child pid 30470 in failover()
2014-06-05 23:39:14 LOG:   pid 2970: worker child 29758 exits with status 256
2014-06-05 23:39:14 LOG:   pid 2970: fork a new worker child pid 30471
2014-06-05 23:39:25 LOG:   pid 30472: connection received: host=10.0.97.1 port=45123
2014-06-05 23:39:59 LOG:   pid 30466: connection received: host=10.0.70.1 port=50938
2014-06-05 23:40:02 LOG:   pid 30220: pool_send_and_wait: Error or notice message from backend: : DB node id: 1 backend pid: 23115 statement: UPDATE SCGroupRequestStatus SET Status = $1 WHERE Status = $2 and LastUpdateTime < $3 and Owner is NULL message: cannot execute UPDATE in a read-only transaction
2014-06-05 23:41:02 LOG:   pid 30220: pool_send_and_wait: Error or notice message from backend: : DB node id: 1 backend pid: 23115 statement: UPDATE SCGroupRequestStatus SET Status = $1 WHERE Status = $2 and LastUpdateTime < $3 and Owner is NULL message: cannot execute UPDATE in a read-only transaction


As you can see, the node was detached from the cluster correctly, but the
failover script was NOT executed. So the trigger file was not created, and the
slave was not promoted to master.

Why is that? Please see the configurations in the previous email.
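
For reference, this is how I check which node states pgpool itself reports after
the failover (assuming clients can still reach pgpool on the delegate IP 10.0.90.1
and port 9999 from the configs below):

# The detached backend (node 0) should no longer be reported as up here,
# and only the surviving node should be accepting connections.
psql -h 10.0.90.1 -p 9999 -U postgres -c "SHOW pool_nodes;"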

Also, this is the failover script:

#!/bin/bash -x
FAILED_NODE=$1         # %d  id of the failed node
OLD_MASTER=$2          # %P  old primary node id
NEW_MASTER=$3          # %H  host name of the new master
PGDATA=$4              # %R  database cluster path of the new master
TRIGGER_FILE=$5        # trigger file name, passed literally

SSH_OPTIONS='-T -o StrictHostKeyChecking=no'

# Only promote when the node that failed was the old primary.
if [ "$FAILED_NODE" = "$OLD_MASTER" ]; then
        /usr/bin/ssh $SSH_OPTIONS "$NEW_MASTER" /bin/touch "$PGDATA/$TRIGGER_FILE"
fi
exit 0
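
For what it's worth, the log from the working run above shows the exact invocation
pgpool used, so the script can also be re-run by hand with the same arguments to
rule out SSH or permission problems:

# Same arguments pgpool passed on 2014-06-03 (failed node 0, old primary 0, new master 10.0.90.12):
/store/pgdata/failover_stream.sh 0 0 10.0.90.12 /store/pgdata postgresql.trigger

# Then check that the trigger file actually appeared on the standby:
/usr/bin/ssh -T 10.0.90.12 ls -l /store/pgdata/postgresql.trigger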


Sincerely,
Alexandru



On Sat, Jun 21, 2014 at 10:49 PM, Alexandru Cardaniuc <cardaniuc at gmail.com>
wrote:

>
> Hi All,
>
> I need some validation of the configuration I did for a cluster of
> 2 pgpools with a cluster of 2 PostgreSQL 9.3 databases. I went through
> all the documentation, but in some places it's not completely clear, so
> I am not sure whether I misunderstood some things.
>
> Here's the info.
>
> Pgpool version: 3.3.3 and using watchdog (2 pgpools: 1 primary, 1
> standby)
>
> PostgreSQL version 9.3 (cluster of 2 databases with streaming
> replication and hot standby configured).
>
> This is my relevant PostgreSQL 9.3 config:
> ------------------------------------------
> listen_addresses = '*'
> port = 5432
> max_connections = 550
> superuser_reserved_connections = 3
> wal_level = hot_standby
> checkpoint_segments = 10
> checkpoint_completion_target = 0.9
> max_wal_senders = 10
> wal_keep_segments = 100
> hot_standby = on
> ------------------------------------------
> The documentation states that:
> "To enable read-only queries on a standby server, wal_level must be set
> to hot_standby on the primary, and hot_standby must be enabled in the
> standby."
>
> I use the same config above on both the primary and the secondary database,
> with the assumption that 'wal_level = hot_standby' is ignored on the slave
> and 'hot_standby = on' is ignored on the master.
>
> Is that a correct assumption? That assumption shouldn't create any
> issues during pgpool/postgres failover?
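>
> (As a quick sanity check on that assumption, a sketch assuming the postgres
> user can connect directly to both 10.0.90.11 and 10.0.90.12: compare the
> effective settings on each node.)
>
> for host in 10.0.90.11 10.0.90.12; do
>   echo "== $host =="
>   # wal_level matters on the primary; hot_standby is only honoured while in recovery
>   psql -h $host -p 5432 -U postgres -At -c "SHOW wal_level;"
>   psql -h $host -p 5432 -U postgres -At -c "SHOW hot_standby;"
> done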
>
>
> Also, about the pgpool config. This is the pgpool config for the 1st
> pgpool:
> --------------------------------------------------------------------
> listen_addresses = '*'
> port = 9999
> socket_dir = '/var/run/pgpool'
> pcp_port = 9898
> pcp_socket_dir = '/var/run/pgpool'
> num_init_children = 250
> child_life_time = 300
> child_max_connections = 0
> client_idle_limit = 0
> enable_pool_hba = false
> pool_passwd = 'pool_passwd'
> authentication_timeout = 60
> print_timestamp = true
> log_connections = true
> log_hostname = false
> log_statement = true
> log_per_node_statement = false
> debug_level = 0
> pid_file_name = '/var/run/pgpool/pgpool.pid'
> logdir = '/var/run/pgpool'
> connection_cache = false
> health_check_period = 60
> health_check_timeout = 15
> health_check_user = 'postgres'
> health_check_password = 'password'
> health_check_max_retries = 3
> health_check_retry_delay = 1
> search_primary_node_timeout = 10
> failover_command = '/store/pgdata/failover_stream.sh %d %P %H %R postgresql.trigger'
> failback_command = 'echo $(date) host:%h, new master id:%m, old master id:%M >> /var/log/pgpool/failback.log'
> follow_master_command = 'echo $(date) host:%h, new master id:%m, old master id:%M >> /var/log/pgpool/master_failover.log'
> fail_over_on_backend_error = false
> ignore_leading_white_space = true
> backend_hostname0 = '10.0.90.11'
> backend_port0 = 5432
> backend_weight0 = 1
> backend_data_directory0 = '/store/pgdata'
> backend_flag0 = 'ALLOW_TO_FAILOVER'
> backend_hostname1 = '10.0.90.12'
> backend_port1 = 5432
> backend_weight1 = 1
> backend_data_directory1 = '/store/pgdata'
> backend_flag1 = 'ALLOW_TO_FAILOVER'
> ssl = false
> max_pool = 1
> connection_life_time = 0
> reset_query_list = 'ABORT; DISCARD ALL'
> replication_mode = false
> replication_stop_on_mismatch = true
> replicate_select = false
> insert_lock = true
> recovery_user = 'postgres'
> recovery_password = 'password'
> recovery_1st_stage_command = 'base_backup.sh'
> recovery_2nd_stage_command = ''
> recovery_timeout = 120
> client_idle_limit_in_recovery = 1
> lobj_lock_table = ''
> master_slave_mode = true
> load_balance_mode = true
> master_slave_sub_mode = 'stream'
> delay_threshold = 0
> sr_check_period = 0
> sr_check_user = 'postgres'
> sr_check_password = 'password'
> log_standby_delay = 'none'
> parallel_mode = false
> system_db_hostname = 'localhost'
> system_db_port = 5432
> system_db_dbname = 'pgpool'
> system_db_schema = 'pgpool_catalog'
> system_db_user = 'pgpool'
> system_db_password = ''
> pgpool2_hostname = ''
> memory_cache_enabled = off
> use_watchdog = on
> wd_hostname = '10.0.90.11'
> wd_port = 9000
> delegate_IP = '10.0.90.1'
> ifconfig_path = '/store/pgdata/sbin'
> if_up_cmd = 'ifconfig eth0:0 inet $_IP_$ netmask 255.255.0.0'
> if_down_cmd = 'ifconfig eth0:0 down'
> arping_path = '/store/pgdata/sbin'
> arping_cmd = 'arping -U $_IP_$ -w 1'
> wd_lifecheck_method = 'heartbeat'
> wd_interval = 10
> wd_heartbeat_port = 9694
> wd_heartbeat_keepalive = 2
> heartbeat_destination0 = '10.0.90.12'
> heartbeat_destination_port0 = 9694
> other_pgpool_hostname0 = '10.0.90.12'
> other_pgpool_port0 = 9999
> other_wd_port0 = 9000
> --------------------------------------------------------------------
>
> second pgpool config file:
> --------------------------------------------------------------------
> listen_addresses = '*'
> port = 9999
> socket_dir = '/var/run/pgpool'
> pcp_port = 9898
> pcp_socket_dir = '/var/run/pgpool'
> num_init_children = 250
> child_life_time = 300
> child_max_connections = 0
> client_idle_limit = 0
> enable_pool_hba = false
> pool_passwd = 'pool_passwd'
> authentication_timeout = 60
> print_timestamp = true
> log_connections = true
> log_hostname = false
> log_statement = true
> log_per_node_statement = false
> debug_level = 0
> pid_file_name = '/var/run/pgpool/pgpool.pid'
> logdir = '/var/run/pgpool'
> connection_cache = false
> health_check_period = 60
> health_check_timeout = 15
> health_check_user = 'postgres'
> health_check_password = 'password'
> health_check_max_retries = 3
> health_check_retry_delay = 1
> search_primary_node_timeout = 10
> failover_command = '/store/pgdata/failover_stream.sh %d %P %H %R postgresql.trigger'
> failback_command = 'echo $(date) host:%h, new master id:%m, old master id:%M >> /var/log/pgpool/failback.log'
> follow_master_command = 'echo $(date) host:%h, new master id:%m, old master id:%M >> /var/log/pgpool/master_failover.log'
> fail_over_on_backend_error = false
> ignore_leading_white_space = true
> backend_hostname0 = '10.0.90.11'
> backend_port0 = 5432
> backend_weight0 = 1
> backend_data_directory0 = '/store/pgdata'
> backend_flag0 = 'ALLOW_TO_FAILOVER'
> backend_hostname1 = '10.0.90.12'
> backend_port1 = 5432
> backend_weight1 = 1
> backend_data_directory1 = '/store/pgdata'
> backend_flag1 = 'ALLOW_TO_FAILOVER'
> ssl = false
> max_pool = 1
> connection_life_time = 0
> reset_query_list = 'ABORT; DISCARD ALL'
> replication_mode = false
> replication_stop_on_mismatch = true
> replicate_select = false
> insert_lock = true
> recovery_user = 'postgres'
> recovery_password = 'password'
> recovery_1st_stage_command = 'base_backup.sh'
> recovery_2nd_stage_command = ''
> recovery_timeout = 120
> client_idle_limit_in_recovery = 1
> lobj_lock_table = ''
> master_slave_mode = true
> load_balance_mode = true
> master_slave_sub_mode = 'stream'
> delay_threshold = 0
> sr_check_period = 0
> sr_check_user = 'postgres'
> sr_check_password = 'password'
> log_standby_delay = 'none'
> parallel_mode = false
> system_db_hostname = 'localhost'
> system_db_port = 5432
> system_db_dbname = 'pgpool'
> system_db_schema = 'pgpool_catalog'
> system_db_user = 'pgpool'
> system_db_password = ''
> pgpool2_hostname = ''
> memory_cache_enabled = off
> use_watchdog = on
> wd_hostname = '10.0.90.12'
> wd_port = 9000
> delegate_IP = '10.0.90.1'
> ifconfig_path = '/store/pgdata/sbin'
> if_up_cmd = 'ifconfig eth0:0 inet $_IP_$ netmask 255.255.0.0'
> if_down_cmd = 'ifconfig eth0:0 down'
> arping_path = '/store/pgdata/sbin'
> arping_cmd = 'arping -U $_IP_$ -w 1'
> wd_lifecheck_method = 'heartbeat'
> wd_interval = 10
> wd_heartbeat_port = 9694
> wd_heartbeat_keepalive = 2
> heartbeat_destination0 = '10.0.90.11'
> heartbeat_destination_port0 = 9694
> other_pgpool_hostname0 = '10.0.90.11'
> other_pgpool_port0 = 9999
> other_wd_port0 = 9000
> --------------------------------------------------------------------
>
> Is anything wrong or missing in the configs? I need a cluster of pgpools
> with watchdog and health checks monitoring a cluster of PostgreSQL
> databases with a streaming replication setup.
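>
> (A rough way to confirm that streaming replication itself is healthy before
> exercising pgpool failover, assuming the postgres superuser can reach both
> nodes directly:)
>
> # On the primary: one row per connected standby, state should be 'streaming'
> psql -h 10.0.90.11 -p 5432 -U postgres -x -c "SELECT * FROM pg_stat_replication;"
>
> # On the standby: timestamp of the last replayed transaction (rough lag indicator)
> psql -h 10.0.90.12 -p 5432 -U postgres -At -c "SELECT pg_last_xact_replay_timestamp();"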
>
> Any help in review or suggestions based on your knowledge or experience
> will be appreciated.
>
> Sincerely,
> Alexandru
>
> --
> "A journey of a thousand miles begins with a single step."
> - Ancient Proverb
>



-- 
Sincerely yours,
Alexandru Cardaniuc