[pgpool-general: 2979] Re: Pgpool cluster with PostgreSQL 9.3

Yugo Nagata nagata at sraoss.co.jp
Thu Jun 26 20:14:17 JST 2014


Hi,

On Sat, 21 Jun 2014 23:05:12 -0700
Alexandru Cardaniuc <cardaniuc at gmail.com> wrote:

> Also to add: for some reason, with this configuration and setup, while
> testing failover by failing the master database, pgpool correctly
> identified the master as failed and detached it from the cluster. But all
> writes failed on the slave, since the trigger file was not created and the
> slave did not become master (it was still in a read-only state). For some
> reason,
> failover_command = '/store/pgdata/failover_stream.sh %d %P %H %R postgresql.trigger'
> was not executed.
> 
> Normally, I expect to see something like this in the logs:
> 2014-06-03 05:25:33 ERROR: pid 2970: make_persistent_db_connection:
> connection to 10.0.90.11(5432) failed
> 2014-06-03 05:25:33 ERROR: pid 2970: health check failed. 0 th host
> 10.0.90.11 at port 5432 is down
> 2014-06-03 05:25:33 LOG:   pid 2970: set 0 th backend down status
> 2014-06-03 05:25:33 LOG:   pid 2970: wd_start_interlock: start interlocking
> 2014-06-03 05:25:33 LOG:   pid 2970: wd_assume_lock_holder: become a new
> lock holder
> 2014-06-03 05:25:37 LOG:   pid 17705: connection received: host=10.0.97.1
> port=57218
> 2014-06-03 05:25:37 ERROR: pid 17705: connect_inet_domain_socket:
> getsockopt() detected error: Connection refused
> 2014-06-03 05:25:37 ERROR: pid 17705: connection to 10.0.90.11(5432) failed
> 2014-06-03 05:25:37 ERROR: pid 17705: new_connection: create_cp() failed
> 2014-06-03 05:25:37 LOG:   pid 17705: new_connection: do not failover
> because fail_over_on_backend_error is off
> 2014-06-03 05:25:44 ERROR: pid 2970: wd_start_interlock: timed out
> 2014-06-03 05:25:44 LOG:   pid 2970: starting degeneration. shutdown host
> 10.0.90.11(5432)
> 2014-06-03 05:25:44 LOG:   pid 2970: Restart all children
> 2014-06-03 05:25:44 LOG:   pid 2970: execute command:
> /store/pgdata/failover_stream.sh 0 0 10.0.90.12 /store/pgdata
> postgresql.trigger
> + FAILED_NODE=0
> + OLD_MASTER=0
> + NEW_MASTER=10.0.90.12
> + PGDATA=/store/pgdata
> + TRIGGER_FILE=postgresql.trigger
> + '[' 0 = 0 ']'
> + /usr/bin/ssh -T 10.0.90.12 /bin/touch /store/pgdata/postgresql.trigger
> + exit 0
> 2014-06-03 05:25:44 LOG:   pid 2970: find_primary_node_repeatedly: waiting
> for finding a primary node
> 2014-06-03 05:25:47 LOG:   pid 2970: find_primary_node: primary node id is 1
> 2014-06-03 05:25:47 LOG:   pid 2970: starting follow degeneration. shutdown
> host 10.0.90.11(5432)
> 2014-06-03 05:25:47 LOG:   pid 2970: failover: 1 follow backends have been
> degenerated
> 2014-06-03 05:25:47 LOG:   pid 26259: start triggering follow command.
> 2014-06-03 05:25:47 LOG:   pid 26259: execute command: echo $(date)
> host:10.0.90.11, new master id:1, old master id:0 >>
> /var/log/pgpool/master_failover.log
> 2014-06-03 05:25:47 LOG:   pid 2970: wd_end_interlock: end interlocking
> 2014-06-03 05:25:48 LOG:   pid 2970: failover: set new primary node: 1
> 2014-06-03 05:25:48 LOG:   pid 2970: failover: set new master node: 1
> 2014-06-03 05:25:48 LOG:   pid 2970: failover done. shutdown host
> 10.0.90.11(5432)
> 
> 
> But in this case, when I shut down 10.0.90.11 (the master) and expected
> the failover script to run, I didn't see it run at all. See the log
> snippet below:

Hmm... it is odd that the failover script isn't executed.
Did you shut down 10.0.90.11 by power-off or with pg_ctl stop?
The repeated "s_do_auth: expecting R got E" errors in your log suggest
pgpool could still reach the backend but received an error response instead
of an authentication request, which points to a shutdown in progress rather
than a hard crash.
Also, is this log from the pgpool on 10.0.90.12?
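If you want to reproduce both cases deliberately, something like this
should do it on the master (a sketch; the data directory is taken from
your configuration):

  # clean shutdown: terminates client connections, then stops normally
  pg_ctl -D /store/pgdata stop -m fast

  # simulated crash: aborts without a clean shutdown, closest to a power-off
  pg_ctl -D /store/pgdata stop -m immediate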

> 2014-06-05 23:38:57 ERROR: pid 2970: health check failed. 0 th host
> 10.0.90.11 at port 5432 is down
> 2014-06-05 23:38:57 LOG:   pid 2970: health check retry sleep time: 1
> second(s)
> 2014-06-05 23:38:58 ERROR: pid 2970: s_do_auth: expecting R got E
> 2014-06-05 23:38:58 ERROR: pid 2970: make_persistent_db_connection:
> s_do_auth failed
> 2014-06-05 23:38:58 ERROR: pid 2970: health check failed. 0 th host
> 10.0.90.11 at port 5432 is down
> 2014-06-05 23:38:58 LOG:   pid 2970: health check retry sleep time: 1
> second(s)
> 2014-06-05 23:38:59 ERROR: pid 2970: s_do_auth: expecting R got E
> 2014-06-05 23:38:59 ERROR: pid 2970: make_persistent_db_connection:
> s_do_auth failed
> 2014-06-05 23:38:59 ERROR: pid 2970: health check failed. 0 th host
> 10.0.90.11 at port 5432 is down
> 2014-06-05 23:38:59 LOG:   pid 2970: health check retry sleep time: 1
> second(s)
> 2014-06-05 23:39:00 LOG:   pid 2982: wd_send_response:
> WD_STAND_FOR_LOCK_HOLDER received it
> 2014-06-05 23:39:00 ERROR: pid 2970: s_do_auth: expecting R got E
> 2014-06-05 23:39:00 ERROR: pid 2970: make_persistent_db_connection:
> s_do_auth failed
> 2014-06-05 23:39:00 ERROR: pid 2970: health check failed. 0 th host
> 10.0.90.11 at port 5432 is down
> 2014-06-05 23:39:00 LOG:   pid 2970: set 0 th backend down status
> 2014-06-05 23:39:00 LOG:   pid 2970: wd_start_interlock: start interlocking
> 2014-06-05 23:39:01 LOG:   pid 2970: starting degeneration. shutdown host
> 10.0.90.11(5432)
> 2014-06-05 23:39:01 LOG:   pid 2970: Restart all children
> 2014-06-05 23:39:02 LOG:   pid 2970: find_primary_node_repeatedly: waiting
> for finding a primary node
> 2014-06-05 23:39:12 LOG:   pid 2970: failover: no follow backends are
> degenerated
> 2014-06-05 23:39:12 LOG:   pid 2970: wd_end_interlock: end interlocking
> 2014-06-05 23:39:12 LOG:   pid 2970: failover: set new primary node: -1
> 2014-06-05 23:39:12 LOG:   pid 2970: failover: set new master node: 1
> 2014-06-05 23:39:12 LOG:   pid 30220: connection received: host=10.0.97.1
> port=45122
> 2014-06-05 23:39:13 LOG:   pid 30220: pool_send_and_wait: Error or notice
> message from backend: : DB node id: 1 backend pid: 23115 statement: UPDATE
> SCGroupRequestStatus SET Status = $1 WHERE Status = $2 and LastUpdateTime <
> $3 and Owner is NULL message: cannot execute UPDATE in a read-only
> transaction
> 2014-06-05 23:39:13 LOG:   pid 30326: do_child: failback event found.
> restart myself.
> 2014-06-05 23:39:13 LOG:   pid 30456: do_child: failback event found.
> restart myself.
> 2014-06-05 23:39:13 LOG:   pid 30468: do_child: failback event found.
> restart myself.
> 2014-06-05 23:39:13 LOG:   pid 2970: failover done. shutdown host
> 10.0.90.11(5432)
> 2014-06-05 23:39:13 LOG:   pid 29758: worker process received restart
> request
> 2014-06-05 23:39:14 LOG:   pid 29511: pcp child process received restart
> request
> 2014-06-05 23:39:14 LOG:   pid 2970: PCP child 29511 exits with status 256
> in failover()
> 2014-06-05 23:39:14 LOG:   pid 2970: fork a new PCP child pid 30470 in
> failover()
> 2014-06-05 23:39:14 LOG:   pid 2970: worker child 29758 exits with status
> 256
> 2014-06-05 23:39:14 LOG:   pid 2970: fork a new worker child pid 30471
> 2014-06-05 23:39:25 LOG:   pid 30472: connection received: host=10.0.97.1
> port=45123
> 2014-06-05 23:39:59 LOG:   pid 30466: connection received: host=10.0.70.1
> port=50938
> 2014-06-05 23:40:02 LOG:   pid 30220: pool_send_and_wait: Error or notice
> message from backend: : DB node id: 1 backend pid: 23115 statement: UPDATE
> SCGroupRequestStatus SET Status = $1 WHERE Status = $2 and LastUpdateTime <
> $3 and Owner is NULL message: cannot execute UPDATE in a read-only
> transaction
> 2014-06-05 23:41:02 LOG:   pid 30220: pool_send_and_wait: Error or notice
> message from backend: : DB node id: 1 backend pid: 23115 statement: UPDATE
> SCGroupRequestStatus SET Status = $1 WHERE Status = $2 and LastUpdateTime <
> $3 and Owner is NULL message: cannot execute UPDATE in a read-only
> transaction
> 
> 
> As you can see, the node was detached from the cluster correctly, but the
> failover script was NOT executed. So the trigger file was never created,
> and the slave was not promoted to master.
> 
> Why is that? Please see the configurations in the previous email.
> 
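One difference between the two logs stands out: in the working case this
pgpool logged "wd_assume_lock_holder: become a new lock holder" before
executing the failover command, while in the failing case it only logged
"wd_send_response: WD_STAND_FOR_LOCK_HOLDER received it". With watchdog
interlocking, only the lock holder runs failover_command, so the command
may have been expected to run on the other pgpool instead. It is worth
checking the log on the other node (the log path here is an assumption;
adjust to wherever pgpool logs on your systems):

  grep -E 'wd_assume_lock_holder|execute command' /var/log/pgpool/pgpool.log
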
> Also, this is the failover script:
> 
> #!/bin/bash -x
> FAILED_NODE=$1         # %d
> OLD_MASTER=$2          # %P
> NEW_MASTER=$3          # %H
> PGDATA=$4              # %R
> TRIGGER_FILE=$5
> 
> SSH_OPTIONS='-T -o StrictHostKeyChecking=no'
> 
> if [ $FAILED_NODE = $OLD_MASTER ]; then
>         /usr/bin/ssh $SSH_OPTIONS $NEW_MASTER /bin/touch $PGDATA/$TRIGGER_FILE
> fi;
> exit 0;
> 
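Independent of the interlocking question, it is worth confirming the script
runs non-interactively as the user pgpool runs as. A minimal check, reusing
the argument values from your earlier successful run (note the dry run
really creates the trigger file and will promote the standby, so only do
this in a test window):

  # the script must be executable by the pgpool user
  ls -l /store/pgdata/failover_stream.sh

  # dry run with the same arguments pgpool passed in the working log
  /store/pgdata/failover_stream.sh 0 0 10.0.90.12 /store/pgdata postgresql.trigger

  # the touch should have created the trigger file on the standby
  ssh 10.0.90.12 ls -l /store/pgdata/postgresql.trigger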
> 
> Sincerely,
> Alexandru
> 
> 
> 
> On Sat, Jun 21, 2014 at 10:49 PM, Alexandru Cardaniuc <cardaniuc at gmail.com>
> wrote:
> 
> >
> > Hi All,
> >
> > I need some validation of the configuration I did for a cluster of
> > 2 pgpools in front of a cluster of 2 PostgreSQL 9.3 databases. I went
> > through all the documentation, but in some places it is not completely
> > clear, so I am not sure whether I misunderstood some things.
> >
> > Here's the info.
> >
> > Pgpool version: 3.3.3 and using watchdog (2 pgpools: 1 primary, 1
> > standby)
> >
> > PostgreSQL version 9.3 (cluster of 2 databases with streaming
> > replication and hot standby configured).
> >
> > This is my relevant PostgreSQL 9.3 config:
> > ------------------------------------------
> > listen_addresses = '*'
> > port = 5432
> > max_connections = 550
> > superuser_reserved_connections = 3
> > wal_level = hot_standby
> > checkpoint_segments = 10
> > checkpoint_completion_target = 0.9
> > max_wal_senders = 10
> > wal_keep_segments = 100
> > hot_standby = on
> > ------------------------------------------
> > The documentation states that:
> > "To enable read-only queries on a standby server, wal_level must be set
> > to hot_standby on the primary, and hot_standby must be enabled in the
> > standby."
> >
> > I use the same config above on both primary database and secondary with
> > the assumption that 'wal_level = hot_standby' is ignored on the slave
> > and 'hot_standby = on' is ignored on the master.
> >
> > Is that a correct assumption? That assumption shouldn't create any
> > issues during pgpool/postgres failover?
> >
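A side note on the failover path: trigger-file promotion only works if the
standby's recovery.conf points trigger_file at the same path the failover
script touches. A minimal sketch for 9.3 (the primary_conninfo values are
assumptions; use your actual replication settings):

  standby_mode = 'on'
  primary_conninfo = 'host=10.0.90.11 port=5432 user=postgres'
  trigger_file = '/store/pgdata/postgresql.trigger'
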
> >
> > Also, about the pgpool config. This is the pgpool config for the 1st
> > pgpool:
> > --------------------------------------------------------------------
> > listen_addresses = '*'
> > port = 9999
> > socket_dir = '/var/run/pgpool'
> > pcp_port = 9898
> > pcp_socket_dir = '/var/run/pgpool'
> > num_init_children = 250
> > child_life_time = 300
> > child_max_connections = 0
> > client_idle_limit = 0
> > enable_pool_hba = false
> > pool_passwd = 'pool_passwd'
> > authentication_timeout = 60
> > print_timestamp = true
> > log_connections = true
> > log_hostname = false
> > log_statement = true
> > log_per_node_statement = false
> > debug_level = 0
> > pid_file_name = '/var/run/pgpool/pgpool.pid'
> > logdir = '/var/run/pgpool'
> > connection_cache = false
> > health_check_period = 60
> > health_check_timeout = 15
> > health_check_user = 'postgres'
> > health_check_password = 'password'
> > health_check_max_retries = 3
> > health_check_retry_delay = 1
> > search_primary_node_timeout = 10
> > failover_command = '/store/pgdata/failover_stream.sh %d %P %H %R
> > postgresql.trigger'
> > failback_command = 'echo $(date) host:%h, new master id:%m, old master
> > id:%M >> /var/log/pgpool/failback.log'
> > follow_master_command = 'echo $(date) host:%h, new master id:%m, old
> > master id:%M >> /var/log/pgpool/master_failover.log'
> > fail_over_on_backend_error = false
> > ignore_leading_white_space = true
> > backend_hostname0 = '10.0.90.11'
> > backend_port0 = 5432
> > backend_weight0 = 1
> > backend_data_directory0 = '/store/pgdata'
> > backend_flag0 = 'ALLOW_TO_FAILOVER'
> > backend_hostname1 = '10.0.90.12'
> > backend_port1 = 5432
> > backend_weight1 = 1
> > backend_data_directory1 = '/store/pgdata'
> > backend_flag1 = 'ALLOW_TO_FAILOVER'
> > ssl = false
> > max_pool = 1
> > connection_life_time = 0
> > reset_query_list = 'ABORT; DISCARD ALL'
> > replication_mode = false
> > replication_stop_on_mismatch = true
> > replicate_select = false
> > insert_lock = true
> > recovery_user = 'postgres'
> > recovery_password = 'password'
> > recovery_1st_stage_command = 'base_backup.sh'
> > recovery_2nd_stage_command = ''
> > recovery_timeout = 120
> > client_idle_limit_in_recovery = 1
> > lobj_lock_table = ''
> > master_slave_mode = true
> > load_balance_mode = true
> > master_slave_sub_mode = 'stream'
> > delay_threshold = 0
> > sr_check_period = 0
> > sr_check_user = 'postgres'
> > sr_check_password = 'password'
> > log_standby_delay = 'none'
> > parallel_mode = false
> > system_db_hostname = 'localhost'
> > system_db_port = 5432
> > system_db_dbname = 'pgpool'
> > system_db_schema = 'pgpool_catalog'
> > system_db_user = 'pgpool'
> > system_db_password = ''
> > pgpool2_hostname = ''
> > memory_cache_enabled = off
> > use_watchdog = on
> > wd_hostname = '10.0.90.11'
> > wd_port = 9000
> > delegate_IP = '10.0.90.1'
> > ifconfig_path = '/store/pgdata/sbin'
> > if_up_cmd = 'ifconfig eth0:0 inet $_IP_$ netmask 255.255.0.0'
> > if_down_cmd = 'ifconfig eth0:0 down'
> > arping_path = '/store/pgdata/sbin'
> > arping_cmd = 'arping -U $_IP_$ -w 1'
> > wd_lifecheck_method = 'heartbeat'
> > wd_interval = 10
> > wd_heartbeat_port = 9694
> > wd_heartbeat_keepalive = 2
> > heartbeat_destination0 = '10.0.90.12'
> > heartbeat_destination_port0 = 9694
> > other_pgpool_hostname0 = '10.0.90.12'
> > other_pgpool_port0 = 9999
> > other_wd_port0 = 9000
> > --------------------------------------------------------------------
> >
> > second pgpool config file:
> > --------------------------------------------------------------------
> > listen_addresses = '*'
> > port = 9999
> > socket_dir = '/var/run/pgpool'
> > pcp_port = 9898
> > pcp_socket_dir = '/var/run/pgpool'
> > num_init_children = 250
> > child_life_time = 300
> > child_max_connections = 0
> > client_idle_limit = 0
> > enable_pool_hba = false
> > pool_passwd = 'pool_passwd'
> > authentication_timeout = 60
> > print_timestamp = true
> > log_connections = true
> > log_hostname = false
> > log_statement = true
> > log_per_node_statement = false
> > debug_level = 0
> > pid_file_name = '/var/run/pgpool/pgpool.pid'
> > logdir = '/var/run/pgpool'
> > connection_cache = false
> > health_check_period = 60
> > health_check_timeout = 15
> > health_check_user = 'postgres'
> > health_check_password = 'password'
> > health_check_max_retries = 3
> > health_check_retry_delay = 1
> > search_primary_node_timeout = 10
> > failover_command = '/store/pgdata/failover_stream.sh %d %P %H %R
> > postgresql.trigger'
> > failback_command = 'echo $(date) host:%h, new master id:%m, old master
> > id:%M >> /var/log/pgpool/failback.log'
> > follow_master_command = 'echo $(date) host:%h, new master id:%m, old
> > master id:%M >> /var/log/pgpool/master_failover.log'
> > fail_over_on_backend_error = false
> > ignore_leading_white_space = true
> > backend_hostname0 = '10.0.90.11'
> > backend_port0 = 5432
> > backend_weight0 = 1
> > backend_data_directory0 = '/store/pgdata'
> > backend_flag0 = 'ALLOW_TO_FAILOVER'
> > backend_hostname1 = '10.0.90.12'
> > backend_port1 = 5432
> > backend_weight1 = 1
> > backend_data_directory1 = '/store/pgdata'
> > backend_flag1 = 'ALLOW_TO_FAILOVER'
> > ssl = false
> > max_pool = 1
> > connection_life_time = 0
> > reset_query_list = 'ABORT; DISCARD ALL'
> > replication_mode = false
> > replication_stop_on_mismatch = true
> > replicate_select = false
> > insert_lock = true
> > recovery_user = 'postgres'
> > recovery_password = 'password'
> > recovery_1st_stage_command = 'base_backup.sh'
> > recovery_2nd_stage_command = ''
> > recovery_timeout = 120
> > client_idle_limit_in_recovery = 1
> > lobj_lock_table = ''
> > master_slave_mode = true
> > load_balance_mode = true
> > master_slave_sub_mode = 'stream'
> > delay_threshold = 0
> > sr_check_period = 0
> > sr_check_user = 'postgres'
> > sr_check_password = 'password'
> > log_standby_delay = 'none'
> > parallel_mode = false
> > system_db_hostname = 'localhost'
> > system_db_port = 5432
> > system_db_dbname = 'pgpool'
> > system_db_schema = 'pgpool_catalog'
> > system_db_user = 'pgpool'
> > system_db_password = ''
> > pgpool2_hostname = ''
> > memory_cache_enabled = off
> > use_watchdog = on
> > wd_hostname = '10.0.90.12'
> > wd_port = 9000
> > delegate_IP = '10.0.90.1'
> > ifconfig_path = '/store/pgdata/sbin'
> > if_up_cmd = 'ifconfig eth0:0 inet $_IP_$ netmask 255.255.0.0'
> > if_down_cmd = 'ifconfig eth0:0 down'
> > arping_path = '/store/pgdata/sbin'
> > arping_cmd = 'arping -U $_IP_$ -w 1'
> > wd_lifecheck_method = 'heartbeat'
> > wd_interval = 10
> > wd_heartbeat_port = 9694
> > wd_heartbeat_keepalive = 2
> > heartbeat_destination0 = '10.0.90.11'
> > heartbeat_destination_port0 = 9694
> > other_pgpool_hostname0 = '10.0.90.11'
> > other_pgpool_port0 = 9999
> > other_wd_port0 = 9000
> > --------------------------------------------------------------------
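Since the two configurations are meant to differ only in the watchdog host
entries, a quick diff between the nodes can rule out accidental drift (the
config path is an assumption):

  diff /etc/pgpool-II/pgpool.conf <(ssh 10.0.90.12 cat /etc/pgpool-II/pgpool.conf)
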
> >
> > Anything wrong or missing in the configs? I need a cluster of pgpools
> > with watchdog and health checks monitoring a cluster of postgres
> > databases with streaming replication set up.
> >
> > Any help in review or suggestions based on your knowledge or experience
> > will be appreciated.
> >
> > Sincerely,
> > Alexandru
> >
> > --
> > "A journey of a thousand miles begins with a single step."
> > - Ancient Proverb
> >
> 
> 
> 
> -- 
> Sincerely yours,
> Alexandru Cardaniuc


-- 
Yugo Nagata <nagata at sraoss.co.jp>

