[pgpool-general: 6619] Re: After shutdown or poweroff commands pgpool on slave machine failed to start

Bo Peng pengbo at sraoss.co.jp
Fri Jun 28 15:06:44 JST 2019


Hi, 

I followed the same procedure as you and shut down the primary PostgreSQL server,
but I could not reproduce the issue you reported.

I set up two servers as follows:

(1) server1
PostgreSQL primary
Pgpool-II master

(2) server2
PostgreSQL standby
Pgpool-II standby

After I shut down "server1", Pgpool-II on server2 performed failover and promoted PostgreSQL on server2 to the new primary.


I still could not find the reason why failover wasn't performed.
I checked your log, and it seems that the watchdog received the failover command request,
but failover wasn't performed.

Could you check whether there is any log output in the "AAA_failover_called.log" file?

=============
2019-06-25 07:34:30: pid 5390: LOG:  failed to connect to PostgreSQL server on "172.18.255.41:5432", getsockopt() detected error "No route to host"
2019-06-25 07:34:30: pid 5390: LOG:  received degenerate backend request for node_id: 0 from pid [5390]
2019-06-25 07:34:30: pid 5407: LOG:  failed to connect to PostgreSQL server on "172.18.255.41:5432", getsockopt() detected error "No route to host"
2019-06-25 07:34:30: pid 5407: LOG:  received degenerate backend request for node_id: 0 from pid [5407]
2019-06-25 07:34:30: pid 3375: LOG:  new IPC connection received
2019-06-25 07:34:30: pid 3375: LOG:  new IPC connection received
2019-06-25 07:34:30: pid 3375: LOG:  watchdog received the failover command from local pgpool-II on IPC interface
2019-06-25 07:34:30: pid 3375: LOG:  watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from local pgpool-II on IPC interface
=============
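
To make the same check on a longer log, here is a minimal Python sketch. It collects the node ids for which a degenerate backend request was logged, and reports whether any later line suggests that failover actually started. The function name summarize_failover and the marker strings in FAILOVER_MARKERS are assumptions made for illustration, not taken from your log; please adjust the markers to whatever your Pgpool-II version prints when failover begins.

```python
import re

# Matches pgpool log lines of the form shown in the excerpt above:
#   2019-06-25 07:34:30: pid 5390: LOG:  message
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): pid (\d+): (\w+):\s+(.*)$"
)
REQUEST_RE = re.compile(r"received degenerate backend request for node_id: (\d+)")

# ASSUMPTION: text expected in the log once failover really starts.
# These strings are illustrative; adjust them for your pgpool version.
FAILOVER_MARKERS = ("starting degeneration", "execute command")


def summarize_failover(lines, markers=FAILOVER_MARKERS):
    """Return (requested_node_ids, failover_started) for an iterable of log lines."""
    requested = set()
    started = False
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip separators and wrapped fragments
        message = match.group(4)
        request = REQUEST_RE.search(message)
        if request:
            requested.add(int(request.group(1)))
        elif any(marker in message for marker in markers):
            started = True
    return requested, started
```

For example, summarize_failover(open("pgpool.log")) (the file name is a placeholder) returns a set of node ids and a boolean. A non-empty set together with False would match what the excerpt above shows: the request was received but no failover followed.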

Thank you.

On Thu, 27 Jun 2019 07:12:33 +0000
Avi Weinberg <AviW at gilat.com> wrote:

> 
> Hi Bo Peng,
> 
> Attached please find the zipped log file.
> 
> When I do a poweroff I run into all sorts of strange things.  Now I was not able to attach a node.  When I executed the command I got "t", but the node was not attached.  When I try to detach the node I get " ERROR:  invalid degenerate backend request, node id : 1 status: [3] is not valid for failover"
> It is very hard to get the two nodes to work properly after a poweroff.  Your input is most welcome!
> 
> Thanks
> Avi
> 
> 
> 
> -----Original Message-----
> From: Bo Peng [mailto:pengbo at sraoss.co.jp]
> Sent: Tuesday, June 25, 2019 11:30 AM
> To: Avi Weinberg <AviW at gilat.com>
> Cc: pgpool-general at pgpool.net
> Subject: Re: [pgpool-general: 6603] After shutdown or poweroff commands pgpool on slave machine failed to start
> 
> Hi,
> 
> On Mon, 24 Jun 2019 08:40:03 +0000
> Avi Weinberg <AviW at gilat.com> wrote:
> 
> > Hi,
> >
> > Thank you for your reply.  I checked our pgpool configuration (attached) and we already have health check "on" and fail_over_on_backend_error "on".  What could cause pgpool to stop accepting connections after the old master was powered off and the slave server was promoted to master?  We are able to access the database on port 5432 but not on port 9999.  Restarting pgpool does not fix the problem.  Only after a few minutes, probably once some timeout expires, are we able to connect on port 9999.
> 
> Could you share the full log after shutting down the master server to confirm if a failover was executed?
> 
> > Even if this situation cannot be avoided, I would appreciate a way to fix it with an external script, since I already have a script that checks the health of my database and it is aware of the pgpool connection problem. However, I did not find a command to add to my script that makes pgpool accept connections immediately rather than only after a few minutes.
> >
> > Your input is most welcome,
> > Avi
> >
> >
> >
> > fail_over_on_backend_error = on
> >                                    # backend communication socket fails
> >
> > health_check_period = 10
> >                                    # Health check period
> >                                    # Disabled (0) by default
> > health_check_timeout = 20
> >                                    # Health check timeout
> >                                    # 0 means no timeout
> > health_check_user = 'postgres'
> >                                    # Health check user
> > health_check_password = 'mypwd'
> >                                    # Password for health check user
> > health_check_database = ''
> >                                    # Database name for health check. If '', tries 'postgres' first,
> > health_check_max_retries = 0
> >                                    # Maximum number of times to retry a failed health check before giving up.
> > health_check_retry_delay = 1
> >                                    # Amount of time to wait (in seconds) between retries.
> > connect_timeout = 10000
> >                                    # Timeout value in milliseconds before giving up to connect to backend.
> >                                    # Default is 10000 ms (10 seconds). Flaky network user may want to increase
> >                                    # the value. 0 means no timeout.
> >                                    # Note that this value is not only used for health check,
> >                                    # but also for ordinary connection to backend.
> >
> > -----Original Message-----
> > From: Bo Peng [mailto:pengbo at sraoss.co.jp]
> > Sent: Sunday, June 23, 2019 3:35 PM
> > To: Avi Weinberg <AviW at gilat.com>
> > Cc: pgpool-general at pgpool.net
> > Subject: Re: [pgpool-general: 6603] After shutdown or poweroff commands pgpool on slave machine failed to start
> >
> > Hello,
> >
> > On Thu, 20 Jun 2019 13:58:17 +0000
> > Avi Weinberg <AviW at gilat.com> wrote:
> >
> > > Hi experts
> > >
> > > We are using pgpool version 3.6.9 in a master/slave streaming replication setup (postgres 9.6.7).  When we issue a poweroff or shutdown command on the machine where both pgpool and the database were master, in many cases the pgpool status command on the slave machine shows "connection to host:"172.18.255.11:5432" failed", where 172.18.255.11 is the old master.  My search_primary_node_timeout is 10, but even after 3 minutes it still tries the old master IP.
> >
> > "search_primary_node_timeout" parameter is the maximum amount of time to search for the primary node when failover occurs.
> >
> > But the log below shows that failover was not triggered.
> > Therefore, even though you set "search_primary_node_timeout", it did not work.
> >
> > > May 28 10:14:47 h1-nms pgpool[9520]: 2019-05-28 10:14:47: pid 11207: LOG:  trying connecting to PostgreSQL server on "172.18.255.11:5432" by INET socket
> > > May 28 10:14:47 h1-nms pgpool[9520]: 2019-05-28 10:14:47: pid 11207: DETAIL:  timed out. retrying...
> > > May 28 10:14:47 h1-nms pgpool[9520]: 2019-05-28 10:14:47: pid 9520: LOG:  failed to connect to PostgreSQL server on "172.18.255.11:5432", timed out
> > > May 28 10:14:47 h1-nms pgpool[9520]: 2019-05-28 10:14:47: pid 9520: ERROR:  failed to make persistent db connection
> > > May 28 10:14:47 h1-nms pgpool[9520]: 2019-05-28 10:14:47: pid 9520: DETAIL:  connection to host:"172.18.255.11:5432" failed
> >
> > When you shut down the server, "connect_timeout" doesn't work, and the connection to the backend server is retried until the TCP/IP timeout expires.
> >
> > So, if you want to detach the down backend server and trigger failover, you need to enable "failover_on_backend_error" parameter or enable health check.
> >
> > > Your help is most welcome
> > > Avi
> > >
> > >
> > > service pgpool status
> > > Redirecting to /bin/systemctl status  pgpool.service
> > > * pgpool.service - Pgpool-II
> > >    Loaded: loaded (/usr/lib/systemd/system/pgpool.service; disabled; vendor preset: disabled)
> > > >    Active: active (running) since Tue 2019-05-28 10:14:23 GMT; 26s ago
> > > >  Main PID: 9520 (pgpool)
> > >    CGroup: /system.slice/pgpool.service
> > >            |- 9520 /usr/bin/pgpool -f /etc/pgpool-II/pgpool.conf -n
> > >            |- 9537 pgpool: watchdog
> > >            |-11110 pgpool: lifecheck
> > >            |-11111 pgpool: wait for connection request
> > >            |-11113 pgpool: wait for connection request
> > >            |-11114 pgpool: wait for connection request
> > >            |-11116 pgpool: wait for connection request
> > >            |-11118 pgpool: wait for connection request
> > >            |-11119 pgpool: wait for connection request
> > >            |-11120 pgpool: wait for connection request
> > >            |-11121 pgpool: wait for connection request
> > >            |-11122 pgpool: wait for connection request
> > >            |-11123 pgpool: wait for connection request
> > >            |-11124 pgpool: wait for connection request
> > >            |-11126 pgpool: wait for connection request
> > >            |-11127 pgpool: wait for connection request
> > >            |-11128 pgpool: wait for connection request
> > >            |-11129 pgpool: wait for connection request
> > >            |-11131 pgpool: wait for connection request
> > >            |-11132 pgpool: wait for connection request
> > >            |-11133 pgpool: wait for connection request
> > >            |-11134 pgpool: wait for connection request
> > >            |-11135 pgpool: wait for connection request
> > >            |-11138 pgpool: heartbeat receiver
> > >            |-11139 pgpool: heartbeat sender
> > >            |-11140 pgpool: wait for connection request
> > >            |-11141 pgpool: wait for connection request
> > >            |-11142 pgpool: wait for connection request
> > >            |-11143 pgpool: wait for connection request
> > >            |-11144 pgpool: wait for connection request
> > >            |-11145 pgpool: wait for connection request
> > >            |-11146 pgpool: wait for connection request
> > >            |-11148 pgpool: wait for connection request
> > >            |-11149 pgpool: wait for connection request
> > >            |-11150 pgpool: wait for connection request
> > >            |-11151 pgpool: wait for connection request
> > >            |-11152 pgpool: wait for connection request
> > >            |-11153 pgpool: wait for connection request
> > >            |-11155 pgpool: wait for connection request
> > >            |-11156 pgpool: wait for connection request
> > >            |-11157 pgpool: wait for connection request
> > >            |-11158 pgpool: wait for connection request
> > >            |-11159 pgpool: wait for connection request
> > >            |-11160 pgpool: wait for connection request
> > >            |-11161 pgpool: accept connection
> > >            |-11163 pgpool: wait for connection request
> > >            |-11164 pgpool: wait for connection request
> > >            |-11165 pgpool: wait for connection request
> > >            |-11166 pgpool: wait for connection request
> > >            |-11168 pgpool: wait for connection request
> > >            |-11169 pgpool: wait for connection request
> > >            |-11170 pgpool: wait for connection request
> > >            |-11171 pgpool: wait for connection request
> > >            |-11173 pgpool: wait for connection request
> > >            |-11174 pgpool: wait for connection request
> > >            |-11177 pgpool: wait for connection request
> > >            |-11178 pgpool: wait for connection request
> > >            |-11179 pgpool: wait for connection request
> > >            |-11181 pgpool: wait for connection request
> > >            |-11182 pgpool: wait for connection request
> > >            |-11183 pgpool: wait for connection request
> > >            |-11185 pgpool: wait for connection request
> > >            |-11186 pgpool: wait for connection request
> > >            |-11187 pgpool: wait for connection request
> > >            |-11189 pgpool: wait for connection request
> > >            |-11192 pgpool: wait for connection request
> > >            |-11193 pgpool: wait for connection request
> > >            |-11194 pgpool: wait for connection request
> > >            |-11195 pgpool: wait for connection request
> > >            |-11196 pgpool: wait for connection request
> > >            |-11197 pgpool: accept connection
> > >            |-11198 pgpool: wait for connection request
> > >            |-11199 pgpool: wait for connection request
> > >            |-11200 pgpool: wait for connection request
> > >            |-11201 pgpool: wait for connection request
> > >            |-11205 pgpool: PCP: wait for connection request
> > >            `-11207 pgpool: worker process
> > >
> > > > May 28 10:14:38 h1-nms pgpool[9520]: 2019-05-28 10:14:38: pid 11138: LOG:  creating watchdog heartbeat receive socket.
> > > > May 28 10:14:38 h1-nms pgpool[9520]: 2019-05-28 10:14:38: pid 11138: DETAIL:  set SO_REUSEPORT
> > > > May 28 10:14:38 h1-nms pgpool[9520]: 2019-05-28 10:14:38: pid 11139: LOG:  set SO_REUSEPORT option to the socket
> > > > May 28 10:14:38 h1-nms pgpool[9520]: 2019-05-28 10:14:38: pid 11139: LOG:  creating socket for sending heartbeat
> > > > May 28 10:14:38 h1-nms pgpool[9520]: 2019-05-28 10:14:38: pid 11139: DETAIL:  set SO_REUSEPORT
> > > > May 28 10:14:47 h1-nms pgpool[9520]: 2019-05-28 10:14:47: pid 11207: LOG:  trying connecting to PostgreSQL server on "172.18.255.11:5432" by INET socket
> > > > May 28 10:14:47 h1-nms pgpool[9520]: 2019-05-28 10:14:47: pid 11207: DETAIL:  timed out. retrying...
> > > > May 28 10:14:47 h1-nms pgpool[9520]: 2019-05-28 10:14:47: pid 9520: LOG:  failed to connect to PostgreSQL server on "172.18.255.11:5432", timed out
> > > > May 28 10:14:47 h1-nms pgpool[9520]: 2019-05-28 10:14:47: pid 9520: ERROR:  failed to make persistent db connection
> > > > May 28 10:14:47 h1-nms pgpool[9520]: 2019-05-28 10:14:47: pid 9520: DETAIL:  connection to host:"172.18.255.11:5432" failed
> > >
> > >
> > > IMPORTANT - This email and any attachments is intended for the above named addressee(s), and may contain information which is confidential or privileged. If you are not the intended recipient, please inform the sender immediately and delete this email: you should not copy or use this e-mail for any purpose nor disclose its contents to any person.
> >
> >
> > --
> > Bo Peng <pengbo at sraoss.co.jp>
> > SRA OSS, Inc. Japan
> >
> 
> 
> --
> Bo Peng <pengbo at sraoss.co.jp>
> SRA OSS, Inc. Japan
> 


-- 
Bo Peng <pengbo at sraoss.co.jp>
SRA OSS, Inc. Japan


