[pgpool-general: 5113] Re: pgpool incorrectly thinks backend cluster is down

manoj amanualjolt at gmail.com
Thu Nov 10 00:57:34 JST 2016


I actually tried to replicate this scenario by issuing a SELECT statement
through pgpool and running pg_terminate_backend against that query on the
backend cluster. This did not trigger a cluster eviction. Also, after some
time, pgpool evicted one of the clusters when no pg_terminate_backend was
running. Can pg_terminate_backend on any session cause this, or does it need
to terminate the specific session that is being handled by pgpool? Either
way, I could not replicate this scenario in my testing. What else could have
caused this?
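
In case it helps, this is roughly what my reproduction attempt looked like.
The host names, database name, table name and the pid are placeholders from
my test setup, not values from the incident logged below:

# session 1: a long-running SELECT sent through pgpool (default port 9999 assumed)
psql -h pgpool-host -p 9999 -U username -d dev -c "SELECT COUNT(*) FROM big_table;"

# session 2: connect directly to the backend cluster and terminate that session
psql -h cluster1-endpoint -p 5439 -U username -d dev -c "SELECT process, user_name, starttime FROM stv_sessions;"
psql -h cluster1-endpoint -p 5439 -U username -d dev -c "SELECT pg_terminate_backend(12345);"  # pid taken from stv_sessions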

2016-11-08 21:00:19: pid 14686: LOG:  reading and processing packets
2016-11-08 21:00:19: pid 14686: DETAIL:  postmaster on DB node 0 was
shutdown by administrative command
2016-11-08 21:00:19: pid 14686: LOG:  received degenerate backend request
for node_id: 0 from pid [14686]
2016-11-08 21:00:19: pid 6098: LOG:  starting degeneration. shutdown host
armada.cutpdh4v4uf7.us-east-1.redshift.amazonaws.com(5439)
2016-11-08 21:00:19: pid 6098: LOG:  Restart all children
2016-11-08 21:00:19: pid 6098: LOG:  failover: set new primary node: -1
2016-11-08 21:00:19: pid 18576: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 6098: LOG:  failover: set new master node: 1
2016-11-08 21:00:19: pid 10896: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 18491: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 6099: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 6556: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 18495: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 9492: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 6128: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 6549: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 19025: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 6103: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 6127: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 6554: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 19026: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 14007: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 18494: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 6553: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 18575: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 14686: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 18492: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 6123: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 6552: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 6113: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 6119: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 18497: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 6117: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 6104: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 18493: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 18496: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 12497: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 6109: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 6106: LOG:  child process received shutdown
request signal 3
2016-11-08 21:00:19: pid 6132: LOG:  worker process received restart request
2016-11-08 21:00:19: pid 6098: LOG:  failover done. shutdown host armada.amazonaws.com(5439)
2016-11-08 21:00:20: pid 6131: LOG:  restart request received in pcp child
process
2016-11-08 21:00:20: pid 6098: LOG:  PCP child 6131 exits with status 0 in
failover()
2016-11-08 21:00:20: pid 6098: LOG:  fork a new PCP child pid 26992 in
failover()
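
If I understand the FAQ Tatsuo linked, node 0 will stay detached after an
event like this until it is re-attached by hand. I assume the way to do that
is something like the following (default pcp port 9898 and a pcp.conf user
assumed; the option-style syntax is for the 3.5-style pcp tools, while older
releases take positional arguments):

pcp_node_info -h localhost -p 9898 -U pcp_user -n 0     # status pgpool currently holds for node 0
pcp_attach_node -h localhost -p 9898 -U pcp_user -n 0   # re-attach node 0 once the cluster is reachable again

If pcp_attach_node is not the right tool for simply clearing a false DOWN
status, please let me know.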


For the health check error below, why is this happening only with DB node 0
and not with DB node 1, when both of the underlying clusters have exactly the
same Redshift configuration?

2016-11-08 21:07:46: pid 6098: LOG:  pool_ssl: "SSL_read": "no SSL error
reported"
2016-11-08 21:07:46: pid 6098: LOG:  notice_backend_error: called from
pgpool main. ignored.
2016-11-08 21:07:46: pid 6098: WARNING:  child_exit: called from invalid
process. ignored.
2016-11-08 21:07:46: pid 6098: ERROR:  unable to read data from DB node 0
2016-11-08 21:07:46: pid 6098: DETAIL:  socket read failed with an error
"Success"


Thanks,
- Manoj

On Wed, Nov 9, 2016 at 1:46 AM, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:

> >>> Whenever Pgpool-II thinks a backend is down, there should be a
> >>> log entry in the Pgpool-II log file. Please check.
> >>
> >> This is the error in the log file when this happens
> >>
> >> 2016-11-02 00:00:07: pid 9217: DETAIL:  postmaster on DB node 0 was
> >> shutdown by administrative command
> >> 2016-11-02 00:00:07: pid 9217: LOG:  received degenerate backend request
> >> for node_id: 0 from pid [9217]
> >> 2016-11-02 00:00:07: pid 9188: LOG:  starting degeneration. shutdown host
> >> prod1.amazonaws.com(5439)
> >> 2016-11-02 00:00:07: pid 9188: LOG:  Restart all children
> >>
> >> What does "postmaster on DB node 0 was shutdown by administrative
> >> command" mean? I haven't sent any shutdown commands to pgpool.
> >
> > Someone shutdown PostgreSQL (or used pg_cancel_backend).
>
> Correction. I meant pg_terminate_backend, rather than
> pg_cancel_backend (it just cancels the current query, if any, and is
> harmless for Pgpool-II).
>
> >> I verify connectivity to the
> >> cluster whenever this happens and it is always fine. Why does the health
> >> check that I configured to run every 30 seconds not detect that the
> >> cluster is back up again and update the pgpool_status file?
> >
> > See the FAQ.
> > http://www.pgpool.net/mediawiki/index.php/FAQ#Why_does_not_Pgpool-II_automatically_recognize_a_database_comes_back_online.3F
> >
> >> Health check details from
> >> the log are below
> >>
> >> 2016-11-01 23:59:54: pid 9188: LOG:  notice_backend_error: called from
> >> pgpool main. ignored.
> >> 2016-11-01 23:59:54: pid 9188: WARNING:  child_exit: called from invalid
> >> process. ignored.
> >
> > No need to worry about this part. There was a race condition inside
> > Pgpool-II, but it has been resolved.
> >
> >> 2016-11-01 23:59:54: pid 9188: ERROR:  unable to read data from DB node 0
> >> 2016-11-01 23:59:54: pid 9188: DETAIL:  socket read failed with an error
> >> "Success"
> >>
> >> What does the above log indicate?
> >
> > DB node 0 disconnected the socket to Pgpool-II.
> >
> >>> Yes, it randomly routes to backends. You can control the probability
> >>> of the routing.
> >>
> >> Is it possible to control routing using a round-robin approach or the
> >> least-used cluster? If so, where do I configure this?
> >
> > No.
> >
> >> Thanks,
> >> - Manoj
> >>
> >> On Mon, Nov 7, 2016 at 12:08 AM, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> >>
> >>> > I have pgpool configured against two Redshift backend clusters to do
> >>> > parallel writes. Seemingly at random, pgpool determines that one or both
> >>> > of the clusters are down and stops accepting connections even when they
> >>> > are not down. I have a health check configured every 30 seconds, but that
> >>> > does not help: it runs the check and still marks them as down in the
> >>> > pgpool_status file. How is health status determined and written to the
> >>> > file /var/log/pgpool/pgpool_status, and why does pgpool think the
> >>> > clusters are down when they are not?
> >>>
> >>> Whenever Pgpool-II thinks a backend is down, there should be a
> >>> log entry in the Pgpool-II log file. Please check.
> >>>
> >>> > I also tested read query routing and noticed queries were being routed
> >>> > randomly to the backend clusters. Is there a specific algorithm that
> >>> > pgpool uses for read query routing?
> >>>
> >>> Yes, it randomly routes to backends. You can control the probability
> >>> of the routing.
> >>>
> >>> >
> >>> >
> >>> >
> >>> >
> >>> > My config parameters are below
> >>> >
> >>> >
> >>> >
> >>> > backend_hostname0 = 'cluster1'
> >>> >
> >>> > backend_port0 = 5439
> >>> >
> >>> > backend_weight0 = 1
> >>> >
> >>> > backend_data_directory0 = '/data1'
> >>> >
> >>> > backend_flag0 = 'ALLOW_TO_FAILOVER'
> >>> >
> >>> >
> >>> >
> >>> > backend_hostname1 = 'cluster2'
> >>> >
> >>> > backend_port1 = 5439
> >>> >
> >>> > backend_weight1 = 1
> >>> >
> >>> > backend_data_directory1 = '/data1'
> >>> >
> >>> > backend_flag1 = 'ALLOW_TO_FAILOVER'
> >>> >
> >>> >
> >>> >
> >>> > #------------------------------------------------------------------------------
> >>> >
> >>> > # HEALTH CHECK
> >>> >
> >>> > #------------------------------------------------------------------------------
> >>> >
> >>> >
> >>> >
> >>> > health_check_period = 30
> >>> >
> >>> >                                    # Health check period
> >>> >
> >>> >                                    # Disabled (0) by default
> >>> >
> >>> > health_check_timeout = 20
> >>> >
> >>> >                                    # Health check timeout
> >>> >
> >>> >                                    # 0 means no timeout
> >>> >
> >>> > health_check_user = 'username'
> >>> >
> >>> >                                    # Health check user
> >>> >
> >>> > health_check_password = 'password'
> >>> >
> >>> >                                    # Password for health check user
> >>> >
> >>> > health_check_max_retries = 10
> >>> >
> >>> >                                    # Maximum number of times to retry a failed health check before giving up.
> >>> >
> >>> > health_check_retry_delay = 1
> >>> >
> >>> >                                    # Amount of time to wait (in seconds) between retries.
> >>>
> > _______________________________________________
> > pgpool-general mailing list
> > pgpool-general at pgpool.net
> > http://www.pgpool.net/mailman/listinfo/pgpool-general
>