[pgpool-general: 6723] Re: Query

Muhammad Usama m.usama at gmail.com
Wed Oct 2 23:18:12 JST 2019


Hi Lakshmi,

Sorry for the delayed response and many thanks for providing the log files.

I have been looking into a few similar bug reports, and after reviewing the
log you sent and the ones shared on
https://www.pgpool.net/mantisbt/view.php?id=547, I realized that there was
confusion in the watchdog code about how to handle life-check failure
scenarios, especially the case where the life-check reports a node failure
while the watchdog core is still able to communicate with the remote nodes,
and the case where node A's life-check reports node B as lost while B still
thinks A is alive and healthy.
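
For context, the life-check here is the watchdog health-check layer
configured in pgpool.conf. The values below are only an illustrative sketch
of the settings that drive it, not taken from your configuration:

    use_watchdog = on
    wd_lifecheck_method = 'heartbeat'   # how watchdog nodes probe each other
    wd_interval = 10                    # seconds between life-check rounds
    wd_heartbeat_deadtime = 30          # silence (seconds) before a node is reported lost

The fix changes how the watchdog core reacts when this layer reports a remote
node as lost.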

So I have reviewed the whole watchdog design around the life-check reports
and made some fixes. I am not sure whether you have a development setup where
you can verify the fix, but I am attaching the patch anyway in case you want
to try it out. The patch is generated against the current MASTER branch; I
will commit it after a little more testing and then backport it to all
supported branches, so hopefully your issue will be fixed in the upcoming
release of Pgpool-II.
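
If you do have a build environment, applying the patch would go roughly like
this (only a sketch, assuming a source checkout of the current master branch
with the attached file saved alongside it):

    cd pgpool2                            # your Pgpool-II master source checkout
    patch -p1 < watchdog_node_lost_fix.diff
    ./configure
    make
    sudo make install                     # then restart pgpool on the test nodes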

Thanks
Best regards
Muhammad Usama



On Mon, Sep 2, 2019 at 9:31 AM Lakshmi Raghavendra <lakshmiym108 at gmail.com>
wrote:

> Hi Tatsuo,
>
>           Please find attached the zip file.
>
> Thanks And Regards,
>
>   Lakshmi Y M
>
> On Mon, Sep 2, 2019 at 5:13 AM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>
>> Hi Lakshmi,
>>
>> Your attached files are too large to be accepted by the mailing list. Can
>> you compress them and post the message along with the compressed
>> attachments?
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese:http://www.sraoss.co.jp
>>
>> From: Lakshmi Raghavendra <lakshmiym108 at gmail.com>
>> Subject: Fwd: [pgpool-general: 6672] Query
>> Date: Sun, 1 Sep 2019 23:14:30 +0530
>> Message-ID: <
>> CAHHVJ5sRoVFEEW4EoZLgudCTTm0cqGjXhbbkpnOiimcs4euUSw at mail.gmail.com>
>>
>> > ---------- Forwarded message ---------
>> > From: Lakshmi Raghavendra <lakshmiym108 at gmail.com>
>> > Date: Sat, Aug 31, 2019 at 10:17 PM
>> > Subject: Re: [pgpool-general: 6672] Query
>> > To: Tatsuo Ishii <ishii at sraoss.co.jp>
>> > Cc: Muhammad Usama <m.usama at gmail.com>, <pgpool-general at pgpool.net>
>> >
>> >
>> > Hi Usama / Tatsuo,
>> >
>> >          Received the email notification today, sorry for the delayed
>> > response.
>> > Please find attached the pgpool-II log for the same.
>> >
>> > So basically, below is a short summary of the issue:
>> >
>> >
>> > Node-1 : Pgpool Master + Postgres Master
>> >
>> > Node-2 : Pgpool Standby + Postgres Standby
>> >
>> > Node-3 : Pgpool Standby + Postgres Standby
>> >
>> >
>> > When a network failure happens and Node-1 goes out of the network, below
>> > is the status:
>> >
>> > Node-1 : Pgpool Lost status + Postgres Standby (down)
>> >
>> > Node-2 : Pgpool Master + Postgres Master
>> >
>> > Node-3 : Pgpool Standby + Postgres Standby
>> >
>> >
>> > Now when Node-1 comes back onto the network, below is the status, which
>> > causes the pgpool cluster to become imbalanced:
>> >
>> >
>> >
>> > lcm-34-189:~ # psql -h 10.198.34.191 -p 9999 -U pgpool postgres -c "show pool_nodes"
>> > Password for user pgpool:
>> >  node_id |   hostname    | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | last_status_change
>> > ---------+---------------+------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
>> >  0       | 10.198.34.188 | 5432 | up     | 0.333333  | primary | 0          | true              | 0                 | 2019-08-31 16:40:26
>> >  1       | 10.198.34.189 | 5432 | up     | 0.333333  | standby | 0          | false             | 1013552           | 2019-08-31 16:40:26
>> >  2       | 10.198.34.190 | 5432 | up     | 0.333333  | standby | 0          | false             | 0                 | 2019-08-31 16:40:26
>> > (3 rows)
>> >
>> > lcm-34-189:~ # /usr/local/bin/pcp_watchdog_info -p 9898 -h 10.198.34.191 -U pgpool
>> > Password:
>> > 3 NO lcm-34-188.dev.lcm.local:9999 Linux lcm-34-188.dev.lcm.local 10.198.34.188
>> >
>> > lcm-34-189.dev.lcm.local:9999 Linux lcm-34-189.dev.lcm.local lcm-34-189.dev.lcm.local 9999 9000 7 STANDBY
>> > lcm-34-188.dev.lcm.local:9999 Linux lcm-34-188.dev.lcm.local 10.198.34.188 9999 9000 4 MASTER
>> > lcm-34-190.dev.lcm.local:9999 Linux lcm-34-190.dev.lcm.local 10.198.34.190 9999 9000 4 MASTER
>> > lcm-34-189:~ #
>> >
>> >
>> >
>> > Thanks And Regards,
>> >
>> >    Lakshmi Y M
>> >
>> > On Tue, Aug 20, 2019 at 8:55 AM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>> >
>> >> > On Sat, Aug 17, 2019 at 12:28 PM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>> >> >
>> >> >> > Hi Pgpool Team,
>> >> >> >
>> >> >> >               *We are nearing a production release and running into
>> >> >> > the below issues.*
>> >> >> > Replies at the earliest would be highly helpful and greatly
>> >> >> > appreciated.
>> >> >> > Please let us know how to get rid of the below issues.
>> >> >> >
>> >> >> > We have a 3-node pgpool + postgres cluster - M1, M2, M3. The
>> >> >> > pgpool.conf is as attached.
>> >> >> >
>> >> >> > *Case I :*
>> >> >> > M1 - Pgpool Master + Postgres Master
>> >> >> > M2, M3 - Pgpool slave + Postgres slave
>> >> >> >
>> >> >> > - M1 goes out of network; it is marked as LOST in the pgpool cluster
>> >> >> > - M2 becomes postgres master
>> >> >> > - M3 becomes pgpool master.
>> >> >> > - When M1 comes back to the network, pgpool is able to resolve the
>> >> >> > split-brain. However, it changes the postgres master back to M1,
>> >> >> > logging the statement "LOG:  primary node was chenged after the sync
>> >> >> > from new master". Since M2 was already the postgres master (and its
>> >> >> > trigger file was not touched), it is not able to sync to the new
>> >> >> > master.
>> >> >> > *I somehow want to avoid this postgres master change.. please let us
>> >> >> > know if there is a way to avoid it*
>> >> >>
>> >> >> Sorry, but I don't know how to prevent this. Probably when the former
>> >> >> watchdog master recovers from a network outage and there is already a
>> >> >> PostgreSQL primary server, the watchdog master should not sync the
>> >> >> state. What do you think, Usama?
>> >> >>
>> >> >
>> >> > Yes, that's true, there is no functionality in Pgpool-II to disable the
>> >> > backend node status sync. In fact, it would be hazardous if we somehow
>> >> > disabled the node status syncing.
>> >> >
>> >> > But having said that, in the mentioned scenario, when M1 comes back and
>> >> > joins the watchdog cluster, Pgpool-II should have kept M2 as the true
>> >> > master while resolving the split-brain. The algorithm used to resolve
>> >> > the true master considers quite a few parameters, and for the scenario
>> >> > you explained, M2 should have kept the master node status while M1
>> >> > should have resigned after rejoining the cluster. Effectively, the M1
>> >> > node should have been syncing the status from M2 (keeping the proper
>> >> > primary node), not the other way around.
>> >> > Can you please share the Pgpool-II log files so that I can have a look
>> >> > at what went wrong in this case?
>> >>
>> >> Usama,
>> >>
>> >> Ok, the scenario (PostgreSQL primary x 2 in the end) should not have
>> >> happened. That's good news.
>> >>
>> >> Lakshmi,
>> >>
>> >> Can you please provide the Pgpool-II log files as Usama requested?
>> >>
>> >> Best regards,
>> >> --
>> >> Tatsuo Ishii
>> >> SRA OSS, Inc. Japan
>> >> English: http://www.sraoss.co.jp/index_en.php
>> >> Japanese:http://www.sraoss.co.jp
>> >>
>>
> _______________________________________________
> pgpool-general mailing list
> pgpool-general at pgpool.net
> http://www.pgpool.net/mailman/listinfo/pgpool-general
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20191002/5750b650/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: watchdog_node_lost_fix.diff
Type: application/octet-stream
Size: 54090 bytes
Desc: not available
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20191002/5750b650/attachment-0001.obj>


More information about the pgpool-general mailing list