[pgpool-general: 7251] Re: Query regarding failover and recovery

Tatsuo Ishii ishii at sraoss.co.jp
Tue Sep 1 15:38:51 JST 2020


> Hi Tatsuo,
> 
> Thanks for the detailed explanation. But this is not what I expected. It is
> not in my hands if pg1 reboots for some reason. I was assuming that pg2
> should continue to be master and that pgpool should send requests to pg2
> until another failover occurs.

Of course. Just rebooting pg1 does not affect Pgpool-II; it will keep
regarding pg1 as down.

> And at any point of time, we should not have 2
> masters. How can we avoid this situation? If we have 3 postgres nodes, will
> it solve the problem?

But someone could make a mistake at any time: for example, running the
pg_ctl promote command on the wrong node. There is not much pgpool can
do about such an administrator error. To avoid confusion from that kind
of mistake, I would recommend:

1) Do not use the -D option of pgpool. Without that option, pgpool
remembers what the status of each backend was across restarts. Without
-D, the scenario:
>> pg1: down, standby
>> pg2: up, primary
>>
>> So far, so good. BUT if you restart the whole system, you will find that:
>>
>> pg1: up, primary
>> pg2: up, standby

will instead remain:

pg1: down, standby
pg2: up, primary

even after you restart the whole system, because pgpool restores the
saved backend status at startup.

2) Enable detach_false_primary.
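For example (an illustrative pgpool.conf sketch, not taken from your
configuration; note that detach_false_primary relies on the streaming
replication check being enabled):

```
# pgpool.conf excerpt (illustrative values)
detach_false_primary = on    # detach a node that claims to be primary but
                             # has no standby following it
sr_check_period = 10         # detach_false_primary depends on the
                             # streaming replication check
sr_check_user = 'postgres'
```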

> On Tue, Sep 1, 2020 at 11:36 AM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> 
>> > Hi Tatsuo,
>> >
>> > Thanks for letting me know the process. Just curious to know what will
>> > happen if pg1 reboots and comes back and in the meantime pg2 got
>> > promoted as master. What is the expected behavior in this situation?
>>
>> When pg1 shuts down, the failover process marks pg1 as down, and pg2
>> gets promoted to primary. So you have:
>>
>> pg1: down, standby
>> pg2: up, primary
>>
>> So far, so good. BUT if you restart the whole system, you will find that:
>>
>> pg1: up, primary
>> pg2: up, standby
>>
>> Because both pg1 and pg2 claim to be primary and there's no other
>> info to decide which one is the true primary. In this case pgpool
>> takes the first node that is not a standby to be the primary.
>>
>> Actually pg2 is not a standby synched with pg1, so you will get into
>> trouble (you will see a huge replication_delay).
>>
>> If you turn on detach_false_primary, pgpool will find that pg2 is not
>> a standby and make it down.
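The selection rule described above can be sketched as follows (a
simplification for illustration, not pgpool's actual code):

```shell
#!/bin/sh
# Sketch of the rule described above: each argument is the result of
# "SELECT pg_is_in_recovery()" on one backend, in node-id order; the
# first node answering 'f' (not in recovery) is taken as the true primary.
decide_primary() {
  i=0
  for in_recovery in "$@"; do
    if [ "$in_recovery" = "f" ]; then
      echo "node $i"
      return 0
    fi
    i=$((i + 1))
  done
  echo "none"
}

# The split-brain case from this thread: both nodes claim primary,
# so node 0 (pg1) wins even though pg2 was the acting primary.
decide_primary f f   # prints: node 0
```

In a healthy cluster only one node answers 'f', so the rule is
unambiguous; it only misfires in the two-primaries situation above.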
>>
>> > On Tue, Sep 1, 2020 at 10:32 AM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>> >
>> >> > Hi Tatsuo,
>> >> >
>> >> > Thanks for your email. Yes, I have read the message and fixed the pcp
>> >> > authentication issue.
>> >> >
>> >> > Now, I'm facing a different issue. I will try to explain it step by
>> >> > step.
>> >> >
>> >> > Sequence:
>> >> > 1. pg1 is primary and pg2 is standby
>> >> > 2. Stop postgres service on pg1
>> >> > 3. Pgpool executes failover
>> >> > 4. Now pg2 is my primary. I verified all the configurations.
>> >> > 5. Started postgres service on pg1
>> >> > 6. Trying to recover failed node
>> >> >
>> >> > postgres at ip-172-31-39-241:/etc/pgpool2/4.0.9$ pcp_recovery_node -h
>> >> > localhost -p 9898 -n 0
>> >> > Password:
>> >> > ERROR:  process recovery request failed
>> >> > DETAIL:  primary server cannot be recovered by online recovery.
>> >>
>> >> You should not have started postgres on pg1 at this point.
>> >> pcp_recovery_node must not be executed against a running node. See
>> >> the manual:
>> >>
>> >> Note: The recovery target PostgreSQL server must not be running for
>> >> performing the online recovery. If the target PostgreSQL server has
>> >> already started, you must shut it down before starting the online
>> >> recovery.
>> >>
>> >> If you had not started pg1, pcp_recovery_node would have succeeded.
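The precondition from the manual can be sketched as a small guard (pure
logic for illustration; the port and node id are assumptions, not taken
from this setup):

```shell
#!/bin/sh
# Sketch of the rule above: online recovery may only run when the target
# PostgreSQL is stopped. The "running" flag is supplied by the caller
# (e.g. from pg_isready) -- no real servers are contacted here.
can_run_online_recovery() {
  target_running=$1   # "yes" while the target PostgreSQL is still up
  if [ "$target_running" = "yes" ]; then
    echo "refused: shut the target down, then run pcp_recovery_node"
    return 1
  fi
  echo "ok: pcp_recovery_node -h localhost -p 9898 -n 0 may run now"
}

can_run_online_recovery yes || true   # the mistake in this thread
can_run_online_recovery no            # after stopping postgres on pg1
```

In practice the "running" flag would come from something like pg_isready
run against the target host before invoking pcp_recovery_node.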
>> >>
>> >> > Log says verify_backend_node_status: decided node 0 is the true
>> >> > primary
>> >> >
>> >> > This command hangs
>> >> > psql -U postgres -h localhost -p 9999 --pset pager=off -c "show
>> >> > pool_nodes"
>> >> >
>> >> > I would like to know why pgpool is considering node 0, which is
>> >> > pg1, as the true primary?
>> >>
>> >> Because pgpool regards the first primary node found as the true
>> >> primary node when there is no other information (such as a standby
>> >> node) to decide otherwise.
>> >> > Then what did I do ?
>> >> > 1. Cleaned everything on pg1 and manually configured it as standby.
>> >> > 2. Started postgres service
>> >> > 3. Log says
>> >> > 2020-08-31 13:40:00: pid 2459: DEBUG:  do_query: extended:0
>> >> > query:"SELECT pg_is_in_recovery()"
>> >> > 2020-08-31 13:40:00: pid 2459: DEBUG:  verify_backend_node_status:
>> >> > there's no standby node
>> >> > 2020-08-31 13:40:00: pid 2459: DEBUG:  node status[0]: 0
>> >> > 2020-08-31 13:40:00: pid 2459: DEBUG:  node status[1]: 1
>> >> >
>> >> > This command executes now
>> >> > postgres at ip-172-31-39-241:/etc/pgpool2/4.0.9$ psql -U postgres -h
>> >> > localhost -p 9999 --pset pager=off -c "show pool_nodes"
>> >> >  node_id | hostname | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | last_status_change
>> >> > ---------+----------+------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
>> >> >  0       | pg1      | 5432 | down   | 0.500000  | standby | 0          | false             | 0                 | 2020-08-31 13:35:44
>> >> >  1       | pg2      | 5432 | up     | 0.500000  | primary | 0          | true              | 0                 | 2020-08-31 13:35:44
>> >> > (2 rows)
>> >> >
>> >> > I checked between pg1 and pg2 and see that streaming replication is
>> >> > working.
>> >> >
>> >> > Then I used pcp_attach_node to attach standby
>> >> >
>> >> > postgres at ip-172-31-39-241:/etc/pgpool2/4.0.9$ pcp_attach_node -h
>> >> > localhost -p 9898 -n 0
>> >> > Password:
>> >> > pcp_attach_node -- Command Successful
>> >> > postgres at ip-172-31-39-241:/etc/pgpool2/4.0.9$ psql -U postgres -h
>> >> > localhost -p 9999 --pset pager=off -c "show pool_nodes"
>> >> >  node_id | hostname | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | last_status_change
>> >> > ---------+----------+------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
>> >> >  0       | pg1      | 5432 | up     | 0.500000  | standby | 0          | true              | 0                 | 2020-08-31 14:09:31
>> >> >  1       | pg2      | 5432 | up     | 0.500000  | primary | 0          | false             | 0                 | 2020-08-31 14:02:35
>> >> > (2 rows)
>> >> >
>> >> > Then I tried to perform a test operation and it executed successfully.
>> >> > postgres at ip-172-31-39-241:/etc/pgpool2/4.0.9$ psql -U postgres -h
>> >> > localhost -p 9999 --pset pager=off -c "create database covid"
>> >> > CREATE DATABASE
>> >> >
>> >> > Am I missing something in this failover process? Please let me know
>> >> > if you need any additional details.
>> >> >
>> >> > On Tue, Sep 1, 2020 at 3:03 AM Tatsuo Ishii <ishii at sraoss.co.jp>
>> >> > wrote:
>> >> >
>> >> >> Praveen,
>> >> >>
>> >> >> Have you read this message? I wonder if you have fixed the issue or
>> not.
>> >> >>
>> >> >> Best regards,
>> >> >> --
>> >> >> Tatsuo Ishii
>> >> >> SRA OSS, Inc. Japan
>> >> >> English: http://www.sraoss.co.jp/index_en.php
>> >> >> Japanese:http://www.sraoss.co.jp
>> >> >>
>> >> >> From: Tatsuo Ishii <ishii at sraoss.co.jp>
>> >> >> Subject: [pgpool-general: 7208] Re: Query regarding failover and
>> >> >> recovery
>> >> >> Date: Thu, 20 Aug 2020 08:51:04 +0900 (JST)
>> >> >> Message-ID: <20200820.085104.891242161358675858.t-ishii at sraoss.co.jp>
>> >> >>
>> >> >> > The contents of pcp.conf look incorrect.
>> >> >> >
>> >> >> >> postgres:md53175bce1d3201d16594cebf9d7eb3f9d
>> >> >> >
>> >> >> > The hashed password must not start with "md5".
>> >> >> >
>> >> >> > To create proper pcp password, please follow the instruction in the
>> >> >> manual:
>> >> >> > https://www.pgpool.net/docs/40/en/html/configuring-pcp-conf.html
>> >> >> >
>> >> >> >> Hello,
>> >> >> >>
>> >> >> >> Thanks for the clarification. I'm trying to execute it and getting
>> >> >> >> the below error. I'm attaching configs for your reference. Can you
>> >> >> >> please help?
>> >> >> >>
>> >> >> >> postgres at pgp1:/etc/pgpool2/4.0.9$ psql -U postgres -h localhost
>> >> >> >> -p 9999 --pset pager=off -c "show pool_nodes"
>> >> >> >>  node_id | hostname | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | last_status_change
>> >> >> >> ---------+----------+------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
>> >> >> >>  0       | pg1      | 5432 | down   | 0.500000  | standby | 0          | false             | 0                 | 2020-08-19 09:02:46
>> >> >> >>  1       | pg2      | 5432 | up     | 0.500000  | primary | 0          | true              | 0                 | 2020-08-19 09:02:46
>> >> >> >> (2 rows)
>> >> >> >>
>> >> >> >> postgres at pgp1:/etc/pgpool2/4.0.9$
>> >> >> >> postgres at pgp1:/etc/pgpool2/4.0.9$ pcp_recovery_node -h localhost
>> >> >> >> -p 9898 -n 0
>> >> >> >> Password:
>> >> >> >> FATAL:  authentication failed for user "postgres"
>> >> >> >> DETAIL:  username and/or password does not match
>> >> >> >>
>> >> >> >> postgres at pgp1:/etc/pgpool2/4.0.9$
>> >> >> >>
>> >> >> >>
>> >> >> >> On Wed, Aug 19, 2020 at 10:22 AM Tatsuo Ishii <ishii at sraoss.co.jp>
>> >> >> >> wrote:
>> >> >> >>
>> >> >> >>> > I have 3 servers with two postgres (9.6) and one pgpool (4.0.9).
>> >> >> >>> > Postgres is configured with streaming replication.
>> >> >> >>> > When I manually stop the postgres service on the primary node,
>> >> >> >>> > failover happens successfully.
>> >> >> >>> > Now I started the postgres service on the old primary node, which
>> >> >> >>> > is expected to be converted to a standby, but pgpool is not
>> >> >> >>> > triggering recovery_1st_stage_command = 'recovery_1st_stage.sh'.
>> >> >> >>> > May I know what could be the reason?
>> >> >> >>>
>> >> >> >>> That is expected behavior. The node previously brought down is
>> >> >> >>> left as "down" by pgpool. This is intentional. You need to issue
>> >> >> >>> pcp_recovery_node against the node (the previous primary node in
>> >> >> >>> your case) to make it online again.
>> >> >> >>>
>> >> >> >>> When a node is brought down, there might be a reason: for example,
>> >> >> >>> a need to repair hardware. So in general it is not safe to
>> >> >> >>> automatically restart a previously downed node.
>> >> >> >>>
>> >> >> >>> Best regards,
>> >> >> >>> --
>> >> >> >>> Tatsuo Ishii
>> >> >> >>> SRA OSS, Inc. Japan
>> >> >> >>> English: http://www.sraoss.co.jp/index_en.php
>> >> >> >>> Japanese:http://www.sraoss.co.jp
>> >> >> >>>
>> >> >> >>
>> >> >> >>
>> >> >> >> --
>> >> >> >> Regards,
>> >> >> >> K S Praveen Kumar
>> >> >> >> M: +91-9986855625
>> >> >> > _______________________________________________
>> >> >> > pgpool-general mailing list
>> >> >> > pgpool-general at pgpool.net
>> >> >> > http://www.pgpool.net/mailman/listinfo/pgpool-general
>> >> >>
>> >>
>>
> 
> 

