[pgpool-general: 5385] Re: pgpool split brain situation : both nodes becoming master

Subhankar Chattopadhyay subho.atg at gmail.com
Thu Mar 23 20:58:33 JST 2017


Hello,

Can someone help me with this?

On Wed, Mar 22, 2017 at 3:22 PM, Subhankar Chattopadhyay
<subho.atg at gmail.com> wrote:
> Hello,
>
> We have a 2-node PostgreSQL setup with one node as master and the
> other as slave. pgpool (3 nodes) handles failover and promotes the
> slave to master if the old master goes down.
>
> We unexpectedly found that both nodes had become master. We checked
> the pgpool1 node and found that one failover had been triggered.
> Around that time, the pgpool log shows the following:
>
> 2017-03-14 22:43:49: pid 12810: LOG:  trying connecting to PostgreSQL
> server on "10.11.11.72:5432" by INET socket
>
> 2017-03-14 22:43:49: pid 12810: DETAIL:  timed out. retrying...
>
> 2017-03-14 22:43:59: pid 12810: LOG:  trying connecting to PostgreSQL
> server on "10.11.11.72:5432" by INET socket
>
> 2017-03-14 22:43:59: pid 12810: DETAIL:  timed out. retrying...
>
> 2017-03-14 22:44:09: pid 12810: LOG:  trying connecting to PostgreSQL
> server on "10.11.11.72:5432" by INET socket
>
> 2017-03-14 22:44:09: pid 12810: DETAIL:  timed out. retrying...
>
> 2017-03-14 22:44:19: pid 12810: LOG:  trying connecting to PostgreSQL
> server on "10.11.11.72:5432" by INET socket
>
> 2017-03-14 22:44:19: pid 12810: DETAIL:  timed out. retrying...
>
> 2017-03-14 22:44:29: pid 12810: LOG:  trying connecting to PostgreSQL
> server on "10.11.11.72:5432" by INET socket
>
> 2017-03-14 22:44:29: pid 12810: DETAIL:  timed out. retrying...
>
> 2017-03-14 22:44:39: pid 12810: LOG:  trying connecting to PostgreSQL
> server on "10.11.11.72:5432" by INET socket
>
> 2017-03-14 22:44:39: pid 12810: DETAIL:  timed out. retrying...
>
> 2017-03-14 22:44:49: pid 12810: LOG:  trying connecting to PostgreSQL
> server on "10.11.11.72:5432" by INET socket
>
> 2017-03-14 22:44:49: pid 12810: DETAIL:  timed out. retrying...
>
> 2017-03-14 22:44:56: pid 12810: LOG:  failed to connect to PostgreSQL
> server on "10.11.11.72:5432", getsockopt() detected error "Connection
> timed out"
>
> 2017-03-14 22:44:56: pid 12810: ERROR:  failed to make persistent db connection
>
> 2017-03-14 22:44:56: pid 12810: DETAIL:  connection to
> host:"10.11.11.72:5432" failed
>
> 2017-03-14 22:45:06: pid 12810: ERROR:  Failed to check replication time lag
>
> 2017-03-14 22:45:06: pid 12810: DETAIL:  No persistent db connection
> for the node 0
>
> 2017-03-14 22:45:06: pid 12810: HINT:  check sr_check_user and sr_check_password
>
> 2017-03-14 22:45:06: pid 12810: CONTEXT:  while checking replication time lag
>
> 2017-03-16 11:12:59: pid 12257: LOG:  remote node
> "Linux_c15e85d3-21a8-45e8-ae4e-07c2e7423782_9999" is shutting down
>
> 2017-03-16 11:12:59: pid 12257: LOG:  read from socket failed, remote
> end closed the connection
>
> 2017-03-16 11:12:59: pid 12257: LOG:  read from socket failed, remote
> end closed the connection
>
> 2017-03-16 11:13:19: pid 12257: LOG:  new watchdog node connection is
> received from "10.11.11.75:3768"
>
> 2017-03-16 11:13:19: pid 12257: LOG:  new outbond connection to 10.11.11.75:9000
>
> 2017-03-16 15:09:43: pid 12257: LOG:  remote node
> "Linux_c15e85d3-21a8-45e8-ae4e-07c2e7423782_9999" is shutting down
>
> 2017-03-16 15:09:43: pid 12257: LOG:  read from socket failed, remote
> end closed the connection
>
> 2017-03-16 15:09:43: pid 12257: LOG:  read from socket failed, remote
> end closed the connection
>
> 2017-03-16 15:10:00: pid 12257: LOG:  new watchdog node connection is
> received from "10.11.11.75:37077"
>
> 2017-03-16 15:10:00: pid 12257: LOG:  new outbond connection to 10.11.11.75:9000
>
> 2017-03-17 12:06:04: pid 12810: LOG:  trying connecting to PostgreSQL
> server on "10.11.11.72:5432" by INET socket
>
> 2017-03-17 12:06:04: pid 12810: DETAIL:  timed out. retrying...
>
> 2017-03-18 12:53:25: pid 12810: LOG:  trying connecting to PostgreSQL
> server on "10.11.11.72:5432" by INET socket
>
> 2017-03-18 12:53:25: pid 12810: DETAIL:  timed out. retrying...
>
> 2017-03-18 12:53:35: pid 12810: LOG:  trying connecting to PostgreSQL
> server on "10.11.11.72:5432" by INET socket
>
> 2017-03-18 12:53:35: pid 12810: DETAIL:  timed out. retrying...
>
> 2017-03-18 12:53:45: pid 12810: LOG:  trying connecting to PostgreSQL
> server on "10.11.11.72:5432" by INET socket
>
> 2017-03-18 12:53:45: pid 12810: DETAIL:  timed out. retrying...
>
> 2017-03-18 12:53:55: pid 12810: LOG:  trying connecting to PostgreSQL
> server on "10.11.11.72:5432" by INET socket
>
> 2017-03-18 12:53:55: pid 12810: DETAIL:  timed out. retrying...
>
> 2017-03-18 12:54:05: pid 12810: LOG:  trying connecting to PostgreSQL
> server on "10.11.11.72:5432" by INET socket
>
> 2017-03-18 12:54:05: pid 12810: DETAIL:  timed out. retrying...
>
> 2017-03-18 12:54:15: pid 12810: LOG:  trying connecting to PostgreSQL
> server on "10.11.11.72:5432" by INET socket
>
> 2017-03-18 12:54:15: pid 12810: DETAIL:  timed out. retrying...
>
> 2017-03-18 12:55:16: pid 12257: LOG:  received degenerate backend
> request for node_id: 0 from pid [12257]
>
> 2017-03-18 12:55:16: pid 12253: LOG:  starting degeneration. shutdown
> host 10.11.11.72(5432)
>
> 2017-03-18 12:55:16: pid 12253: LOG:  Restart all children
>
> 2017-03-18 12:55:16: pid 12744: LOG:  child process received shutdown
> request signal 3
>
>
>
> However, show pool_nodes shows the following:
>
> uaa=# show pool_nodes;
>  node_id | hostname  | port | status | lb_weight |  role   | select_cnt
> ---------+-----------+------+--------+-----------+---------+------------
>  0       | 10.3.6.21 | 5432 | 2      | 0.500000  | primary | 0
>  1       | 10.3.6.29 | 5432 | 2      | 0.500000  | standby | 0
> (2 rows)
>
>
> Can someone help me understand how the split brain happened and
> how to protect against it? Please let me know if anything else is
> needed.
>
> Regards,
> Subhankar Chattopadhyay
> Bangalore, India
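
For anyone trying to line up the timestamps in the quoted log, here is a minimal sketch (not part of pgpool, just a helper for reading this excerpt) that strips the mail quoting, rejoins the wrapped lines, and picks out the failover-related entries. It assumes the "timestamp: pid N: LEVEL: message" layout seen in the excerpt above; the names parse_entries and failover_events are made up for illustration, and any line that does not start with a timestamp is treated as a continuation of the previous entry.

```python
import re
from datetime import datetime

# Matches lines shaped like the excerpt, e.g.
# 2017-03-18 12:55:16: pid 12257: LOG:  received degenerate backend
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): "
    r"pid (?P<pid>\d+): (?P<level>[A-Z]+):\s+(?P<msg>.*)$"
)


def unquote(line):
    """Strip the leading '>' markers that mail quoting adds."""
    return re.sub(r"^(?:>\s?)+", "", line.rstrip("\n"))


def parse_entries(lines):
    """Return a list of (timestamp, pid, level, message) tuples.

    Lines without a leading timestamp are assumed to be the wrapped
    tail of the previous entry and are appended to its message.
    """
    entries = []
    for raw in lines:
        text = unquote(raw).strip()
        if not text:
            continue  # blank (or bare '>') separator line
        m = LINE_RE.match(text)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
            entries.append([ts, int(m.group("pid")), m.group("level"), m.group("msg")])
        elif entries:
            entries[-1][3] += " " + text
    return [tuple(e) for e in entries]


def failover_events(entries):
    """Keep only the entries that mention a backend degeneration."""
    keys = ("degenerate backend request", "starting degeneration")
    return [e for e in entries if any(k in e[3] for k in keys)]
```

Feeding it the saved log (for example via open("pgpool.log"), where the file name is only a placeholder) and printing failover_events(parse_entries(...)) should show which pid requested the degeneration and when, which can then be compared against the connection retries just before it.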



-- 
Subhankar Chattopadhyay
Bangalore, India

