[pgpool-general: 5381] pgpool split brain situation : both nodes becoming master

Subhankar Chattopadhyay subho.atg at gmail.com
Wed Mar 22 18:52:07 JST 2017


Hello,

We have a 2-node PostgreSQL setup with one node as master and the
other as slave. pgpool (3 nodes) handles failover and promotes the
slave to master if the old master goes down.

We suddenly noticed that both nodes had become master. We checked
the pgpool1 node and found that one failover had been triggered.
Around the same time, the pgpool log shows the following:

2017-03-14 22:43:49: pid 12810: LOG:  trying connecting to PostgreSQL
server on "10.11.11.72:5432" by INET socket

2017-03-14 22:43:49: pid 12810: DETAIL:  timed out. retrying...

2017-03-14 22:43:59: pid 12810: LOG:  trying connecting to PostgreSQL
server on "10.11.11.72:5432" by INET socket

2017-03-14 22:43:59: pid 12810: DETAIL:  timed out. retrying...

2017-03-14 22:44:09: pid 12810: LOG:  trying connecting to PostgreSQL
server on "10.11.11.72:5432" by INET socket

2017-03-14 22:44:09: pid 12810: DETAIL:  timed out. retrying...

2017-03-14 22:44:19: pid 12810: LOG:  trying connecting to PostgreSQL
server on "10.11.11.72:5432" by INET socket

2017-03-14 22:44:19: pid 12810: DETAIL:  timed out. retrying...

2017-03-14 22:44:29: pid 12810: LOG:  trying connecting to PostgreSQL
server on "10.11.11.72:5432" by INET socket

2017-03-14 22:44:29: pid 12810: DETAIL:  timed out. retrying...

2017-03-14 22:44:39: pid 12810: LOG:  trying connecting to PostgreSQL
server on "10.11.11.72:5432" by INET socket

2017-03-14 22:44:39: pid 12810: DETAIL:  timed out. retrying...

2017-03-14 22:44:49: pid 12810: LOG:  trying connecting to PostgreSQL
server on "10.11.11.72:5432" by INET socket

2017-03-14 22:44:49: pid 12810: DETAIL:  timed out. retrying...

2017-03-14 22:44:56: pid 12810: LOG:  failed to connect to PostgreSQL
server on "10.11.11.72:5432", getsockopt() detected error "Connection
timed out"

2017-03-14 22:44:56: pid 12810: ERROR:  failed to make persistent db connection

2017-03-14 22:44:56: pid 12810: DETAIL:  connection to
host:"10.11.11.72:5432" failed

2017-03-14 22:45:06: pid 12810: ERROR:  Failed to check replication time lag

2017-03-14 22:45:06: pid 12810: DETAIL:  No persistent db connection
for the node 0

2017-03-14 22:45:06: pid 12810: HINT:  check sr_check_user and sr_check_password

2017-03-14 22:45:06: pid 12810: CONTEXT:  while checking replication time lag

2017-03-16 11:12:59: pid 12257: LOG:  remote node
"Linux_c15e85d3-21a8-45e8-ae4e-07c2e7423782_9999" is shutting down

2017-03-16 11:12:59: pid 12257: LOG:  read from socket failed, remote
end closed the connection

2017-03-16 11:12:59: pid 12257: LOG:  read from socket failed, remote
end closed the connection

2017-03-16 11:13:19: pid 12257: LOG:  new watchdog node connection is
received from "10.11.11.75:3768"

2017-03-16 11:13:19: pid 12257: LOG:  new outbond connection to 10.11.11.75:9000

2017-03-16 15:09:43: pid 12257: LOG:  remote node
"Linux_c15e85d3-21a8-45e8-ae4e-07c2e7423782_9999" is shutting down

2017-03-16 15:09:43: pid 12257: LOG:  read from socket failed, remote
end closed the connection

2017-03-16 15:09:43: pid 12257: LOG:  read from socket failed, remote
end closed the connection

2017-03-16 15:10:00: pid 12257: LOG:  new watchdog node connection is
received from "10.11.11.75:37077"

2017-03-16 15:10:00: pid 12257: LOG:  new outbond connection to 10.11.11.75:9000

2017-03-17 12:06:04: pid 12810: LOG:  trying connecting to PostgreSQL
server on "10.11.11.72:5432" by INET socket

2017-03-17 12:06:04: pid 12810: DETAIL:  timed out. retrying...

2017-03-18 12:53:25: pid 12810: LOG:  trying connecting to PostgreSQL
server on "10.11.11.72:5432" by INET socket

2017-03-18 12:53:25: pid 12810: DETAIL:  timed out. retrying...

2017-03-18 12:53:35: pid 12810: LOG:  trying connecting to PostgreSQL
server on "10.11.11.72:5432" by INET socket

2017-03-18 12:53:35: pid 12810: DETAIL:  timed out. retrying...

2017-03-18 12:53:45: pid 12810: LOG:  trying connecting to PostgreSQL
server on "10.11.11.72:5432" by INET socket

2017-03-18 12:53:45: pid 12810: DETAIL:  timed out. retrying...

2017-03-18 12:53:55: pid 12810: LOG:  trying connecting to PostgreSQL
server on "10.11.11.72:5432" by INET socket

2017-03-18 12:53:55: pid 12810: DETAIL:  timed out. retrying...

2017-03-18 12:54:05: pid 12810: LOG:  trying connecting to PostgreSQL
server on "10.11.11.72:5432" by INET socket

2017-03-18 12:54:05: pid 12810: DETAIL:  timed out. retrying...

2017-03-18 12:54:15: pid 12810: LOG:  trying connecting to PostgreSQL
server on "10.11.11.72:5432" by INET socket

2017-03-18 12:54:15: pid 12810: DETAIL:  timed out. retrying...

2017-03-18 12:55:16: pid 12257: LOG:  received degenerate backend
request for node_id: 0 from pid [12257]

2017-03-18 12:55:16: pid 12253: LOG:  starting degeneration. shutdown
host 10.11.11.72(5432)

2017-03-18 12:55:16: pid 12253: LOG:  Restart all children

2017-03-18 12:55:16: pid 12744: LOG:  child process received shutdown
request signal 3
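
To make the timeline in the log above easier to follow, here is a
minimal Python sketch (not part of pgpool) that rejoins the lines
wrapped by the mail client, counts the "timed out. retrying..." entries
per day, and lists the degenerate backend requests. The function names
parse_log and summarize are hypothetical, and the regular expression
only assumes the "timestamp: pid N: LEVEL:" prefix visible in the
excerpt.

```python
import re
from collections import Counter

# Prefix as it appears in the excerpt, e.g.
# "2017-03-14 22:43:49: pid 12810: LOG:  message"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): pid (?P<pid>\d+): "
    r"(?P<level>[A-Z]+):\s+(?P<msg>.*)$"
)


def parse_log(text):
    """Return a list of (timestamp, pid, level, message) tuples.

    Lines without the timestamp prefix are treated as continuations of
    the previous entry, because long lines were wrapped in the mail.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = LINE_RE.match(line)
        if m:
            entries.append([m["ts"], int(m["pid"]), m["level"], m["msg"]])
        elif entries:
            entries[-1][3] += " " + line
    return [tuple(e) for e in entries]


def summarize(entries):
    """Count connection timeouts per day and collect degeneration requests."""
    timeouts = Counter()
    degenerations = []
    for ts, pid, level, msg in entries:
        if level == "DETAIL" and msg.startswith("timed out"):
            timeouts[ts[:10]] += 1  # key on the date part of the timestamp
        elif "degenerate backend request" in msg:
            degenerations.append((ts, msg))
    return timeouts, degenerations
```

Running summarize(parse_log(text)) on the full pgpool log would show
on which days the connection to node 0 timed out and when the
degeneration request was actually issued.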



However, show pool_nodes shows the following:

uaa=# show pool_nodes;
 node_id | hostname  | port | status | lb_weight |  role   | select_cnt
---------+-----------+------+--------+-----------+---------+------------
 0       | 10.3.6.21 | 5432 | 2      | 0.500000  | primary | 0
 1       | 10.3.6.29 | 5432 | 2      | 0.500000  | standby | 0
(2 rows)
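
Since show pool_nodes reports pgpool's own view of the roles, one way
to confirm a split brain is to ask each backend directly. The sketch
below is only an illustration: it assumes the booleans were collected
by running "SELECT pg_is_in_recovery()" on every PostgreSQL node (the
query returns false on a node that accepts writes), and the helper
names find_primaries and is_split_brain are hypothetical.

```python
def find_primaries(in_recovery_by_node):
    """Return the nodes that report they are NOT in recovery.

    in_recovery_by_node maps a node name or address to the boolean
    result of "SELECT pg_is_in_recovery()" run on that node.
    """
    return sorted(
        node for node, in_recovery in in_recovery_by_node.items()
        if not in_recovery
    )


def is_split_brain(in_recovery_by_node):
    """True when more than one node believes it is the primary."""
    return len(find_primaries(in_recovery_by_node)) > 1


# Hypothetical results gathered from the two backends listed above.
healthy = {"10.3.6.21": False, "10.3.6.29": True}
broken = {"10.3.6.21": False, "10.3.6.29": False}

print(is_split_brain(healthy))  # one primary, one standby
print(is_split_brain(broken))   # both nodes accept writes
```

If both nodes return false, both are accepting writes regardless of
what show pool_nodes displays.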


Can someone help me understand how the split brain happened and
how to protect against it? Please let me know if any other
information is needed.

Regards,
Subhankar Chattopadhyay
Bangalore, India

