[Pgpool-hackers] New patches for pcp_promote_node

Fri Mar 11 01:18:27 UTC 2011

> Yes I'm agree but it doesn't cover all cases too, please take a look at
> the following bug report
> http://pgfoundry.org/pipermail/pgpool-hackers/2011-January/000525.html

Yes, I have read it. Your scenario leads to so-called "split brain",
where there are two (or more) primary nodes.

According to you, the condition to reproduce the problem is:

>> You have to use an existing ip address /host running postmaster like an
>> other slave. This host must not have a wal writer process.

This does not seem to be a very common case, does it?

Also, "split brain" could occur even more easily:

- Node 0 (primary) is taken down by the administrator
- Node 1 is automatically promoted to be the new primary
- The stupid administrator decides to fail back node 0
- Now you have two primary nodes (split brain)!
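
To make the end state of this sequence concrete, here is a minimal sketch of how it could be detected. It assumes each node's "show transaction_read_only" result has already been collected; the helper names find_primaries() and is_split_brain() are made up for illustration and are not pgpool functions.

```python
def find_primaries(read_only_by_node):
    """Return the ids of nodes whose transaction_read_only is off (writable)."""
    return sorted(node for node, read_only in read_only_by_node.items()
                  if not read_only)


def is_split_brain(read_only_by_node):
    """True if more than one node accepts writes at the same time."""
    return len(find_primaries(read_only_by_node)) > 1


# State after node 0 is failed back while node 1 is already promoted:
# both nodes report transaction_read_only = off.
status = {0: False, 1: False}
print(find_primaries(status), is_split_brain(status))  # [0, 1] True
```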

I believe there are even more cases which could cause split brain. Your
scenario is just one of them. So unless your particular case is
the worst one and happens frequently, let's leave find_primary_node()
as it is.

> We need to fix that, any idea ? I've attached a video for demonstration
> in the last thread response.

One idea is for the failover script to wait up to N seconds for the
node being promoted to become a "true" primary (you can check this by
issuing "show transaction_read_only"). If it does not, the script
issues pg_ctl to shut down the standby that failed to promote.
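
The wait-then-shutdown idea above could be sketched roughly as follows. This is only an illustration under several assumptions: psql and ssh are available on the pgpool host, the function names, the "postgres" user and database, and the 30 second default are all made up, and any error while checking is treated as "still read-only".

```python
import subprocess
import time


def is_read_only(host, port=5432, user="postgres"):
    """Run SHOW transaction_read_only via psql; any failure counts as read-only."""
    try:
        result = subprocess.run(
            ["psql", "-h", host, "-p", str(port), "-U", user, "-d", "postgres",
             "-Atc", "SHOW transaction_read_only"],
            capture_output=True, text=True, timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return True
    return result.returncode != 0 or result.stdout.strip() != "off"


def shutdown_standby(host, pgdata):
    """Stop the postmaster with pg_ctl over ssh (needs ssh access to host)."""
    subprocess.run(["ssh", host, "pg_ctl", "-D", pgdata, "-m", "fast", "stop"],
                   check=False)


def wait_for_promotion(check_read_only, timeout_s, interval_s=1.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll until the node is writable; return False once timeout_s has passed."""
    deadline = clock() + timeout_s
    while True:
        if not check_read_only():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def guard_promotion(check_read_only, shutdown, timeout_s=30.0, interval_s=1.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Wait for promotion; shut the node down if it never becomes primary."""
    if wait_for_promotion(check_read_only, timeout_s, interval_s, sleep, clock):
        return "promoted"
    shutdown()
    return "shut down"


# Possible call from a failover script (host and data directory are placeholders):
# guard_promotion(lambda: is_read_only("node1"),
#                 lambda: shutdown_standby("node1", "/path/to/pgdata"))
```

The check and the shutdown are passed in as callables so that the waiting logic can be exercised without a running PostgreSQL.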

Probably we should have something like a "pgpool-shutdown-postmaster()"
function to shut down PostgreSQL. This would make writing a failover
script a lot easier than using pg_ctl. It would also reduce the
security risk, since using pg_ctl requires ssh access from the host
where pgpool is running to the host where PostgreSQL is running.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp