View Issue Details

ID: 0000162
Project: Pgpool-II
Category: Bug
View Status: public
Last Update: 2016-01-04 15:03
Reporter: harukat
Assigned To: nagata
Priority: normal
Severity: minor
Reproducibility: always
Status: resolved
Resolution: fixed
Product Version: 3.4.0
Target Version:
Fixed in Version:
Summary: 0000162: To start watchdog pgpool nodes simultaneously is not safe
Description: Starting watchdog pgpool nodes simultaneously is not safe.
Any of the following can happen:

- "FATAL: failed to initialize watchdog, delegate_IP "x.x.x.x" already exists"
- Two or more master nodes are born.
- The nodes do not agree on which nodes are invalid.

There is no robust initial coordination logic.
So I think the manual should state clearly that the nodes must be started one by one.
Steps To Reproduce: Tested version: pgpool2-V3_4_STABLE-3d2093e (snapshot@2015-12-20) + patch

The patch file fixes the 'DETAIL: connect() reports failure "Success"' message,
but it is not related to the main issue.

I set up the following watchdog cluster and started the 4 pgpools simultaneously.
(pg1, pg2, ... are hostnames in my test environment.)

 [pg1 (pgpool)] ---------+- [pg3 (PostgreSQL)]
 [pg2 (pgpool)] ---------+
 [pgpool1 (pgpool)] -----+
 [pgpool2 (pgpool)] -----+
Tags: No tags attached.

Activities

harukat

2015-12-20 20:28

developer  

pgpool_conf_and_patch.zip (26,316 bytes)

nagata

2015-12-25 14:25

developer   ~0000622

I can confirm this problem. Yes, the watchdog fails to start when the pgpools are started at too short an interval.

At start-up, (1) pgpool tries to connect to the other pgpools. If there is no other pgpool, then (2) it checks that the VIP does not exist. If the VIP has not been brought up, (3) it becomes master and brings the VIP up. After that, (4) a watchdog child process is forked, and only at that point is the watchdog ready to accept connections from other pgpools.

However, when another pgpool starts between (3) and (4), it tries to become master because no pgpool can accept its connection before (4), but it fails with "delegate_IP xxx already exists" because the VIP was already brought up at (3).
(- "FATAL: failed to initialize watchdog, delegate_IP "x.x.x.x" already exists")

Second, when another pgpool starts between (1) and (2), no pgpool accepts its connection and the VIP has not been brought up, so it succeeds in becoming another master. What is worse, this escalation is not announced, because the first pgpool does not fork its watchdog child until (4). This is a split-brain situation.
(- Two or more master nodes are born.)

In the above scenario, the first pgpool sometimes accepts the escalation announcement successfully after (4). In this case, the first pgpool becomes standby and the second pgpool becomes master, but the second pgpool never receives any packet from the first pgpool, so the nodes' views of the cluster can become inconsistent.
(- The nodes do not agree on which nodes are invalid.)
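
The two race windows described above can be illustrated with a small deterministic model. This is only a sketch under stated assumptions: the structs, the step functions, and the two scenario functions are hypothetical names made up for illustration and are not pgpool's watchdog code. Each step function corresponds to one of the numbered steps (1) to (4), and each scenario replays one interleaving of two starting nodes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of what a starting node can observe; not pgpool's real data. */
struct cluster {
    bool vip_up;     /* delegate IP has been brought up, step (3) */
    int  listeners;  /* watchdog children accepting peers, step (4) */
    int  masters;    /* nodes that escalated to master */
};

struct node {
    bool saw_peer;   /* result of step (1) */
    bool saw_vip;    /* result of step (2) */
};

enum outcome { BECAME_MASTER, BECAME_STANDBY, FATAL_VIP_EXISTS };

/* Step (1): a peer is only visible once its watchdog child is listening. */
static void step1_connect_peers(struct node *n, const struct cluster *c)
{
    assert(n != NULL && c != NULL);
    n->saw_peer = (c->listeners > 0);
}

/* Step (2): check whether the VIP already exists. */
static void step2_check_vip(struct node *n, const struct cluster *c)
{
    assert(n != NULL && c != NULL);
    n->saw_vip = c->vip_up;
}

/* Step (3): act on what steps (1) and (2) observed earlier. */
static enum outcome step3_escalate(const struct node *n, struct cluster *c)
{
    if (n->saw_peer)
        return BECAME_STANDBY;
    if (n->saw_vip)
        return FATAL_VIP_EXISTS;
    c->vip_up = true;
    c->masters++;
    return BECAME_MASTER;
}

/* Step (4): only now can other nodes connect to this one. */
static void step4_fork_watchdog(struct cluster *c)
{
    c->listeners++;
}

/* Node B starts between A's (3) and (4): returns B's outcome. */
static enum outcome race_after_vip_up(void)
{
    struct cluster c = {0};
    struct node a = {0}, b = {0};
    enum outcome b_result;

    step1_connect_peers(&a, &c);
    step2_check_vip(&a, &c);
    (void) step3_escalate(&a, &c);      /* A is master, VIP is up */
    step1_connect_peers(&b, &c);        /* nobody is listening yet */
    step2_check_vip(&b, &c);            /* but the VIP already exists */
    b_result = step3_escalate(&b, &c);
    step4_fork_watchdog(&c);            /* A becomes reachable too late */
    return b_result;
}

/* Node B starts between A's (1) and (2): returns the number of masters. */
static int race_before_vip_check(void)
{
    struct cluster c = {0};
    struct node a = {0}, b = {0};

    step1_connect_peers(&a, &c);
    step1_connect_peers(&b, &c);
    step2_check_vip(&a, &c);
    step2_check_vip(&b, &c);            /* VIP not up yet for either node */
    (void) step3_escalate(&a, &c);
    (void) step3_escalate(&b, &c);      /* B acts on its stale observation */
    return c.masters;
}
```

In this model, `race_after_vip_up()` ends with the second node hitting the "already exists" case, and `race_before_vip_check()` ends with two masters, because each node decides in step (3) based on observations that another node has since invalidated.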

I tried to resolve this problem, but I have not found a good solution yet. It is a good idea to add a description of this restriction to the documentation. I'll do it.


Your patch seems to be good. I'll look into it and commit it.

nagata

2015-12-28 11:58

developer   ~0000627

I added the description to the documentation.
http://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=68cbdd88477f8294cab0324d627f8cff213df958

About the patch, I got confused... Could you explain why errno comes to be 0 even if connect() returns -1 and how the patch fixes it?

harukat

2015-12-29 12:02

developer   ~0000631

> About the patch, I got confused... Could you explain why errno comes to be 0
> even if connect() returns -1 and how the patch fixes it?

You have probably often seen messages like "reports failure: Success" from pgpool.
I think all code of the following form is unsafe because it uses errno directly.

  ereport(LOG, (errmsg("..."),
    errdetail("... %s ...", strerror(errno)), errhint("...")));

The order in which errmsg(), errdetail() and errhint() are evaluated is not fixed;
it depends on the compiler.
errmsg() and errhint() call system calls and library functions, which can change
errno before strerror(errno) is evaluated.
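
A common way to avoid this is to copy errno into a local variable right after the failing call and to pass that copy to strerror(). The block below is a minimal sketch of the pattern, not the committed patch: format_message() and format_detail() are hypothetical stand-ins for errmsg() and errdetail(), and the clobbering of errno is simulated by hand.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for errmsg()/errhint(): like many library calls,
 * it may leave errno changed as a side effect (simulated here). */
static const char *format_message(const char *text)
{
    errno = 0;
    return text;
}

/* Hypothetical stand-in for errdetail(): formats the error code it is given
 * instead of reading the global errno. */
static int format_detail(char *buf, size_t len, int err)
{
    assert(buf != NULL);
    return snprintf(buf, len, "connect() reports failure \"%s\"", strerror(err));
}

/* Safe pattern: capture errno before any other call can overwrite it. */
static int report_connect_failure(char *buf, size_t len)
{
    int saved_errno = errno;                      /* capture immediately */

    (void) format_message("failed to connect");   /* may clobber errno */
    return format_detail(buf, len, saved_errno);  /* uses the saved copy */
}
```

With this shape the reported detail no longer depends on the order in which the message-building calls run, because none of them reads errno after it may have been overwritten.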

nagata

2016-01-04 12:59

developer   ~0000634

Thanks. I understand. I'll commit this.

nagata

2016-01-04 15:03

developer   ~0000635

committed.

http://git.postgresql.org/gitweb?p=pgpool2.git;a=commitdiff;h=a485b2c73a9de96b751adb545f06a6e9f09d86cb

Issue History

Date Modified Username Field Change
2015-12-20 20:28 harukat New Issue
2015-12-20 20:28 harukat File Added: pgpool_conf_and_patch.zip
2015-12-25 13:13 t-ishii Assigned To => nagata
2015-12-25 13:13 t-ishii Status new => assigned
2015-12-25 13:13 t-ishii Description Updated
2015-12-25 13:13 t-ishii Steps to Reproduce Updated
2015-12-25 14:25 nagata Note Added: 0000622
2015-12-28 11:58 nagata Note Added: 0000627
2015-12-28 11:58 nagata Status assigned => feedback
2015-12-29 12:02 harukat Note Added: 0000631
2015-12-29 12:02 harukat Status feedback => assigned
2016-01-04 12:59 nagata Note Added: 0000634
2016-01-04 15:03 nagata Note Added: 0000635
2016-01-04 15:03 nagata Status assigned => resolved
2016-01-04 15:03 nagata Resolution open => fixed