View Issue Details

ID: 0000044
Project: Pgpool-II
Category: Bug
View Status: public
Last Update: 2014-11-14 17:24
Reporter: tuomas
Assigned To: anzai
Priority: high
Severity: crash
Reproducibility: have not tried
Status: resolved
Resolution: open
Platform: amd64
OS: ubuntu
OS Version: 12.04
Product Version:
Target Version:
Fixed in Version:
Summary: 0000044: pgpool went haywire after slave shutdown triggering master failover
Description: Running pgpool with 3 db-nodes in streaming master-slave replication mode, no load balancing, etc.

Shutting down the slave server caused pgpool to look for a new primary, and that failed.
Then it tried to make another server the primary, but that failed too.

The previous master and the third server were working fine, and there were no connectivity issues with them.

The PostgreSQL server log is full of "incomplete startup packet" messages.
Additional Information: db-node0 = current primary, db-node1 = slave, db-node2 = slave

Log:
### node2 goes down
Dec 10 14:32:08 yellowfin pgpool[20888]: connect_inet_domain_socket: connect() failed: Network is unreachable
Dec 10 14:32:08 yellowfin pgpool[20888]: make_persistent_db_connection: connection to db-node2(5432) failed
Dec 10 14:32:08 yellowfin pgpool[20888]: health check failed. 2 th host db-node2 at port 5432 is down
Dec 10 14:32:08 yellowfin pgpool[20888]: health check retry sleep time: 1 second(s)
...
Dec 10 14:32:16 yellowfin pgpool[2000]: connect_inet_domain_socket: connect() failed: Network is unreachable
Dec 10 14:32:16 yellowfin pgpool[2000]: connection to db-node2(5432) failed
Dec 10 14:32:16 yellowfin pgpool[2000]: new_connection: create_cp() failed
Dec 10 14:32:16 yellowfin pgpool[2000]: degenerate_backend_set: 2 fail over request from pid 2000
...
Dec 10 14:32:29 yellowfin pgpool[20888]: connect_inet_domain_socket_by_port: health check timer expired
Dec 10 14:32:29 yellowfin pgpool[20888]: make_persistent_db_connection: connection to db-node2(5432) failed
Dec 10 14:32:29 yellowfin pgpool[20888]: health check failed. 2 th host db-node2 at port 5432 is down
Dec 10 14:32:29 yellowfin pgpool[20888]: health check retry sleep time: 1 second(s)
Dec 10 14:32:29 yellowfin pgpool[20888]: starting degeneration. shutdown host db-node2(5432)
Dec 10 14:32:29 yellowfin pgpool[20888]: Restart all children
### Why restart all children if a slave goes down, breaking all connections?


Dec 10 14:32:29 yellowfin pgpool[20888]: execute command: /usr/local/bin/pg_failover.sh 2 db-node0 /etc/pg_master_trigger 0 0 0
Dec 10 14:32:29 yellowfin pgpool[20888]: find_primary_node_repeatedly: waiting for finding a primary node
Dec 10 14:32:29 yellowfin pgpool[20888]: connect_inet_domain_socket_by_port: health check timer expired
...
### For some reason it keeps failing, although the server and connectivity are fine:
Dec 10 14:32:49 yellowfin pgpool[20888]: make_persistent_db_connection: connection to db-node0(5432) failed
Dec 10 14:32:49 yellowfin pgpool[20888]: find_primary_node: make_persistent_connection failed
Dec 10 14:33:59 yellowfin pgpool[20888]: failover: set new primary node: -1
Dec 10 14:33:59 yellowfin pgpool[20888]: failover: set new master node: 0
Dec 10 14:33:59 yellowfin pgpool[4873]: connect_inet_domain_socket_by_port: health check timer expired
Dec 10 14:33:59 yellowfin pgpool[4873]: connection to db-node0(5432) failed
Dec 10 14:33:59 yellowfin pgpool[4873]: new_connection: create_cp() failed
Dec 10 14:33:59 yellowfin pgpool[4873]: degenerate_backend_set: 0 fail over request from pid 4873
Dec 10 14:33:59 yellowfin pgpool[4874]: connect_inet_domain_socket_by_port: health check timer expired
Dec 10 14:33:59 yellowfin pgpool[4874]: connection to db-node0(5432) failed
Dec 10 14:33:59 yellowfin pgpool[4874]: new_connection: create_cp() failed
Dec 10 14:33:59 yellowfin pgpool[4876]: degenerate_backend_set: 0 fail over request from pid 4876
Dec 10 14:33:59 yellowfin pgpool[4878]: connect_inet_domain_socket_by_port: health check timer expired
Dec 10 14:33:59 yellowfin pgpool[4872]: connect_inet_domain_socket_by_port: health check timer expired
Dec 10 14:33:59 yellowfin pgpool[4878]: connection to db-node0(5432) failed
Dec 10 14:33:59 yellowfin pgpool[4878]: new_connection: create_cp() failed
Dec 10 14:33:59 yellowfin pgpool[4872]: connection to db-node0(5432) failed
Dec 10 14:33:59 yellowfin pgpool[4872]: new_connection: create_cp() failed
Dec 10 14:33:59 yellowfin pgpool[4878]: degenerate_backend_set: 0 fail over request from pid 4878
Dec 10 14:33:59 yellowfin pgpool[4872]: degenerate_backend_set: 0 fail over request from pid 4872
Dec 10 14:33:59 yellowfin pgpool[4924]: connect_inet_domain_socket_by_port: health check timer expired
Dec 10 14:33:59 yellowfin pgpool[4939]: connection to db-node0(5432) failed
Dec 10 14:33:59 yellowfin pgpool[4914]: new_connection: create_cp() failed
Dec 10 14:34:00 yellowfin pgpool[20888]: worker child 31741 exits with status 256
Dec 10 14:34:00 yellowfin pgpool[20888]: fork a new worker child pid 5173
Dec 10 14:34:00 yellowfin pgpool[20888]: starting degeneration. shutdown host db-node0(5432)
Dec 10 14:34:00 yellowfin pgpool[20888]: Restart all children
Dec 10 14:34:00 yellowfin pgpool[20888]: execute command: /usr/local/bin/pg_failover.sh 0 db-node1 /etc/pg_master_trigger 0 0 1
Dec 10 14:34:00 yellowfin pgpool[5173]: connect_inet_domain_socket_by_port: health check timer expired
Dec 10 14:34:00 yellowfin pgpool[5173]: make_persistent_db_connection: connection to db-node1(5432) failed
Dec 10 14:34:00 yellowfin pgpool[31742]: pcp child process received restart request
Dec 10 14:34:10 yellowfin pgpool[5173]: connect_inet_domain_socket_by_port: health check timer expired
Dec 10 14:34:10 yellowfin pgpool[5173]: make_persistent_db_connection: connection to db-node1(5432) failed
Dec 10 14:34:20 yellowfin pgpool[5173]: connect_inet_domain_socket_by_port: health check timer expired
Dec 10 14:34:20 yellowfin pgpool[5173]: make_persistent_db_connection: connection to db-node1(5432) failed
Dec 10 14:34:30 yellowfin pgpool[5173]: connect_inet_domain_socket_by_port: health check timer expired
Dec 10 14:34:30 yellowfin pgpool[5173]: make_persistent_db_connection: connection to db-node1(5432) failed
Dec 10 14:34:38 yellowfin pgpool[20888]: find_primary_node_repeatedly: waiting for finding a primary node
Dec 10 14:34:38 yellowfin pgpool[20888]: connect_inet_domain_socket_by_port: health check timer expired
Dec 10 14:34:38 yellowfin pgpool[20888]: make_persistent_db_connection: connection to db-node1(5432) failed
Dec 10 14:34:38 yellowfin pgpool[20888]: find_primary_node: make_persistent_connection failed
Dec 10 14:34:39 yellowfin pgpool[20888]: connect_inet_domain_socket_by_port: health check timer expired
Dec 10 14:34:39 yellowfin pgpool[20888]: make_persistent_db_connection: connection to db-node1(5432) failed
Dec 10 14:34:39 yellowfin pgpool[20888]: find_primary_node: make_persistent_connection failed
Dec 10 14:34:40 yellowfin pgpool[5173]: connect_inet_domain_socket_by_port: health check timer expired
Dec 10 14:34:40 yellowfin pgpool[5173]: make_persistent_db_connection: connection to db-node1(5432) failed
Dec 10 14:34:40 yellowfin pgpool[20888]: connect_inet_domain_socket_by_port: health check timer expired
Dec 10 14:34:40 yellowfin pgpool[20888]: make_persistent_db_connection: connection to db-node1(5432) failed
...
Tags: No tags attached.

Activities

tuomas

2012-12-13 07:11

reporter   ~0000180

Probably worth adding that after restarting pgpool, it was able to connect to the servers immediately without any issues.

I also got another bug:

pgpool was started in a state where 2 out of 3 servers (the master and one slave) were down. pgpool failed to make the last available server a master (it didn't even try), and instead sent write requests to the remaining slave server:

Dec 12 17:31:52 yellowfin pgpool[27909]: read_status_file: 0 th backend is set to down status
Dec 12 17:31:52 yellowfin pgpool[27909]: read_status_file: 2 th backend is set to down status
Dec 12 17:31:52 yellowfin pgpool[27909]: pgpool-II successfully started. version 3.2.1 (namameboshi)
Dec 12 17:31:52 yellowfin pgpool[28110]: connection received: host=app01 port=50191
...
Dec 12 17:32:16 yellowfin pgpool[27965]: pool_send_and_wait: Error or notice message from backend: : DB node id: 1 backend pid: 8851 statement: UPDATE ... message: cannot execute UPDATE in a read-only transaction
...

tuomas

2012-12-13 07:29

reporter   ~0000182

Is commit dbfa5e95bb2a60f5604857b7a3575d87bc270b7c related to the first issue reported here? Would it fix it?

tuomas

2012-12-15 02:56

reporter   ~0000187

I tried the commit (dbfa5e95bb2a60f5604857b7a3575d87bc270b7c), but was able to reproduce the problem with it as well.

I managed to reproduce it by adding artificial lag on a SLAVE server with:

slave# tc qdisc add dev eth0 root netem delay 8000ms

What happened is that all health checks to the MASTER started to expire.

pgpool[28820]: connect_inet_domain_socket: connect() failed: Network is unreachable
pgpool[28820]: connection to slave(5432) failed
pgpool[28820]: new_connection: create_cp() failed
pgpool[28820]: degenerate_backend_set: 0 fail over request from pid 28820
pgpool[27840]: pool_read: read failed (Connection reset by peer)
kernel: [2511450.273362] pgpool[27840]: segfault at 8 ip 000000000040978f sp 00007fffb3be9970 error 4 in pgpool[400000+f0000]
pgpool[23725]: health_check: health check timer has been already expired before attempting to connect to 1 th backend
pgpool[23725]: health check retry sleep time: 1 second(s)
pgpool[23725]: starting degeneration. shutdown host slave(5432)
pgpool[23725]: Restart all children
pgpool[23725]: execute command: /usr/local/bin/pg_failover.sh 0 master /etc/pg_master_trigger 1 0 1
pgpool[23725]: find_primary_node_repeatedly: waiting for finding a primary node
pgpool[23725]: connect_inet_domain_socket_by_port: health check timer expired
pgpool[23725]: make_persistent_db_connection: connection to master(5432) failed
pgpool[23725]: find_primary_node: make_persistent_connection failed
pgpool[23725]: connect_inet_domain_socket_by_port: health check timer expired
pgpool[23725]: make_persistent_db_connection: connection to master(5432) failed
pgpool[23725]: find_primary_node: make_persistent_connection failed
... loop repeats until pgpool restart...

Here is the backtrace of the segmentation fault mentioned in the log above:
#0  0x000000000040978f in send_frontend_exits () at child.c:1335
#1  0x000000000040a20a in child_exit (code=1) at child.c:1411
#2  0x000000000041cdce in pool_read (cp=0x104d320, buf=<optimized out>, len=1) at pool_stream.c:175
#3  0x0000000000409095 in s_do_auth (cp=0x1050110, password=0x1043611 "<filtered>") at child.c:1604
#4  0x000000000040ae18 in make_persistent_db_connection (hostname=<optimized out>, port=<optimized out>, dbname=0x47a02b "postgres", user=0x1043ef1 "pgpool", password=0x1043611 "<filtered>", retry=<optimized out>) at child.c:1543
#5  0x0000000000455220 in establish_persistent_connection () at pool_worker_child.c:163
#6  do_worker_child () at pool_worker_child.c:133
#7  0x000000000040729f in worker_fork_a_child () at main.c:1283
#8  0x0000000000407bde in reaper () at main.c:2284
#9  reaper () at main.c:2225
#10 0x000000000040829d in pool_sleep (second=<optimized out>) at main.c:2500
#11 0x0000000000405a38 in main (argc=<optimized out>, argv=<optimized out>) at main.c:845
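
The "segfault at 8" fault address, together with the crash in send_frontend_exits() right after the failed pool_read(), looks consistent with the exit path touching a connection slot that was never fully set up. A minimal sketch of that bug class only (every name below is made up; this is not pgpool's code):

/* Illustration of the suspected bug class only -- all names are made up;
 * this is not pgpool's code.  A shutdown path walks the frontend connection
 * slots and reads a member of a slot that was never created, so it
 * dereferences a NULL pointer at a small offset (cf. "segfault at 8"
 * in the kernel log above). */
#include <stdio.h>

typedef struct
{
    int   fd;
    char *startup_packet;      /* on 64-bit builds this member sits at offset 8 */
} FrontendConn;

#define MAX_SLOTS 4
static FrontendConn *slots[MAX_SLOTS];   /* entries stay NULL after a failed connect */

static void send_exits_on_shutdown(void)
{
    for (int i = 0; i < MAX_SLOTS; i++)
    {
        /* BUG: no NULL check.  The fix is simply:
         *     if (slots[i] == NULL) continue;      */
        printf("slot %d: startup packet at %p\n",
               i, (void *) slots[i]->startup_packet);
    }
}

int main(void)
{
    static FrontendConn c0 = { 42, NULL };
    slots[0] = &c0;              /* only one connection was ever established  */
    send_exits_on_shutdown();    /* crashes on slots[1], which is still NULL  */
    return 0;
}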

tuomas

2012-12-15 10:30

reporter   ~0000189

I don't quite understand why the health check timer is global to all backends. The problem is probably that health_check_timer_expired is not reset to 0 after "health check timer has been already expired before attempting to connect to 1 th backend"; the flag then carries over into failover() and find_primary_node(), so every connection attempt made there fails instantly.
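
To illustrate what I suspect is happening, here is a minimal sketch. Only health_check_timer_expired is a real pgpool symbol; every other name below is made up for illustration, and this is not pgpool's actual code:

/* Minimal sketch of the suspected pattern -- not pgpool's actual code.
 * Only health_check_timer_expired is a real pgpool symbol; everything
 * else is made up for illustration. */
#include <stdio.h>

static volatile int health_check_timer_expired = 0;   /* set by the SIGALRM handler */

static int connect_backend(const char *host)
{
    /* Bails out as soon as the global flag is set, which matches the
     * "health check timer expired" lines in the log above. */
    if (health_check_timer_expired)
    {
        fprintf(stderr, "connection to %s(5432) failed: health check timer expired\n", host);
        return -1;
    }
    /* ... a real connect() attempt would happen here ... */
    return 0;
}

static void failover_and_find_primary(void)
{
    /* Suspected fix: clear the flag once the dead node has been degenerated,
     * so the search for a new primary can actually reach the healthy nodes. */
    health_check_timer_expired = 0;
    connect_backend("db-node0");
}

int main(void)
{
    health_check_timer_expired = 1;   /* the timer fired while probing db-node2    */
    connect_backend("db-node0");      /* fails instantly although db-node0 is fine */
    failover_and_find_primary();      /* after the reset, the connect path runs    */
    return 0;
}

With the flag left at 1, every connect attempt made by failover() and find_primary_node() fails immediately, exactly as in the log above; resetting it before the primary search would let the healthy nodes be reached.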

anzai

2012-12-18 13:38

developer   ~0000197

The following bug was fixed by Tatsuo's commit 231b292763c68ec927f03362bb8bce8d6fb75a73:
http://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=231b292763c68ec927f03362bb8bce8d6fb75a73

----
pgpool[27840]: pool_read: read failed (Connection reset by peer)
kernel: [2511450.273362] pgpool[27840]: segfault at 8 ip 000000000040978f sp 00007fffb3be9970 error 4 in pgpool[400000+f0000]
pgpool[23725]: health_check: health check timer has been already expired before attempting to connect to 1 th backend
----

I'm still debugging the failure to find the new primary. Please wait a bit.

tuomas

2013-01-15 21:28

reporter   ~0000213

Any updates on this? Were you able to reproduce?

anzai

2013-01-31 18:49

developer   ~0000222

Sorry, I'm still trying to reproduce it...
Could you show me your pgpool.conf, pg_failover.sh, and each node's recovery.conf?

What I want to know is:

- Does it occur at failover to the 1st standby, or at failover to the 2nd standby after the 1st failover?
- Do the 2 standbys have the same primary_conninfo in their recovery.conf?
  From which server do the remaining servers try to stream WAL?
- the health check period, retry count, and retry delay
- delay_threshold
- as much detail about the reproduction steps as possible


> pgpool was started in a state where 2 out of 3 servers (the master and one slave) were down. pgpool failed to make the last available server a master (it didn't even try), and instead sent write requests to the remaining slave server:

Did you start pgpool with the "-D" option (discard the pgpool_status file) and without health checks at that time?
If you set health_check_period to 0 (off), pgpool cannot recognize that a down server is down and keeps trying to connect forever, with log messages like the following.

----
2013-01-31 18:09:53 ERROR: pid 11479: connect_inet_domain_socket: connect() failed: Connection refused
2013-01-31 18:09:53 ERROR: pid 11479: make_persistent_db_connection: connection to localhost(15432) failed
2013-01-31 18:09:53 ERROR: pid 11479: check_replication_time_lag: could not connect to DB node 0, check sr_check_user and sr_check_password
----

tuomas

2013-03-13 18:44

reporter   ~0000240

All standbys have the same recovery.conf; those are generated automatically. All slaves follow the master. pgpool is never started with -D, and health checks are always enabled.

health_check_period = 10
health_check_timeout = 20
health_check_max_retries = 3
health_check_retry_delay = 1

delay_threshold = 1
load_balance_mode = off

---

There have been a couple of further issues since this one.
 - 3 servers, node0: slave, node1: slave, node2: primary -> When both slaves dropped off due to network maintenance, pgpool lost the primary completely and wasn't able to recover, although the current primary was up and running without any issues the whole time. Before that, both slaves had been manually restarted one at a time, which pgpool handled just fine.

 - 1 slave had network issues (node0: slave, node1: slave, node2: primary), which resulted in this:

connection to node0(5432) failed
new_connection: create_cp() failed
degenerate_backend_set: 0 fail over request from pid 3925
starting degeneration. shutdown host node0(5432)
Restart all children
find_primary_node_repeatedly: waiting for finding a primary node
find_primary_node: primary node id is 2
failover: set new primary node: 2
failover: set new master node: 1
failover done. shutdown host node0(5432)
worker process received restart request
pool_read_kind: kind does not match between master(0) slot[0] (52)
pool_read_kind: kind does not match between master(0) slot[0] (52)
pool_read_kind: kind does not match between master(0) slot[0] (52)
pool_read_kind: kind does not match between master(0) slot[0] (52)
pcp child process received restart request
worker child 25227 exits with status 256
fork a new worker child pid 23674
pool_read_kind: kind does not match between master(0) slot[0] (52)
pool_read_kind: kind does not match between master(0) slot[0] (52)
pool_read_kind: kind does not match between master(0) slot[0] (52)
<all new connections/queries keep failing>

t-ishii

2014-11-14 17:22

developer   ~0000493

Probably fixed by:
http://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=13d8da7f97a31e29e38972fceaf5b173a7ac430d

Issue History

Date Modified Username Field Change
2012-12-11 00:33 tuomas New Issue
2012-12-13 07:11 tuomas Note Added: 0000180
2012-12-13 07:29 tuomas Note Added: 0000182
2012-12-13 11:27 anzai Assigned To => anzai
2012-12-13 11:27 anzai Status new => assigned
2012-12-15 02:56 tuomas Note Added: 0000187
2012-12-15 10:30 tuomas Note Added: 0000189
2012-12-18 13:38 anzai Note Added: 0000197
2013-01-15 21:28 tuomas Note Added: 0000213
2013-01-31 18:49 anzai Note Added: 0000222
2013-03-13 18:44 tuomas Note Added: 0000240
2014-11-14 17:22 t-ishii Note Added: 0000493
2014-11-14 17:24 t-ishii Status assigned => resolved