[pgpool-general: 1628] pgpool process dies inexplicably

Herouth Maoz herouth at unicell.co.il
Wed Apr 17 19:23:55 JST 2013


Hello.

I am testing pgpool for high availability. I succeeded in setting up pgpool using the packages that come with current Debian (v 3.1), and streaming replication, and it worked well in the testing environment. 

However, now I want to set up two pgpools that use watchdog to connect to each other. To do this I upgraded the Debian packages to v. 3.2, and everything seemed OK - until I decided to test server failover again. That is, the watchdog setup works, but streaming replication with pgpool fails.

I then disabled the watchdog and tried to do what succeeded in version 3.1 - just control two servers with one pgpool and no watchdog. The test I'm trying to do is bring pgpool up when one of the servers is down, and the other is active. My goal is for pgpool to see that it has one active server, and then to allow me to do a base backup from the live server to the "dead" server and start replicating again.

However, the main pgpool process dies immediately as I bring it up. All its subprocesses become attached to process #1.

This is what I see in the log (I set debug_level to 5):

Apr 17 13:00:27 pool01 pgpool: 2013-04-17 13:00:27 DEBUG: pid 14023: pool_initialize_private_backend_status: initialize backend status
Apr 17 13:00:27 pool01 pgpool: 2013-04-17 13:00:27 LOG:   pid 13983: pgpool-II successfully started. version 3.2.2 (namameboshi)
Apr 17 13:00:27 pool01 pgpool: 2013-04-17 13:00:27 DEBUG: pid 14025: I am 14025
Apr 17 13:00:27 pool01 pgpool: 2013-04-17 13:00:27 DEBUG: pid 14025: pool_initialize_private_backend_status: initialize backend status
Apr 17 13:00:27 pool01 pgpool: 2013-04-17 13:00:27 DEBUG: pid 13983: pool_ssl: SSL requested but SSL support is not available
Apr 17 13:00:27 pool01 pgpool: 2013-04-17 13:00:27 ERROR: pid 13983: pool_flush_it: write failed to backend (0). reason: Connection refused offset: 0 wlen: 41
Apr 17 13:00:27 pool01 pgpool: 2013-04-17 13:00:27 LOG:   pid 13983: degenerate_backend_set: 0 fail over request from pid 13983
Apr 17 13:00:27 pool01 pgpool: 2013-04-17 13:00:27 DEBUG: pid 14025: pool_ssl: SSL requested but SSL support is not available
Apr 17 13:00:27 pool01 pgpool: 2013-04-17 13:00:27 ERROR: pid 14025: pool_flush_it: write failed to backend (0). reason: Connection refused offset: 0 wlen: 41
Apr 17 13:00:27 pool01 pgpool: 2013-04-17 13:00:27 LOG:   pid 14025: degenerate_backend_set: 0 fail over request from pid 14025

At this point, listing the processes, I see:

UID        PID  PPID  C STIME TTY          TIME CMD
postgres 13984     1  0 13:00 pts/1    00:00:00 logger -t pgpool -p local0.info
postgres 13992     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 13993     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 13994     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 13995     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 13996     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 13997     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 13998     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 13999     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 14000     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 14001     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 14002     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 14003     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 14004     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 14005     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 14006     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 14007     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 14008     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 14009     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 14010     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 14011     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 14012     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 14013     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 14014     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 14015     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 14016     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 14017     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 14018     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 14019     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 14020     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 14021     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 14022     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 14023     1  0 13:00 pts/1    00:00:00 pgpool: wait for connection request
postgres 14024     1  0 13:00 pts/1    00:00:00 pgpool: PCP: wait for connection request

As you can see, process 14025 has died, and all its children are now children of process 1.

Attempting to psql (pgpool is at port 5432 so I don't need to add parameters):

postgres at pool01:/etc/pgpool2$ psql
psql: postgres at pool01:/etc/pgpool2$ 

So it just prints "psql:" without newline and then dies. The exit status is 2.

This gives me the following in the log:

Apr 17 13:14:20 pool01 pgpool: 2013-04-17 13:14:20 DEBUG: pid 14022: I am 14022 accept fd 6
Apr 17 13:14:20 pool01 pgpool: 2013-04-17 13:14:20 LOG:   pid 14022: connection received: host=[local]
Apr 17 13:14:20 pool01 pgpool: 2013-04-17 13:14:20 DEBUG: pid 14022: read_startup_packet: application_name: psql
Apr 17 13:14:20 pool01 pgpool: 2013-04-17 13:14:20 DEBUG: pid 14022: Protocol Major: 3 Minor: 0 database: postgres user: postgres
Apr 17 13:14:20 pool01 pgpool: 2013-04-17 13:14:20 DEBUG: pid 14022: new_connection: connecting 0 backend
Apr 17 13:14:20 pool01 pgpool: 2013-04-17 13:14:20 DEBUG: pid 14022: new_connection: connecting 1 backend
Apr 17 13:14:20 pool01 pgpool: 2013-04-17 13:14:20 DEBUG: pid 14022: pool_ssl: SSL requested but SSL support is not available
Apr 17 13:14:20 pool01 pgpool: 2013-04-17 13:14:20 ERROR: pid 14022: pool_flush_it: write failed to backend (0). reason: Connection refused offset: 0 wlen: 84
Apr 17 13:14:20 pool01 pgpool: 2013-04-17 13:14:20 LOG:   pid 14022: degenerate_backend_set: 0 fail over request from pid 14022
Apr 17 13:14:20 pool01 pgpool: 2013-04-17 13:14:20 ERROR: pid 14022: pool_flush_it: write failed to backend (0). reason: Broken pipe offset: 0 wlen: 5

And after this, process 14022 is also dead.

What could be the problem here? Why doesn't it see that it has one backend and just connect to it? This is how it should work if in the past it had failed over to the second backend. By the way, here is what I get when I try to show the status of the backends:

postgres at pool01:/etc/pgpool2$ /usr/sbin/pcp_node_info 10 localhost 9898 postgres pass123 0
pstgr01 5432 2 0.500000
postgres at pool01:/etc/pgpool2$ /usr/sbin/pcp_node_info 10 localhost 9898 postgres pass123 1
pstgr02 5432 2 0.500000

But this seems to say that he thinks both backends are OK!

Can anybody explain what is happening or what I should look for?

Thank you,
Herouth
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20130417/95134a69/attachment-0001.html>


More information about the pgpool-general mailing list