[Pgpool-general] Reloading the server while under load and recovering the degenerated node

Marcelo Martins pglists at zeroaccess.org
Mon Feb 23 22:14:26 UTC 2009


Hi Jaume,

After reading the message below, I decided to test the configuration
reload functionality, both through pgpool using the pg_reload_conf()
function and by reloading with the init.d script on the master node.
When using the init.d script to reload the configuration, I already
expected pgpool not to like it and to degenerate nodes.

In my personal opinion, I would not reload the configuration using the
init.d script on a single node, since this is a replication environment
and the configuration change would normally be made on both nodes. But
I see your point, and I tested it by modifying only one configuration
file.

I've tested this on both pgpool2 version 2.1 (CVS 20090120) and on
pgpool2 2.2 RC1 (patched).
At first, I thought that pgpool2 was not handling the reload of the
configuration on both nodes very well. But then I remembered a few
things and realized that what I was expecting to happen was wrong:

1) I knew that a "SELECT pg_reload_conf()" would only run on one node.

2) With "SELECT pg_reload_conf()" wrapped inside a transaction, I
expected it to run on both nodes (wrong). Since it is a SELECT, it
still runs on only one node; the transaction block, though, is indeed
opened on both nodes.

Then I remembered (from a previous email by Tatsuo) that in order to
make that function run on both nodes, I should add the /*REPLICATION*/
comment before the SELECT.

So, if I change the "log_statement" parameter on one of the nodes,
either node 0 or node 1, and reload the configuration files of both
nodes through pgpool using the command below, everything works fine and
nothing gets degenerated.

/*REPLICATION*/SELECT pg_reload_conf();
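To make the dispatch rule above easy to see, here is a tiny Python sketch. This is purely a toy model of the behavior described in this email, not pgpool's actual code: a plain SELECT is sent to a single backend, while one prefixed with the /*REPLICATION*/ comment goes to every backend.

```python
# Toy model of the dispatch rule described above (NOT pgpool's real logic):
# plain SELECTs run on one node; the /*REPLICATION*/ prefix forces all nodes.

def target_nodes(query, nodes):
    """Return the list of backend nodes a query would be sent to."""
    stripped = query.lstrip()
    if stripped.upper().startswith("/*REPLICATION*/"):
        return nodes          # the comment forces replication to every node
    if stripped.upper().startswith("SELECT"):
        return nodes[:1]      # a plain SELECT runs on one node only
    return nodes              # writes are replicated to all nodes

nodes = ["node0", "node1"]
print(target_nodes("SELECT pg_reload_conf();", nodes))                  # ['node0']
print(target_nodes("/*REPLICATION*/SELECT pg_reload_conf();", nodes))   # ['node0', 'node1']
```

This matches what I observed: only the prefixed form reloads the configuration on both backends.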


Regarding online recovery, it works fine for me, although I haven't
tried it while running pgbench in a while. I have, though, tested it
while pgpool and the databases were under quite a bit of load generated
by our in-house application.


Marcelo
PostgreSQL DBA
Linux/Solaris System Administrator


On Feb 23, 2009, at 5:47 AM, Jaume Sabater wrote:

> Hi everyone!
>
> I am opening this new thread to deal with this problem. It's now
> pgpool-II 2.2 RC2 under Debian GNU/Linux Lenny AMD64 with PostgreSQL
> 8.3.6. Two nodes, pgpool-II executing in node 0, which is the master
> node at the moment.
>
> Commands executed from my laptop:
>
> dropdb -h postgresql -p 9999 -U pgpool2 bench_replication
> createdb -h postgresql -p 9999 -U pgpool2 bench_replication
> createlang -h postgresql -p 9999 -U pgpool2 -d bench_replication plpgsql
> pgbench -i -h postgresql -p 9999 -U pgpool2 -d bench_replication
> pgbench -h postgresql -p 9999 -U pgpool2 -d bench_replication -c 2 -t 100
>
> During the execution of the last command, I reload the
> configuration on the master node. The change is in the file
> /etc/postgresql/8.3/main/postgresql.conf and consists of turning
> log_statement from "all" to "none". This is what appears in the syslog
> (from pgpool-II):
>
> Feb 23 12:12:37 pgsql1 pgpool: ERROR: pid 2632: pool_process_query: 1
> th kind C does not match with master or majority connection kind S
> Feb 23 12:12:37 pgsql1 pgpool: ERROR: pid 2632: kind mismatch among
> backends. Possible last query was: "BEGIN;" kind details are: 0[S]
> 1[C]
> Feb 23 12:12:37 pgsql1 pgpool: LOG:   pid 2632: notice_backend_error:
> 1 fail over request from pid 2632
> Feb 23 12:12:37 pgsql1 pgpool: LOG:   pid 1116: starting degeneration.
> shutdown host pgsql2.freyatest.domain(5432)
> Feb 23 12:12:37 pgsql1 pgpool: LOG:   pid 1116: execute command:
> /var/lib/postgresql/8.3/main/pgpool-failover 1 pgsql2.freyatest.domain
> 5432 /var/lib/postgresql/8.3/main 0 0
> Feb 23 12:12:37 pgsql1 pgpool[2887]: Executing pgpool-failover as  
> user postgres
> Feb 23 12:12:37 pgsql1 pgpool[2888]: Failover of node 1 at hostname
> pgsql2.freyatest.domain. New master node is 0. Old master node was 0.
> Feb 23 12:12:37 pgsql1 pgpool: LOG:   pid 1116: failover_handler: set
> new master node: 0
> Feb 23 12:12:37 pgsql1 pgpool: LOG:   pid 1116: failover done.
> shutdown host pgsql2.freyatest.domain(5432)
>
>
> Please note the slave node is being shut down, not the master node
> (which is the one that was being reloaded).
>
> This is seen in the PostgreSQL log of the master node (the one being  
> reloaded):
>
> [..]
>
> 2009-02-23 12:12:37 CET LOG:  statement: BEGIN;
> 2009-02-23 12:12:37 CET LOG:  statement: INSERT INTO history (tid,
> bid, aid, delta, mtime) VALUES (5, 1, 63021, -3004,
> CURRENT_TIMESTAMP);
> 2009-02-23 12:12:37 CET LOG:  statement: UPDATE accounts SET abalance
> = abalance + 2213 WHERE aid = 82551;
> 2009-02-23 12:12:37 CET LOG:  statement: SELECT abalance FROM accounts
> WHERE aid = 82551;
> 2009-02-23 12:12:37 CET LOG:  statement: UPDATE tellers SET tbalance =
> tbalance + 2213 WHERE tid = 9;
> 2009-02-23 12:12:37 CET LOG:  statement: UPDATE branches SET bbalance
> = bbalance + 2213 WHERE bid = 1;
> 2009-02-23 12:12:37 CET LOG:  incomplete startup packet
> 2009-02-23 12:12:37 CET LOG:  statement: END;
> 2009-02-23 12:12:37 CET LOG:  received SIGHUP, reloading  
> configuration files
>
>
> And this is from the PostgreSQL log on the slave node:
>
> [..]
> 2009-02-23 12:12:37 CET LOG:  statement: BEGIN;
> 2009-02-23 12:12:37 CET LOG:  statement: INSERT INTO history (tid,
> bid, aid, delta, mtime) VALUES (5, 1, 63021, -3004,
> CURRENT_TIMESTAMP);
> 2009-02-23 12:12:37 CET LOG:  statement: UPDATE accounts SET abalance
> = abalance + 2213 WHERE aid = 82551;
> 2009-02-23 12:12:37 CET LOG:  statement: END;
> 2009-02-23 12:12:37 CET LOG:  statement: UPDATE tellers SET tbalance =
> tbalance + 2213 WHERE tid = 9;
> 2009-02-23 12:12:37 CET LOG:  statement: UPDATE branches SET bbalance
> = bbalance + 2213 WHERE bid = 1;
> 2009-02-23 12:12:37 CET LOG:  statement: BEGIN;
> 2009-02-23 12:12:37 CET LOG:  unexpected EOF on client connection
> 2009-02-23 12:12:37 CET LOG:  unexpected EOF on client connection
> 2009-02-23 12:12:37 CET LOG:  unexpected EOF on client connection
> 2009-02-23 12:12:37 CET LOG:  unexpected EOF on client connection
> 2009-02-23 12:12:37 CET LOG:  unexpected EOF on client connection
> 2009-02-23 12:12:37 CET LOG:  unexpected EOF on client connection
> 2009-02-23 12:12:37 CET LOG:  unexpected EOF on client connection
> 2009-02-23 12:12:37 CET LOG:  unexpected EOF on client connection
> 2009-02-23 12:12:37 CET LOG:  unexpected EOF on client connection
> 2009-02-23 12:12:37 CET LOG:  unexpected EOF on client connection
>
> PCP says this:
>
> # pcp_node_info 5 postgresql 9898 root <password> 0
> pgsql1.freyatest.domain 5432 2 1073741823.500000
>
> # pcp_node_info 5 postgresql 9898 root <password> 1
> pgsql2.freyatest.domain 5432 3 1073741823.500000
>
> So it is, in fact, the slave node being degenerated, even though it
> was NOT the one reloaded. I think this is because the error was
> triggered by the directive "replication_stop_on_mismatch = true",
> which leads pgpool-II to always stay with the master node.
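>
> For reference, here is the relevant fragment of my pgpool.conf as I
> understand it (directive name as in pgpool-II 2.2; the comment is my
> own reading of its effect):
>
> ```
> # On a packet-kind mismatch among backends, degenerate the minority
> # backend instead of only aborting the session.
> replication_stop_on_mismatch = true
> ```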
>
> This is the output of the last command executed in my laptop:
>
> [..]
> client 0 sending BEGIN;
> client 1 receiving
> client 1 sending INSERT INTO history (tid, bid, aid, delta, mtime)
> VALUES (9, 1, 82551, 2213, CURRENT_TIMESTAMP);
> client 0 receiving
> Client 0 aborted in state 7: ERROR:  kind mismatch among backends.
> Possible last query was: "BEGIN;" kind details are: 0[S] 1[C]
> HINT:  check data consistency among db nodes
> client 1 receiving
> client 1 sending END;
> client 1 receiving
> Client 1 aborted in state 13. Probably the backend died while  
> processing.
> transaction type: TPC-B (sort of)
> scaling factor: 1
> number of clients: 2
> number of transactions per client: 100
> number of transactions actually processed: 32/200
> tps = 16.542050 (including connections establishing)
> tps = 16.646950 (excluding connections establishing)
>
>
> After this happened, I copied the above logs from the consoles and
> pasted them here. Then I stopped the PostgreSQL server on the slave
> node and next I tried to recover the degenerated node:
>
> $ pcp_recovery_node 5 postgresql 9898 root <password> 1
> BackendError
>
> Despite that message, the process went fine, and the slave node is now
> part of the cluster.
>
> I've double-checked the PostgreSQL logs on the master node, and they
> always say "received SIGHUP, reloading configuration files", yet the
> problem above still happens.
>
> After this, I restarted the slave node. It happened that a user was
> launching some queries while I did that, so the node went down. This
> is the syslog of pgpool-II:
>
> Feb 23 12:26:05 pgsql1 pgpool: ERROR: pid 3346: pool_read: read failed
> (Connection reset by peer)
> Feb 23 12:26:05 pgsql1 pgpool: LOG:   pid 3346:
> ProcessFrontendResponse: failed to read kind from frontend. frontend
> abnormally exited
> Feb 23 12:29:27 pgsql1 pgpool: LOG:   pid 3343: notice_backend_error:
> 1 fail over request from pid 3343
> Feb 23 12:29:27 pgsql1 pgpool: LOG:   pid 3369: notice_backend_error:
> 1 fail over request from pid 3369
> Feb 23 12:29:27 pgsql1 pgpool: LOG:   pid 1116: starting degeneration.
> shutdown host pgsql2.freyatest.domain(5432)
> Feb 23 12:29:27 pgsql1 pgpool: LOG:   pid 1116: execute command:
> /var/lib/postgresql/8.3/main/pgpool-failover 1 pgsql2.freyatest.domain
> 5432 /var/lib/postgresql/8.3/main 0 0
> Feb 23 12:29:27 pgsql1 pgpool[3422]: Executing pgpool-failover as  
> user postgres
> Feb 23 12:29:27 pgsql1 pgpool[3423]: Failover of node 1 at hostname
> pgsql2.freyatest.domain. New master node is 0. Old master node was 0.
> Feb 23 12:29:27 pgsql1 pgpool: LOG:   pid 1116: failover_handler: set
> new master node: 0
> Feb 23 12:29:27 pgsql1 pgpool: LOG:   pid 1116: failover done.
> shutdown host pgsql2.freyatest.domain(5432)
> Feb 23 12:29:27 pgsql1 pgpool: LOG:   pid 1116: failover: no backends
> are degenerated
>
>
> Okay, here we go again. I stopped the slave PostgreSQL and requested
> an online recovery, which failed. This is the syslog:
>
> Feb 23 12:33:36 pgsql1 pgpool: LOG:   pid 3017: starting recovering  
> node 1
> Feb 23 12:33:36 pgsql1 pgpool: LOG:   pid 3017: CHECKPOINT in the 1st
> stage done
> Feb 23 12:33:36 pgsql1 pgpool: LOG:   pid 3017: starting recovery
> command: "SELECT pgpool_recovery('base-backup',
> 'pgsql2.freyatest.domain', '/var/lib/postgresql/8.3/main')"
> Feb 23 12:33:36 pgsql1 pgpool[3508]: Executing base-backup as user  
> postgres
> Feb 23 12:33:36 pgsql1 pgpool[3509]: Executing pg_start_backup
> Feb 23 12:33:37 pgsql1 pgpool[3512]: Creating file recovery.conf
> Feb 23 12:33:37 pgsql1 pgpool[3513]: Rsyncing directory base
> Feb 23 12:33:37 pgsql1 pgpool[3516]: Rsyncing directory global
> Feb 23 12:33:37 pgsql1 pgpool[3519]: Rsyncing directory pg_clog
> Feb 23 12:33:37 pgsql1 pgpool[3522]: Rsyncing directory pg_multixact
> Feb 23 12:33:37 pgsql1 pgpool[3525]: Rsyncing directory pg_subtrans
> Feb 23 12:33:37 pgsql1 pgpool[3528]: Rsyncing directory pg_tblspc
> Feb 23 12:33:37 pgsql1 pgpool[3531]: Rsyncing directory pg_twophase
> Feb 23 12:33:37 pgsql1 pgpool[3534]: Rsyncing directory pg_xlog
> Feb 23 12:33:40 pgsql1 pgpool[3537]: Rsyncing file recovery.conf (with
> source deletion)
> Feb 23 12:33:40 pgsql1 pgpool[3540]: Executing pg_stop_backup
> Feb 23 12:33:41 pgsql1 pgpool: LOG:   pid 3017: 1st stage is done
> Feb 23 12:33:41 pgsql1 pgpool: LOG:   pid 3017: starting 2nd stage
> Feb 23 12:34:11 pgsql1 pgpool: LOG:   pid 3444: pool_process_query:
> child connection forced to terminate due to
> client_idle_limit_in_recovery(30) reached
> Feb 23 12:34:11 pgsql1 pgpool: LOG:   pid 3425: pool_process_query:
> child connection forced to terminate due to
> client_idle_limit_in_recovery(30) reached
> Feb 23 12:34:14 pgsql1 pgpool: ERROR: pid 3017:
> wait_connection_closed: existing connections did not close in 30 sec.
> Feb 23 12:34:14 pgsql1 pgpool: ERROR: pid 3017: start_recovery:
> timeover for waiting connection closed
>
>
> Users were using the database while I did the recovery, which, in my
> opinion, made it fail. So I tried another recovery, and it failed
> again with the same error.
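>
> If I read the logs above correctly, the 2nd stage cannot begin until
> every client connection is closed, so any session that keeps issuing
> queries makes wait_connection_closed time out. The pgpool.conf
> parameters involved, as I understand them (names from pgpool-II 2.2;
> the values are the ones my logs suggest):
>
> ```
> # During recovery, force-disconnect clients idle this many seconds.
> client_idle_limit_in_recovery = 30
> # Give up recovery if connections do not close within this many seconds.
> recovery_timeout = 30
> ```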
>
> Then I tried again, and this time it succeeded. There were no
> connections or queries in the PostgreSQL log while it ran, but I am
> still unsure about the real problem.
>
> So I tried something else: I repeated all the tests with pgbench and,
> again, reloaded the master node while under load, but this time with
> no changes to postgresql.conf. And it failed again: the slave node was
> degenerated.
>
> I stopped the PostgreSQL in the slave node immediately and requested
> an online recovery. Luckily, no connections or queries were sent to
> the cluster while doing the recovery, and it went fine.
>
> Conclusion: I am still not sure what makes it fail, but it does. I am
> almost able to reproduce the error, but it does not happen every time
> (7 times out of 10, more or less). There is the "reloading while under
> load" issue, and there is the "not being able to recover the node"
> issue.
>
> Tatsuo, could you please try to reproduce what I do here on your
> computer/servers and see what happens? I could understand the
> /etc/init.d/postgresql-8.3 script of Debian doing more than a reload,
> but the line in the PostgreSQL log says otherwise. And there is also
> this "wait_connection_closed: existing connections did not close in 30
> sec" error, which I don't understand :'(
>
> Should you need any more help with this, or any more information,
> tests, whatever, please let me know.
>
> Thanks in advance.
>
> -- 
> Jaume Sabater
> http://linuxsilo.net/
>
> "Ubi sapientas ibi libertas"
> _______________________________________________
> Pgpool-general mailing list
> Pgpool-general at pgfoundry.org
> http://pgfoundry.org/mailman/listinfo/pgpool-general


