[pgpool-hackers: 4038] Re: PgPool automatically performs failover after few mins without any activity

Thu Oct 7 16:42:54 JST 2021

Just FYI, I also asked one of my colleagues to test it. They set it up without Prometheus or alertmanager. Just ran the commands given in the tutorial. They observed the same behaviour.

Starting Pgpool-II...
2021-10-07 07:18:39: pid 20: LOG:  Backend status file /tmp/pgpool_status does not exist
2021-10-07 07:18:39: pid 20: LOG:  health_check_stats_shared_memory_size: requested size: 12288
2021-10-07 07:18:39: pid 20: LOG:  memory cache initialized
2021-10-07 07:18:39: pid 20: DETAIL:  memcache blocks :64
2021-10-07 07:18:39: pid 20: LOG:  allocating (136555320) bytes of shared memory segment
2021-10-07 07:18:39: pid 20: LOG:  allocating shared memory segment of size: 136555320
2021-10-07 07:18:39: pid 20: LOG:  health_check_stats_shared_memory_size: requested size: 12288
2021-10-07 07:18:39: pid 20: LOG:  health_check_stats_shared_memory_size: requested size: 12288
2021-10-07 07:18:39: pid 20: LOG:  memory cache initialized
2021-10-07 07:18:39: pid 20: DETAIL:  memcache blocks :64
2021-10-07 07:18:39: pid 20: LOG:  pool_discard_oid_maps: discarded memqcache oid maps
2021-10-07 07:18:39: pid 20: LOG:  Setting up socket for 0.0.0.0:9999
2021-10-07 07:18:39: pid 20: LOG:  Setting up socket for :::9999
2021-10-07 07:18:39: pid 20: LOG:  find_primary_node_repeatedly: waiting for finding a primary node
2021-10-07 07:18:39: pid 20: LOG:  find_primary_node: primary node is 0
2021-10-07 07:18:39: pid 20: LOG:  find_primary_node: standby node is 1
2021-10-07 07:18:39: pid 54: LOG:  PCP process: 54 started
2021-10-07 07:18:39: pid 56: LOG:  process started
2021-10-07 07:18:39: pid 55: LOG:  process started
2021-10-07 07:18:39: pid 57: LOG:  process started
2021-10-07 07:18:39: pid 20: LOG:  pgpool-II successfully started. version 4.2.3 (chichiriboshi)
2021-10-07 07:18:39: pid 20: LOG:  node status[0]: 1
2021-10-07 07:18:39: pid 20: LOG:  node status[1]: 2
2021-10-07 07:18:42: pid 52: LOG:  pool_reuse_block: blockid: 0
2021-10-07 07:18:42: pid 52: CONTEXT:  while searching system catalog, When relcache is missed
2021-10-07 07:19:49: pid 53: LOG:  status_changed_time 1633591122
2021-10-07 07:19:49: pid 53: LOG:  status_changed_time 1633591122
2021-10-07 07:19:59: pid 56: ERROR:  authentication failed
2021-10-07 07:19:59: pid 56: DETAIL:  password authentication failed for user "postgres"
2021-10-07 07:19:59: pid 56: LOG:  health check failed on node 0 (timeout:0)
2021-10-07 07:19:59: pid 56: LOG:  received degenerate backend request for node_id: 0 from pid [56]
2021-10-07 07:19:59: pid 20: LOG:  Pgpool-II parent process has received failover request
2021-10-07 07:19:59: pid 20: LOG:  starting degeneration. shutdown host primary(5432)
2021-10-07 07:19:59: pid 20: LOG:  Restart all children
2021-10-07 07:19:59: pid 20: LOG:  find_primary_node_repeatedly: waiting for finding a primary node
2021-10-07 07:19:59: pid 20: LOG:  find_primary_node: standby node is 1
2021-10-07 07:20:00: pid 20: LOG:  find_primary_node: standby node is 1
2021-10-07 07:20:01: pid 20: LOG:  find_primary_node: standby node is 1
2021-10-07 07:20:02: pid 20: LOG:  find_primary_node: standby node is 1
2021-10-07 07:20:03: pid 20: LOG:  find_primary_node: standby node is 1
2021-10-07 07:20:05: pid 20: LOG:  find_primary_node: standby node is 1
2021-10-07 07:20:06: pid 20: LOG:  find_primary_node: standby node is 1
2021-10-07 07:20:07: pid 20: LOG:  find_primary_node: standby node is 1

Cheers
Anirudh
On 7 Oct 2021, 9:27 AM +0200, anirudh.goel at cern.ch, wrote:
> Hello Bo
>
> Thank you for your quick response.
>
> I am not using Kubernetes for the setup. I just followed this tutorial and ran the docker commands without any additional / modified configuration - https://b-peng.blogspot.com/2021/07/deploying-pgpool2-exporter-with-docker.html.
>
> I am doing the whole setup using docker-compose and have attached the docker-compose.yml file herewith.
>
> Cheers
> Anirudh
> On 7 Oct 2021, 7:23 AM +0200, Bo Peng <pengbo at sraoss.co.jp>, wrote:
> > Hello,
> >
> > > I was following this amazing tutorial by Bo Peng to setup a pgpool cluster with pgpool exporter. The setup went fine and I was able to run the exporter successfully, but on further testing, I found something strange. The setup works fine for a few mins, but after that, even without any activity on the primary or standby, pgpool throws an "authentication failed, password authentication failed for user “postgres”” error and performs a failover. Here are the logs-
> > >
> > > Configuring Pgpool-II...
> > > No custom pgpool.conf detected. Use environment variables and default config.
> > > Starting Pgpool-II...
> > > 2021-10-06 17:19:27: pid 20: LOG: Backend status file /tmp/pgpool_status does not exist
> > > 2021-10-06 17:19:27: pid 20: LOG: health_check_stats_shared_memory_size: requested size: 12288
> > > 2021-10-06 17:19:27: pid 20: LOG: memory cache initialized
> > > 2021-10-06 17:19:27: pid 20: DETAIL: memcache blocks :64
> > > 2021-10-06 17:19:27: pid 20: LOG: allocating (136555320) bytes of shared memory segment
> > > 2021-10-06 17:19:27: pid 20: LOG: allocating shared memory segment of size: 136555320
> > > 2021-10-06 17:19:27: pid 20: LOG: health_check_stats_shared_memory_size: requested size: 12288
> > > 2021-10-06 17:19:27: pid 20: LOG: health_check_stats_shared_memory_size: requested size: 12288
> > > 2021-10-06 17:19:27: pid 20: LOG: memory cache initialized
> > > 2021-10-06 17:19:27: pid 20: DETAIL: memcache blocks :64
> > > 2021-10-06 17:19:27: pid 20: LOG: pool_discard_oid_maps: discarded memqcache oid maps
> > > 2021-10-06 17:19:27: pid 20: LOG: Setting up socket for 0.0.0.0:9999
> > > 2021-10-06 17:19:27: pid 20: LOG: Setting up socket for :::9999
> > > 2021-10-06 17:19:27: pid 20: LOG: find_primary_node_repeatedly: waiting for finding a primary node
> > > 2021-10-06 17:19:28: pid 20: LOG: find_primary_node: primary node is 0
> > > 2021-10-06 17:19:28: pid 20: LOG: find_primary_node: standby node is 1
> > > 2021-10-06 17:19:28: pid 54: LOG: PCP process: 54 started
> > > 2021-10-06 17:19:28: pid 57: LOG: process started
> > > 2021-10-06 17:19:28: pid 56: LOG: process started
> > > 2021-10-06 17:19:28: pid 55: LOG: process started
> > > 2021-10-06 17:19:28: pid 20: LOG: pgpool-II successfully started. version 4.2.3 (chichiriboshi)
> > > 2021-10-06 17:19:28: pid 20: LOG: node status[0]: 1
> > > 2021-10-06 17:19:28: pid 20: LOG: node status[1]: 2
> > > 2021-10-06 17:19:28: pid 47: LOG: pool_reuse_block: blockid: 0
> > > 2021-10-06 17:19:28: pid 47: CONTEXT: while searching system catalog, When relcache is missed
> > > 2021-10-06 17:19:43: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:19:43: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:19:53: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:19:53: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:20:03: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:20:03: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:20:13: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:20:13: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:20:23: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:20:23: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:20:33: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:20:33: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:20:43: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:20:43: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:20:53: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:20:53: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:21:03: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:21:03: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:21:13: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:21:13: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:21:23: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:21:23: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:21:33: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:21:33: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:21:43: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:21:43: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:21:48: pid 57: ERROR: authentication failed
> > > 2021-10-06 17:21:48: pid 57: DETAIL: password authentication failed for user "postgres"
> > > 2021-10-06 17:21:48: pid 57: LOG: health check failed on node 1 (timeout:0)
> >
> > Are you following this configuration example?
> >
> > https://www.pgpool.net/docs/latest/en/html/example-kubernetes.html
> >
> > It seems that the health check process failed.
> > If you are running your cluster on K8s, you need to
> > disable Pgpool-II's health check:
> >
> > - name: PGPOOL_PARAMS_HEALTH_CHECK_PERIOD
> > value: "0"
> >
> > By default health check is disabled.
> > How did you set the environment variables?
> >
> > > 2021-10-06 17:21:48: pid 57: LOG: received degenerate backend request for node_id: 1 from pid [57]
> > > 2021-10-06 17:21:48: pid 20: LOG: Pgpool-II parent process has received failover request
> > > 2021-10-06 17:21:48: pid 20: LOG: starting degeneration. shutdown host postgres_replica(5432)
> > > 2021-10-06 17:21:48: pid 20: LOG: Do not restart children because we are switching over node id 1 host: postgres_replica port: 5432 and we are in streaming replication mode
> > > 2021-10-06 17:21:48: pid 20: LOG: failover: set new primary node: 0
> > > 2021-10-06 17:21:48: pid 20: LOG: failover: set new main node: 0
> > > failover done. shutdown host postgres_replica(5432)2021-10-06 17:21:48: pid 20: LOG: failover done. shutdown host postgres_replica(5432)
> > > 2021-10-06 17:21:48: pid 55: LOG: worker process received restart request
> > > 2021-10-06 17:21:49: pid 54: LOG: restart request received in pcp child process
> > > 2021-10-06 17:21:49: pid 20: LOG: PCP child 54 exits with status 0 in failover()
> > > 2021-10-06 17:21:49: pid 20: LOG: fork a new PCP child pid 58 in failover()
> > > 2021-10-06 17:21:49: pid 20: LOG: worker child process with pid: 55 exits with status 256
> > > 2021-10-06 17:21:49: pid 58: LOG: PCP process: 58 started
> > > 2021-10-06 17:21:49: pid 20: LOG: fork a new worker child process with pid: 59
> > > 2021-10-06 17:21:49: pid 59: LOG: process started
> > > 2021-10-06 17:21:53: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:21:53: pid 47: LOG: status_changed_time 1633540908
> > > 2021-10-06 17:22:03: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:22:03: pid 47: LOG: status_changed_time 1633540908
> > > 2021-10-06 17:22:13: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:22:13: pid 47: LOG: status_changed_time 1633540908
> > > 2021-10-06 17:22:23: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:22:23: pid 47: LOG: status_changed_time 1633540908
> > > 2021-10-06 17:22:33: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:22:33: pid 47: LOG: status_changed_time 1633540908
> > > 2021-10-06 17:22:43: pid 47: LOG: status_changed_time 1633540768
> > > 2021-10-06 17:22:43: pid 47: LOG: status_changed_time 1633540908
> > > 2021-10-06 17:22:53: pid 47: LOG: status_changed_time 1633540768
> > >
> > > As you will see in the logs, there’s no activity on my side and I have tried this multiple times and it always shows the same behaviour. I didn’t change anything in the configuration from Bo’s tutorial. Though I have added an additional Prometheus and Alertmanager in front of it. Here’s the configuration for those-
> > >
> > > prometheus:
> > >  restart: unless-stopped
> > >  ports:
> > >  - "9090:9090"
> > >  volumes:
> > >  - ./prometheus/config.yml:/etc/prometheus/prometheus.yml
> > >  - ./prometheus/alert_rules.yml:/etc/prometheus/alert_rules/alert_rules.yml
> > >  depends_on:
> > >  - pgpool2_exporter
> > >  image: "prom/prometheus"
> > >
> > >  alertmanager:
> > >  restart: unless-stopped
> > >  volumes:
> > >  - ./prometheus-alertmanager/config.yml:/etc/alertmanager/alertmanager.yml
> > >  depends_on:
> > >  - prometheus
> > >  image: "quay.io/prometheus/alertmanager"
> > >
> > >
> > > I don’t think prometheus or the alertmanager are causing any problems but I am unable to figure out why pgpool is doing a failover after several minutes of running fine.
> > >
> > > Thanks in advance!
> > >
> > > Anirudh
> >
> >
> > --
> > Bo Peng <pengbo at sraoss.co.jp>
> > SRA OSS, Inc. Japan
> > http://www.sraoss.co.jp/
> _______________________________________________
> pgpool-hackers mailing list
> pgpool-hackers at pgpool.net
> http://www.pgpool.net/mailman/listinfo/pgpool-hackers
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.pgpool.net/pipermail/pgpool-hackers/attachments/20211007/c30b89b6/attachment-0001.htm>