[pgpool-hackers: 3961] Re: verify_backend_node_status does not recognise that the primary is connected to standbys

anirudh.goel at cern.ch anirudh.goel at cern.ch
Sat Jul 10 22:12:52 JST 2021


Hey Tatsuo

The problem I see with environment variables is that there might be multiple instances of PostgreSQL running on a single host (as is the case in my organisation), all of them highly available and connected to pgpools. In this case, configuring different environment variables for the primary hostname may cause issues.

Does that make sense?

Cheers!

Anirudh
On 10 Jul 2021, 5:23 PM +0530, Tatsuo Ishii <ishii at sraoss.co.jp>, wrote:
> Hi Anirudh,
>
> Ok. We can add a new parameter to those scripts in the upcoming Pgpool-II
> 4.3. However, I think it would be better to pass the hostname as an
> environment variable to the scripts because:
>
> 1) it's easier and cleaner for those who write the scripts.
>
> 2) it is easier to add new parameters to those scripts in the future.
>
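> For example (a sketch only; PGPOOL_PRIMARY_NODE_HOST below is a
> hypothetical name, since the actual variable name is not decided yet),
> a recovery_1st_stage script could read the value without any change to
> its argument list:
>
> #!/bin/bash
> # Today: values arrive as positional arguments (variable names follow
> # the sample scripts), so every new value extends this list.
> PRIMARY_NODE_PGDATA="$1"
> DEST_NODE_HOST="$2"
> DEST_NODE_PGDATA="$3"
> PRIMARY_NODE_PORT="$4"
>
> # Proposed: Pgpool-II exports the value before invoking the script,
> # so the script simply reads it.
> PRIMARY_NODE_HOST="${PGPOOL_PRIMARY_NODE_HOST:?not set}"
>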
> What do you think?
>
> > Hey Tatsuo
> >
> > Sorry if I was unclear, and yes, you’re right. I would like to have the primary node host name (as specified by backend_hostname in pgpool.conf) passed to the recovery_1st_stage command.
> >
> > Cheers!
> >
> > Anirudh
> > On 10 Jul 2021, 1:20 PM +0530, Tatsuo Ishii <ishii at sraoss.co.jp>, wrote:
> > > Hi Anirudh,
> > >
> > > I'm confused. I thought you wanted to have the primary node host name
> > > passed to the recovery_1st_stage command, that is
> > > dbod-ag-pg03.ch (or dbod-hac-c02.ch?) in your case, no?
> > >
> > > > Dear Tatsuo
> > > >
> > > > No no, for my case and also for others, I think it would make sense to pass it with the same value as the relevant backend_hostname, as is the norm for the failover & follow_primary commands.
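> > > >
> > > > For reference, this is the pattern I mean: the failover and
> > > > follow_primary commands already receive hostnames through
> > > > %-placeholders in pgpool.conf, e.g. (the script path is just an
> > > > example):
> > > >
> > > > failover_command = '/etc/pgpool-II/failover.sh %d %h %p %D %m %H %M %P %r %R'
> > > > # %h = failed node hostname, %H = new main node hostname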
> > > >
> > > > Cheers!
> > > >
> > > > Anirudh
> > > > On 10 Jul 2021, 11:47 AM +0530, Tatsuo Ishii <ishii at sraoss.co.jp>, wrote:
> > > > > Hi Anirudh,
> > > > >
> > > > > > Dear Tatsuo
> > > > > >
> > > > > > Indeed, the error was what you mentioned. The different backend hostname in primary_conninfo was placed there after an Online Recovery. The reason is that the PRIMARY_NODE_HOST is not passed as a variable to the recovery_1st_stage command/script, so I was using $(hostname) to obtain it, but that returned the name of the host machine rather than the relevant alias.
> > > > > >
> > > > > > I was able to fix this by deriving the appropriate hostname from the PRIMARY_NODE_PGDATA, but shouldn’t this be an implicit feature of pgpool? Can we pass the PRIMARY_NODE_HOST as a variable to the recovery_1st_stage command?
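> > > > > >
> > > > > > For the record, the workaround is just string manipulation on the
> > > > > > PGDATA path; a sketch that only works because of our site naming
> > > > > > convention (/ORA/dbs03/AG_PG03/data <-> dbod-ag-pg03.ch):
> > > > > >
> > > > > > # /ORA/dbs03/AG_PG03/data -> AG_PG03
> > > > > > INSTANCE=$(basename "$(dirname "$PRIMARY_NODE_PGDATA")")
> > > > > > # AG_PG03 -> dbod-ag-pg03.ch
> > > > > > PRIMARY_NODE_HOST="dbod-$(echo "$INSTANCE" | tr 'A-Z_' 'a-z-').ch"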
> > > > >
> > > > > Ok, that would be convenient for users like you, I agree. However, I
> > > > > just want to make sure that I understand your requirement correctly:
> > > > > would you like PRIMARY_NODE_HOST to be the same as backend_hostname2
> > > > > (dbod-ag-pg03.ch) in your case? Or is it possible that
> > > > > PRIMARY_NODE_HOST is neither dbod-ag-pg03.ch nor dbod-hac-c02.ch?
> > > > >
> > > > > > Cheers!
> > > > > >
> > > > > > Anirudh
> > > > > > On 2 Jul 2021, 7:18 AM +0530, Tatsuo Ishii <ishii at sraoss.co.jp>, wrote:
> > > > > > > > Hello
> > > > > > > >
> > > > > > > > I have a setup with 1 primary and 2 async standby PostgreSQL nodes behind 1 pgpool, all running on separate hosts, without watchdog.
> > > > > > > >
> > > > > > > > I believe I have done all the config required for pgpool to recognise that the standbys are connected to a single primary (to enable the detach_false_primary feature), but it still doesn’t recognise it.
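> > > > > > > >
> > > > > > > > For reference, the settings this feature depends on (as I understand the docs: the streaming replication check must be enabled, plus the flag itself) are:
> > > > > > > >
> > > > > > > > sr_check_period = 10
> > > > > > > > detach_false_primary = on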
> > > > > > > >
> > > > > > > > Before getting into more details: I have tried pcp_detach_node, pcp_attach_node, and also pcp_recovery_node, in the hope that I was missing some step that caused pgpool not to connect to the standbys properly, but none of those helped.
> > > > > > > >
> > > > > > > > THE ERROR:
> > > > > > > > I see these log entries from pgpool:
> > > > > > > >
> > > > > > > > 2021-07-01 17:07:44: pid 559: LOG: verify_backend_node_status: primary 2 does not connect to standby 0
> > > > > > > > 2021-07-01 17:07:44: pid 559: LOG: verify_backend_node_status: primary 2 does not connect to standby 1
> > > > > > > > 2021-07-01 17:07:44: pid 559: LOG: verify_backend_node_status: primary 2 owns only 0 standbys out of 2
> > > > > > > > 2021-07-01 17:08:34: pid 559: LOG: verify_backend_node_status: primary 2 does not connect to standby 0
> > > > > > > > 2021-07-01 17:08:34: pid 559: LOG: verify_backend_node_status: primary 2 does not connect to standby 1
> > > > > > > > 2021-07-01 17:08:34: pid 559: LOG: verify_backend_node_status: primary 2 owns only 0 standbys out of 2
> > > > > > > > 2021-07-01 17:09:25: pid 559: LOG: verify_backend_node_status: primary 2 does not connect to standby 0
> > > > > > > > 2021-07-01 17:09:25: pid 559: LOG: verify_backend_node_status: primary 2 does not connect to standby 1
> > > > > > > > 2021-07-01 17:09:25: pid 559: LOG: verify_backend_node_status: primary 2 owns only 0 standbys out of 2
> > > > > > >
> > > > > > > I think the reason you are seeing these errors is that pgpool tries to
> > > > > > > compare backend_hostname2 (dbod-ag-pg03.ch) and backend_port2 (6600)
> > > > > > > with the host and port (dbod-hac-c02.ch, 6600) in primary_conninfo. In
> > > > > > > this case the port matches but the hostname does not, so Pgpool-II
> > > > > > > regards the standbys as not connected to the primary node.
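> > > > > > >
> > > > > > > You can check the values Pgpool-II compares against by running this
> > > > > > > on a standby (node 0, for example):
> > > > > > >
> > > > > > > psql -h dbod-ag-pg01.ch -p 6601 -U pgpool -d postgres -c "SELECT sender_host, sender_port FROM pg_stat_wal_receiver;"
> > > > > > >
> > > > > > > If your replication setup allows it, making primary_conninfo on the
> > > > > > > standbys use the same hostname as backend_hostname2 should satisfy
> > > > > > > the check, for example:
> > > > > > >
> > > > > > > primary_conninfo = 'host=dbod-ag-pg03.ch port=6600 user=pgrepl application_name=ag_pg01 passfile=''/home/postgres/.pgpass'''
> > > > > > >
> > > > > > > (Alternatively, set backend_hostname2 = 'dbod-hac-c02.ch' in pgpool.conf.)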
> > > > > > >
> > > > > > > Best regards,
> > > > > > > --
> > > > > > > Tatsuo Ishii
> > > > > > > SRA OSS, Inc. Japan
> > > > > > > English: http://www.sraoss.co.jp/index_en.php
> > > > > > > Japanese: http://www.sraoss.co.jp
> > > > > > >
> > > > > > > > Relevant config from pgpool.conf:
> > > > > > > > # - Backend Connection Settings -
> > > > > > > >
> > > > > > > > backend_hostname0 = 'dbod-ag-pg01.ch'
> > > > > > > > backend_data_directory0 = '/ORA/dbs03/AG_PG01/data'
> > > > > > > > backend_port0 = 6601
> > > > > > > > backend_weight0 = 1
> > > > > > > > backend_flag0 = 'ALLOW_TO_FAILOVER'
> > > > > > > > backend_application_name0 = 'ag_pg01'
> > > > > > > >
> > > > > > > > backend_hostname1 = 'dbod-ag-pg02.ch'
> > > > > > > > backend_data_directory1 = '/ORA/dbs03/AG_PG02/data'
> > > > > > > > backend_port1 = 6604
> > > > > > > > backend_weight1 = 1
> > > > > > > > backend_flag1 = 'ALLOW_TO_FAILOVER'
> > > > > > > > backend_application_name1 = 'ag_pg02'
> > > > > > > >
> > > > > > > > backend_hostname2 = 'dbod-ag-pg03.ch'
> > > > > > > > backend_data_directory2 = '/ORA/dbs03/AG_PG03/data'
> > > > > > > > backend_port2 = 6600
> > > > > > > > backend_weight2 = 1
> > > > > > > > backend_flag2 = 'ALLOW_TO_FAILOVER'
> > > > > > > > backend_application_name2 = 'ag_pg03'
> > > > > > > >
> > > > > > > > Relevant config from postgresql.conf / myrecovery.conf (application_name is present in primary_conninfo and matches backend_application_name):
> > > > > > > > primary_conninfo = 'host=dbod-hac-c02.ch port=6600 user=pgrepl application_name=ag_pg01 passfile=''/home/postgres/.pgpass'''
> > > > > > > >
> > > > > > > > "GRANT pg_monitor TO pgpool;” has been done:
> > > > > > > > postgres=# \dg
> > > > > > > >  List of roles
> > > > > > > >  Role name | Attributes | Member of
> > > > > > > > -----------------+------------------------------------------------------------+--------------
> > > > > > > >  admin | Create role, Create DB | {}
> > > > > > > >  dod_dbmon | | {}
> > > > > > > >  dod_pmm | Superuser | {}
> > > > > > > >  pgpool | | {pg_monitor}
> > > > > > > >  pgpool_recovery | Superuser | {}
> > > > > > > >  pgrepl | Replication | {}
> > > > > > > >  postgres | Superuser, Create role, Create DB, Replication, Bypass RLS | {}
> > > > > > > >
> > > > > > > > Users for sr_check and health_check:
> > > > > > > > sr_check_user = 'pgpool'
> > > > > > > > sr_check_database = 'postgres'
> > > > > > > > health_check_user = 'pgpool'
> > > > > > > > health_check_database = 'postgres'
> > > > > > > >
> > > > > > > > Versions:
> > > > > > > > Pgpool-II - 4.2.2
> > > > > > > > PostgreSQL - 12.6
> > > > > > > >
> > > > > > > > "show pool_nodes" and "show pool_health_check_stats" show that everything is fine:
> > > > > > > > postgres=# show pool_nodes;
> > > > > > > >  node_id | hostname        | port | status | lb_weight | role    | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
> > > > > > > > ---------+-----------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
> > > > > > > >  0       | dbod-ag-pg01.ch | 6601 | up     | 0.333333  | standby | 3          | true              | 0                 | streaming         | async                  | 2021-07-01 17:37:42
> > > > > > > >  1       | dbod-ag-pg02.ch | 6604 | up     | 0.333333  | standby | 0          | false             | 0                 | streaming         | async                  | 2021-07-01 17:37:42
> > > > > > > >  2       | dbod-ag-pg03.ch | 6600 | up     | 0.333333  | primary | 2          | false             | 0                 |                   |                        | 2021-07-01 16:51:00
> > > > > > > > (3 rows)
> > > > > > > >
> > > > > > > > postgres=# SHOW POOL_HEALTH_CHECK_STATS;
> > > > > > > >  node_id | hostname        | port | status | role    | last_status_change  | total_count | success_count | fail_count | skip_count | retry_count | average_retry_count | max_retry_count | max_duration | min_duration | average_duration | last_health_check   | last_successful_health_check | last_skip_health_check | last_failed_health_check
> > > > > > > > ---------+-----------------+------+--------+---------+---------------------+-------------+---------------+------------+------------+-------------+---------------------+-----------------+--------------+--------------+------------------+---------------------+------------------------------+------------------------+--------------------------
> > > > > > > >  0       | dbod-ag-pg01.ch | 6601 | up     | standby | 2021-07-01 17:37:42 | 2029        | 1998          | 0          | 31         | 0           | 0.000000            | 0               | 1033         | 21           | 26.975475        | 2021-07-01 17:38:47 | 2021-07-01 17:38:47          | 2021-07-01 16:52:00    |
> > > > > > > >  1       | dbod-ag-pg02.ch | 6604 | up     | standby | 2021-07-01 17:37:42 | 2027        | 1850          | 1          | 176        | 5           | 0.002701            | 5               | 20010        | 23           | 40.249595        | 2021-07-01 17:38:50 | 2021-07-01 17:38:50          | 2021-07-01 16:55:53    | 2021-07-01 14:05:05
> > > > > > > >  2       | dbod-ag-pg03.ch | 6600 | up     | primary | 2021-07-01 16:51:00 | 2024        | 2010          | 0          | 14         | 0           | 0.000000            | 0               | 834          | 23           | 54.582090        | 2021-07-01 17:38:52 | 2021-07-01 17:38:52          | 2021-07-01 16:38:07    |
> > > > > > > > (3 rows)
> > > > > > > >
> > > > > > > > If I understand correctly, this is the query that pgpool runs on the standbys to determine whether they are talking to the current primary, and its result (shown here in expanded display for readability) seems alright to me:
> > > > > > > > postgres=# select * from pg_stat_wal_receiver;
> > > > > > > > -[ RECORD 1 ]---------+----------------------------------------
> > > > > > > > pid                   | 1651
> > > > > > > > status                | streaming
> > > > > > > > receive_start_lsn     | 1/0
> > > > > > > > receive_start_tli     | 5
> > > > > > > > received_lsn          | 1/10009ED8
> > > > > > > > received_tli          | 5
> > > > > > > > last_msg_send_time    | 2021-07-01 19:41:12.659496+02
> > > > > > > > last_msg_receipt_time | 2021-07-01 19:41:12.659872+02
> > > > > > > > latest_end_lsn        | 1/10009ED8
> > > > > > > > latest_end_time       | 2021-07-01 19:41:12.659496+02
> > > > > > > > slot_name             |
> > > > > > > > sender_host           | dbod-hac-c02.ch
> > > > > > > > sender_port           | 6600
> > > > > > > > conninfo              | user=pgrepl passfile=/home/postgres/.pgpass dbname=replication host=dbod-hac-c02.ch port=6600 application_name=ag_pg01 fallback_application_name=walreceiver sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres target_session_attrs=any
> > > > > > > >
> > > > > > > > Cheers!
> > > > > > > >
> > > > > > > > Anirudh