View Issue Details

IDProjectCategoryView StatusLast Update
0000303Pgpool-IIBugpublic2017-08-03 14:44
Reporternagata Assigned Tot-ishii  
PrioritynormalSeverityminorReproducibilityunable to reproduce
Status resolvedResolutionopen 
Summary0000303: db_node_is is not properly initialized
DescriptionPOOL_CONNECTION has a member named db_node_id that is DB node id for this connection. This value is used to output log or degenerate the backend in pool_read(), for example.


 221 /* if fail_over_on_backend_error is true, then trigger failover */
 222 if (pool_config->fail_over_on_backend_error)
 223 {
 224 notice_backend_error(cp->db_node_id, true);
 225
 226 /* If we are in the main process, we will not exit */
 227 child_exit(POOL_EXIT_AND_RESTART);
 228 ereport(ERROR,
 229 (errmsg("unable to read data from DB node %d",cp->db_node_id),
 230 errdetail("socket read failed with an error \"%s\"", strerror(errno))));
 231 }

However, db_node_id is initialized only in connect_backend() which is called only by child processes. In other processes the value is not initialized and always 0.

So, when pool_read is called from the parent process or the worker process during health check or streaming-replication delay check, and emits an error in the above code, log message output is as following, no matter which backend it tried to read from.

 ERROR: unable to read data from DB node 0
                            ~~~~~~~~~~~~~~~
To make the matter worse, if it is the worker process, backend 0 is degenerated.
(If it is the parent process, notice_backend_error() do nothing.)

The resolution is to initialize properly even in the parent process or the worker process.

I attached the patch to fix it. What do you think about this?
TagsNo tags attached.

Activities

nagata

2017-04-19 17:05

developer  

db_node_id.patch (7,255 bytes)   
db_node_id.patch (7,255 bytes)   

t-ishii

2017-04-20 10:57

developer   ~0001435

> To make the matter worse, if it is the worker process, backend 0 is degenerated.

Are you sure?

Replication delay check worker does not trigger in this case.
Typical log is something like:
> 2017-04-20 10:47:13: pid 22543: LOG: failed to connect to PostgreSQL server on "133.137.175.105:11003", getsockopt() detected error "Connection refused"
and pool_read is not called, thus does not trigger failover.

nagata

2017-04-20 13:06

developer   ~0001436

Last edited: 2017-04-20 13:13

I've not reproduced it, so the mechanism is my guess from the source code.
However, certainly the log provided from our client shows that

 2017-04-12 01:26:18: pid 7548: LOG: fork a new worker child process with pid: 12603
 ...
 2017-04-12 01:38:18: pid 12603: LOG: received degenerate backend request for node_id: 0 from pid [12603]
 ...
 2017-04-12 01:38:18: pid 12603: WARNING: child_exit: called from invalid process. ignored.
 2017-04-12 01:38:18: pid 12603: ERROR: unable to read data from DB node 0
 2017-04-12 01:38:18: pid 12603: DETAIL: socket read failed with an error "Connection reset by peer"

pool_read() is called during authentication in make_persistent_db_connection()
do_query() also calls pool_read()

t-ishii

2017-04-20 14:15

developer   ~0001437

Ok.

Regarding your patch, I don't like to modify pool_open(). I recommend leave it as it is. More members may be added to struct POOOL_CONNECTION in the future. We don't want to change pool_open() signature to set such new properties. We should add values for struct members such as is_backend and db_node_id, after calling pool_open().

nagata

2017-04-20 15:54

developer   ~0001438

Why don't initialize the new member in pool_open() even when a new member is added to POOL_CONNECTION? I think it is better to initialize the structure's members as early stage as possible to avoid use uninitialized variable.

t-ishii

2017-04-20 16:03

developer   ~0001439

As I said, that way will bring big maintenance headache in the future.

nagata

2017-04-20 16:17

developer   ~0001440

But, if we add a new member, anyway we have to maintenance this including initialization. So, is it easier to handle it in a single function like pool_open() than doing it different places?

nagata

2017-04-26 13:54

developer   ~0001475

Am I missing something?

t-ishii

2017-04-27 07:57

developer   ~0001480

Go for it.

t-ishii

2017-08-03 12:03

developer   ~0001619

Since there was no progress since April 2017, I took over this and committed/pushed fix.

Issue History

Date Modified Username Field Change
2017-04-19 17:05 nagata New Issue
2017-04-19 17:05 nagata File Added: db_node_id.patch
2017-04-20 10:57 t-ishii Note Added: 0001435
2017-04-20 13:06 nagata Note Added: 0001436
2017-04-20 13:07 nagata Note Edited: 0001436
2017-04-20 13:13 nagata Note Edited: 0001436
2017-04-20 14:15 t-ishii Note Added: 0001437
2017-04-20 15:54 nagata Note Added: 0001438
2017-04-20 16:03 t-ishii Note Added: 0001439
2017-04-20 16:17 nagata Note Added: 0001440
2017-04-26 13:54 nagata Note Added: 0001475
2017-04-27 07:57 t-ishii Note Added: 0001480
2017-04-27 08:01 t-ishii Assigned To => nagata
2017-04-27 08:01 t-ishii Status new => assigned
2017-08-03 12:02 t-ishii Assigned To nagata => t-ishii
2017-08-03 12:03 t-ishii Note Added: 0001619
2017-08-03 14:44 t-ishii Status assigned => resolved