[pgpool-committers: 3745] Re: pgpool: Fix usage of wait(2) in pgpool main process

Tatsuo Ishii ishii at sraoss.co.jp
Wed Jan 4 15:39:29 JST 2017


> Hi ishii-San
> 
> I am looking into the issue
> http://www.pgpool.net/mantisbt/view.php?id=249, where
> pgpool-II sometimes does not perform de-escalation while shutting down. As per
> the bug report, the issue started to appear after this commit.
> 
> Although I am not able to replicate the exact reported issue, it seems
> that the changes made by this commit can leave zombie processes.
> 
> The commit replaces wait(NULL) with waitpid(..., WNOHANG):
> 
> @@ -1365,8 +1367,10 @@ static RETSIGTYPE exit_handler(int sig)
>         POOL_SETMASK(&UnBlockSig);
>      do
>      {
> -        wpid = wait(NULL);
> -    }while (wpid > 0 || (wpid == -1 && errno == EINTR));
> +               int ret_pid;
> +        wpid = waitpid(-1, &ret_pid, WNOHANG);
> +    } while (wpid > 0 || (wpid == -1 && errno == EINTR));
> 
> The problem with this logic is that after replacing wait(NULL) with
> waitpid(..., WNOHANG) we can move forward without waiting for all child
> processes to finish, especially if some child process takes a little longer
> to finish. With WNOHANG, waitpid() returns 0 when no child has exited
> yet, even though child processes still exist.
> For example,
> at the time of system shutdown, the watchdog process sometimes takes a few
> seconds to execute the de-escalation process before exiting. Meanwhile,
> in the main process, waitpid(..., WNOHANG) returns 0 right away and the
> pgpool-II main process exits, leaving the watchdog process behind as a
> zombie.
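
The early exit described above can be reproduced in isolation. The following is only a sketch, not pgpool-II code: spawn_lingering_child() and reap_nohang() are hypothetical names, and the child that blocks in pause() merely stands in for a watchdog process that is still busy. The loop has the same shape as the WNOHANG version in the diff.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that stays alive until it is killed.
 * Returns the child's pid in the parent, or -1 if fork() fails. */
static pid_t spawn_lingering_child(void)
{
    pid_t pid = fork();

    if (pid == 0)
    {
        for (;;)
            pause();            /* stand-in for slow shutdown work */
    }
    return pid;
}

/* Same loop shape as the WNOHANG version in the diff.
 * Returns how many children were reaped and stores the final
 * waitpid() return value in *last_ret. */
static int reap_nohang(pid_t *last_ret)
{
    int   reaped = 0;
    pid_t wpid;

    assert(last_ret != NULL);
    do
    {
        int status;

        wpid = waitpid(-1, &status, WNOHANG);
        if (wpid > 0)
            reaped++;
    } while (wpid > 0 || (wpid == -1 && errno == EINTR));

    *last_ret = wpid;
    return reaped;
}
```

If reap_nohang() is called while the child from spawn_lingering_child() is still running, the loop ends at once with a final return value of 0 and nothing reaped, so the caller is free to exit while the child is still alive.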

You are right. I should not have used WNOHANG here. The line should
have been:

        wpid = waitpid(-1, &ret_pid, 0);
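
For illustration, a loop built around the blocking call might look like the sketch below. It is not the code that was committed: spawn_slow_children() and reap_all_children() are hypothetical names, and the one-second sleep in each child only simulates slow shutdown work.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork n children that each sleep for one second
 * and then exit. Returns the number of children actually started. */
static int spawn_slow_children(int n)
{
    int started = 0;

    for (int i = 0; i < n; i++)
    {
        pid_t pid = fork();

        if (pid == 0)
        {
            sleep(1);           /* simulated slow shutdown work */
            _exit(0);
        }
        if (pid > 0)
            started++;
    }
    return started;
}

/* Block until every child has exited. Returns the number reaped. */
static int reap_all_children(void)
{
    int reaped = 0;

    for (;;)
    {
        int   status;
        pid_t wpid = waitpid(-1, &status, 0);   /* 0: block, no WNOHANG */

        if (wpid > 0)
        {
            reaped++;
            continue;
        }
        if (wpid == -1 && errno == EINTR)
            continue;           /* interrupted by a signal: retry */
        break;                  /* -1 with another errno: stop */
    }
    assert(errno == ECHILD);    /* expected reason: no children left */
    return reaped;
}
```

Calling spawn_slow_children(3) followed by reap_all_children() should return 3 only after all three children have finished, which is the behavior the WNOHANG version lost.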

> Also, could you share the scenario where you ran into the
> infinite wait situation? There may be some other issue in the code, since
> as per the wait() system call documentation it returns -1 when there is no
> child process, so theoretically the wait() call should not cause an infinite
> wait.
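
The documented behavior referred to above can be checked with a small sketch. wait_retrying() is a hypothetical helper, not a pgpool-II function, and the sketch assumes the default SIGCHLD disposition.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: call wait(NULL), retrying when a signal
 * interrupts it. Returns the pid of the reaped child, or -1 on error
 * with the errno value stored in *err. */
static pid_t wait_retrying(int *err)
{
    pid_t wpid;

    assert(err != NULL);
    *err = 0;
    do
    {
        wpid = wait(NULL);
    } while (wpid == -1 && errno == EINTR);

    if (wpid == -1)
        *err = errno;           /* ECHILD when no child exists */
    return wpid;
}
```

In a process that has no children, wait_retrying() should return -1 immediately with ECHILD rather than block.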

I do not remember clearly, but it may have been the case where a child
receives a stop signal (SIGSTOP).
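
Whether this was the actual cause is not confirmed in this thread. As general background only: a stopped child is reported by waitpid() only when WUNTRACED is given, so a plain blocking wait() keeps waiting for as long as the child stays stopped. The sketch below illustrates that behavior; the function name is hypothetical and the code is not taken from pgpool-II.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: returns 1 if a stopped child is reported with
 * WUNTRACED but not by a plain waitpid(), 0 otherwise, and -1 if
 * fork() fails. The child is killed and reaped before returning. */
static int stopped_child_hidden_from_plain_wait(void)
{
    int   status;
    pid_t r;
    pid_t child = fork();

    if (child == -1)
        return -1;
    if (child == 0)
    {
        for (;;)
            pause();            /* child idles until signalled */
    }

    kill(child, SIGSTOP);

    /* WUNTRACED asks waitpid() to report stopped children as well. */
    do
    {
        r = waitpid(child, &status, WUNTRACED);
    } while (r == -1 && errno == EINTR);
    int saw_stop = (r == child && WIFSTOPPED(status));

    /* Without WUNTRACED the stopped child does not count as finished:
     * WNOHANG returns 0 here, and a blocking call would keep waiting. */
    pid_t plain = waitpid(child, &status, WNOHANG);

    /* Clean up: SIGKILL also terminates a stopped process. */
    kill(child, SIGKILL);
    do
    {
        r = waitpid(child, &status, 0);
    } while (r == -1 && errno == EINTR);
    assert(r == child);

    return saw_stop && plain == 0;
}
```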

> On Thu, Jul 7, 2016 at 11:55 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
> 
>> Fix usage of wait(2) in pgpool main process
>>
>> Per [pgpool-hackers: 1444]. Here is the copy of the message:
>>
>> Hi Usama,
>>
>> I have noticed that the usage of wait(2) in pgpool main could cause
>> an infinite wait in the system call.
>>
>>     /* wait for all children to exit */
>>     do
>>     {
>>         wpid = wait(NULL);
>>     }while (wpid > 0 || (wpid == -1 && errno == EINTR));
>>
>> When a child process dies, a SIGCHLD signal is raised and wait(2) learns of
>> the event. However, multiple child deaths do not necessarily create
>> exactly as many SIGCHLD signals as there are dead children, and
>> wait(2) could wait for an event which never happens in this case. I
>> actually encountered this situation while testing pgpool-II. The solution
>> is to use waitpid(2) instead of wait(2).
>>
>> Branch
>> ------
>> master
>>
>> Details
>> -------
>> http://git.postgresql.org/gitweb?p=pgpool2.git;a=commitdiff;h=0d1cdf96feb77de6f1dfc2d46ecd7467325d1f79
>>
>> Modified Files
>> --------------
>> src/main/pgpool_main.c | 12 ++++++++----
>> 1 file changed, 8 insertions(+), 4 deletions(-)
>>
>> _______________________________________________
>> pgpool-committers mailing list
>> pgpool-committers at pgpool.net
>> http://www.pgpool.net/mailman/listinfo/pgpool-committers
>>

