Some daemons need to be killed twice #673

Closed
mohierf opened this issue Jan 4, 2017 · 10 comments

@mohierf
Contributor

mohierf commented Jan 4, 2017

Started alignak and got this (ps -aux | grep alignak-):

root     11449  0.6  2.1 764092 44760 ?        Sl   15:56   0:00 alignak-scheduler scheduler-master
root     11450  0.0  1.9 165520 39632 ?        S    15:56   0:00 alignak-scheduler
root     11475  1.4  2.1 764516 45032 ?        Sl   15:56   0:00 alignak-poller poller-master
root     11476  0.0  1.9 320540 40064 ?        Sl   15:56   0:00 alignak-poller
root     11504  1.6  2.1 764512 45036 ?        Sl   15:56   0:00 alignak-reactionner reactionner-master
root     11505  0.0  1.9 320536 39996 ?        Sl   15:56   0:00 alignak-reactionner
root     11533  2.2  2.2 764932 46916 ?        Sl   15:56   0:00 alignak-broker broker-master
root     11534  0.0  1.9 164900 39340 ?        S    15:56   0:00 alignak-broker
root     11559  1.6  2.2 764604 46808 ?        Sl   15:56   0:00 alignak-receiver receiver-master
root     11560  0.0  1.9 165272 39500 ?        S    15:56   0:00 alignak-receiver
root     11585  5.0  2.2 765796 45732 ?        Sl   15:56   0:00 alignak-arbiter arbiter-master
root     11586  0.0  1.9 166268 40456 ?        S    15:56   0:00 alignak-arbiter
root     11599  0.0  2.1 764220 43228 ?        S    15:56   0:00 alignak-reactionner-master worker
root     11607  0.0  2.1 764224 43316 ?        S    15:56   0:00 alignak-poller-master worker

Then I did:

sudo pkill alignak-

Then I got this (ps -aux | grep alignak-):

root     11475  0.3  2.1 723804 43604 ?        S    15:56   0:00 alignak-poller poller-master
root     11504  0.3  2.1 723800 43572 ?        S    15:56   0:00 alignak-reactionner reactionner-master
root     11628  1.1  2.1 764780 44336 ?        S    15:57   0:00 alignak-reactionner-master worker
root     11636  1.1  2.1 764784 44416 ?        S    15:57   0:00 alignak-poller-master worker

The poller and reactionner daemons are not killed. Only the poller and reactionner workers are killed...

I then have to send a second kill to make the daemons stop.

@mohierf
Contributor Author

mohierf commented Jan 6, 2017

I also observed the same problem with the receiver on another configuration ...

@Seb-Solon
Contributor

If you look closely, the PIDs are different. I am pretty sure this is a pkill issue, because pkill does not guarantee which process is killed first. Try ps -ef | grep alignak: you will see the PPID, and therefore the parent process.

You should see that leftover processes are usually attached to PID 1 and no longer to their parent (if they were child processes).

@mohierf
Contributor Author

mohierf commented Jan 6, 2017

I will try what you suggest to understand what is happening. Thanks

@mohierf
Contributor Author

mohierf commented Jan 7, 2017

They do not appear to be attached to PID 1...

Before killing anything, I started 2 receivers in screen and got this:

alignak  19650  1087  0 09:24 ?        00:00:00 SCREEN -d -S alignak_north_receiver -m bash -c alignak-receiver -c /usr/local/etc/alignak/daemons/North//receiverd-north.ini
alignak  19690  1087  0 09:24 ?        00:00:00 SCREEN -d -S alignak_receiver -m bash -c alignak-receiver -c /usr/local/etc/alignak/daemons/receiverd.ini
alignak  19655 19650  1 09:24 pts/23   00:00:00 alignak-receiver receiver-north
alignak  19692 19690  2 09:24 pts/30   00:00:00 alignak-receiver receiver-master
alignak  19740 19655  0 09:24 pts/23   00:00:00 alignak-receiver
alignak  19750 19692  0 09:24 pts/30   00:00:00 alignak-receiver
alignak  19901 19655  0 09:24 pts/23   00:00:00 alignak-receiver-north module: nsca_north
alignak  19918 19692  0 09:24 pts/30   00:00:00 alignak-receiver-master module: nsca

The parent process attachment seems to be consistent.

I quit the screens and got this:

alignak  19655  1087  0 09:24 ?        00:00:01 alignak-receiver receiver-north
alignak  19692  1087  0 09:24 ?        00:00:01 alignak-receiver receiver-master
alignak  19901 19655  0 09:24 ?        00:00:00 alignak-receiver-north module: nsca_north
alignak  19918 19692  0 09:24 ?        00:00:00 alignak-receiver-master module: nsca

Note that all the other daemons are launched the same way and they stop correctly when I quit the screens. Only the receiver behaves this way :/

@Seb-Solon
Contributor

You are using screen; that is why they are not re-attached to PID 1. The issue may be in the signal handling part, then. You should be able to reproduce this by launching the receiver in the foreground.

@mohierf
Contributor Author

mohierf commented Jan 14, 2017

Note that on the demo server, this currently happens with the broker daemon (and always the broker) ... and on another server I noticed the same behavior with the poller daemon (and always the poller). I am updating the issue title ...

@mohierf mohierf changed the title Poller and reactionner need to be killed twice Some daemons need to be killed twice Jan 14, 2017
@fpeyre
Contributor

fpeyre commented Mar 27, 2017

I am re-uploading my logs from Alignak-monitoring/alignak-packaging#28 here.
I see some differences between how the broker and the scheduler stop.

When the scheduler stops, it writes the following lines to the log:

[2017-03-23 17:30:12 UTC] INFO: [alignak.daemons.schedulerdaemon] process 23621 received a signal: 15
[2017-03-23 17:30:12 UTC] INFO: [alignak.daemon] process 23621 received a signal: 15
[2017-03-23 17:30:12 UTC] INFO: [alignak.daemon] Stopping scheduler...
[2017-03-23 17:30:12 UTC] INFO: [alignak.daemon] Shutting down http_daemon...
[2017-03-23 17:30:12 UTC] INFO: [alignak.daemon] HTTP main thread exiting
[2017-03-23 17:30:12 UTC] INFO: [alignak.daemon] Joining http_thread...
[2017-03-23 17:30:12 UTC] INFO: [alignak.daemon] Shutting down manager...
[2017-03-23 17:30:13 UTC] INFO: [alignak.daemon] Stopping all modules...
[2017-03-23 17:30:13 UTC] INFO: [alignak.daemon] Stopped scheduler.

The broker logs only one line when it stops:

[2017-03-23 17:28:11 UTC] INFO: [alignak.daemon] process 23411 received a signal: 15

If I understand the code correctly, the log line process X received a signal: 15 comes from here.

So self.interrupted should now be True. This breaks the while loop of this function, which then calls request_stop() (see the code here).

So we should see a line in the log with

logger.info("Stopped %s.", self.name)

But we don't see it for the broker.
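
To make the reasoning above easier to follow, here is a minimal sketch of the flag-based stop pattern being described. It is not Alignak's actual code: the class MiniDaemon, the method names manage_signal, install_handlers and main_loop, and the in-memory log list are all hypothetical stand-ins. Only self.interrupted, request_stop() and the log messages are taken from this thread.

```python
import os
import signal
import time


class MiniDaemon:
    """Hypothetical stand-in for a daemon; not Alignak's real class."""

    def __init__(self, name):
        self.name = name
        self.interrupted = False
        self.log = []  # stands in for the real logger

    def manage_signal(self, signum, frame):
        # The handler only records the signal and raises the flag;
        # the main loop is responsible for the actual shutdown.
        self.log.append(
            "process %d received a signal: %d" % (os.getpid(), signum))
        self.interrupted = True

    def install_handlers(self):
        # Must be called from the main thread.
        signal.signal(signal.SIGTERM, self.manage_signal)

    def request_stop(self):
        self.log.append("Stopping %s..." % self.name)
        # Stopping the HTTP thread and the modules would happen here.
        # If any of those steps blocks, the line below is never reached,
        # which would match a log that ends right after the signal line.
        self.log.append("Stopped %s." % self.name)

    def main_loop(self):
        while not self.interrupted:
            time.sleep(0.1)
        self.request_stop()
```

In this sketch, a log that shows the "received a signal" line but never the "Stopped" line means execution stalled somewhere inside request_stop(), not in the signal handler itself.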

Another observation: the daemons only seem to have trouble stopping once they have received a configuration.
The line INFO: [alignak.daemon] Stopped scheduler. no longer appears once the scheduler has received its configuration.

I will investigate further (with the log level set to debug).

schedulerd.log.txt
brokerd.log.txt

@mohierf
Contributor Author

mohierf commented Mar 27, 2017

Great job @fpeyre and thanks for investigating this problem. I ran into it several times today when restarting Alignak on the demo server ... but I have not had time to investigate further yet :/

@mohierf
Contributor Author

mohierf commented Apr 5, 2017

Ok. At last I found the problem!

The problem happens with daemons that have attached modules, when those modules are waiting on a message queue. When the daemon stops its external modules, it sends a SIGTERM, tries to join the process and, if the process is still alive after a delay, sends a SIGKILL to kill it abruptly.

The problem with this approach is that the join method is blocking and, worse, the Queue get method is not interrupted when the process receives a signal. Thus, the external module never stops until its daemon sends it a SIGKILL.

There is probably nothing to do about this in the Alignak core ... only the modules should be concerned, but I am leaving this issue open for the moment.
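
As an illustration of the module-side fix hinted at above, here is a rough sketch assuming Python 3.7+ on a POSIX system. The names worker, stop_process, poll_interval and grace are hypothetical and do not come from Alignak or its modules. The idea is that the module polls its queue with a timeout, so it can notice a stop flag set by its SIGTERM handler, while the daemon side bounds the join and only falls back to SIGKILL as a last resort.

```python
import multiprocessing
import queue
import signal


def worker(to_q, poll_interval=0.5):
    """Hypothetical external module loop that stays responsive to SIGTERM."""
    stop = {"requested": False}

    def on_term(signum, frame):
        # Only raise a flag; the loop below checks it between polls.
        stop["requested"] = True

    signal.signal(signal.SIGTERM, on_term)
    while not stop["requested"]:
        try:
            # A bounded wait: a plain get() could block forever and
            # never give the loop a chance to see the stop flag.
            message = to_q.get(timeout=poll_interval)
        except queue.Empty:
            continue
        # A real module would handle `message` here.


def stop_process(process, grace=2.0):
    """Send SIGTERM, wait up to `grace` seconds, then SIGKILL if needed.

    Returns True if the process exited within the grace period.
    """
    process.terminate()          # SIGTERM on POSIX
    process.join(timeout=grace)  # bounded join instead of blocking forever
    if process.is_alive():
        process.kill()           # SIGKILL, available since Python 3.7
        process.join()
        return False
    return True
```

With a polling interval shorter than the grace period, the module should exit on the first SIGTERM, so a second kill would not be needed. The trade-off is a small amount of idle wake-ups in the module.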

@mohierf mohierf closed this as completed Apr 5, 2017