VGCN health check #16
I already found a solution on GitHub for a similar problem (https://github.com/HEP-Puppet/htcondor/blob/master/templates/20_workernode.config.erb & https://github.com/HEP-Puppet/htcondor/blob/master/files/healthcheck_wn_condor) and tested this approach myself (calling a healthcheck script on startup and from cron), and it worked pretty well. The next step would be to write the actual health checks, which I will discuss with @bgruening on Thursday.
Status: Andreas has a Python script that is started with Condor and blocks Condor as long as there is an error. Next steps are building VGCN and getting this script in as a PR, plus playing around with a local influxdb and sending events from this script to influxdb to indicate that a new node is available. We also discussed potential FaaS for creating training materials with planemo.
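As a rough illustration of the "block as long as there is an error" behaviour mentioned in this status note, here is a minimal Python sketch. It is not Andreas's script: the function names `wait_until_healthy` and `root_has_free_space`, the 1 GiB threshold, and the retry interval are all assumptions made for the example.

```python
import shutil
import time
from typing import Callable, Iterable, Optional


def root_has_free_space(min_bytes: int = 1 << 30) -> bool:
    """Hypothetical example check: at least 1 GiB free on the root filesystem."""
    return shutil.disk_usage("/").free >= min_bytes


def wait_until_healthy(
    checks: Iterable[Callable[[], bool]],
    interval: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    max_attempts: Optional[int] = None,
) -> int:
    """Block until every check returns True; return the number of rounds needed.

    With max_attempts=None this loops forever while any check fails, which is
    the "blocking" behaviour. A limit is offered only so callers can give up.
    """
    checks = list(checks)
    attempts = 0
    while True:
        attempts += 1
        # Run every check each round so all failures can be reported together.
        failed = [check.__name__ for check in checks if not check()]
        if not failed:
            return attempts
        if max_attempts is not None and attempts >= max_attempts:
            raise RuntimeError("health checks still failing: " + ", ".join(failed))
        sleep(interval)
```

A startup hook could call `wait_until_healthy([root_has_free_space])` and only proceed to start Condor once it returns.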
We generally don't need to send events to influxdb, our VGCN includes an influxdb stat of
Edit: it is how we built this graph: https://stats.galaxyproject.eu/d/000000021/galaxy-condor-cluster?refresh=15m&panelId=1&fullscreen&orgId=1
A little status update:
Do you have an idea what my problem could be?
Cool, could you share it somehow? As a pull request to this repo maybe? :)
This sounds nice in theory, but I am not sure what other services we might target. I have not discussed this extensively with Björn. In my mind, this was one script that we run, and based on its exit code the deployer of the script would write some wrapper that decides which services to start or stop. I had assumed Björn had only asked you to write the detection routine, and then I would supply something like
and as deployer I'd choose to run that just on boot, or from cron, or something else. But if it is in the scope of your project that you do these things additionally, then a config file sounds nice! :)
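For readers following along, the exit-code idea in this comment can be sketched in Python as below. This is an illustration only, not the snippet from the original comment: the function names, the `condor` service name, and the use of `systemctl` are assumptions, and `dry_run` defaults to `True` so that nothing is started or stopped by accident.

```python
import subprocess
import sys
from typing import List, Sequence


def service_action(exit_code: int) -> str:
    """Map a healthcheck exit code to a service action: 0 is healthy."""
    return "start" if exit_code == 0 else "stop"


def run_healthcheck(cmd: Sequence[str]) -> int:
    """Run the detection script and return its exit code without raising."""
    return subprocess.run(list(cmd), check=False).returncode


def apply_healthcheck(
    cmd: Sequence[str], service: str = "condor", dry_run: bool = True
) -> List[str]:
    """Run the healthcheck, then build (and optionally run) the systemctl call.

    The service name and systemctl are assumptions; a deployer would swap in
    whatever init system and service they actually use.
    """
    systemctl_cmd = ["systemctl", service_action(run_healthcheck(cmd)), service]
    if not dry_run:
        subprocess.run(systemctl_cmd, check=True)
    return systemctl_cmd


if __name__ == "__main__":
    # Usage: python wrapper.py /path/to/healthcheck [args...]
    print(" ".join(apply_healthcheck(sys.argv[1:])))
```

Keeping the detection script separate from the wrapper means the same script can be run on boot, from cron, or by hand, and only the wrapper decides what happens to the service.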
It is generally not possible to build images within VMs on the bwCloud. I'm amazed it got as far as the playbook; it should have crashed much earlier. But yes, umask is not a valid task attribute, and I have now removed it. I'm guessing it failed for you and not for us because you are using the newest version of Ansible? We use 2.7.1.
So I made a pull request with my first version. What do you think? Is there maybe something I should add or do differently?
The idea of my additional script was basically that, just implemented in Python: it checks if everything is healthy and stops the service if a problem occurs. As I understood it, the task of my project also included this idea. Depending on your needs (is a simple healthcheck script enough, or do you need more?) I can work on this further :)
Ok, good to know :D After your commit (and installing Ansible 2.7.1) I tried it one more time on the bwCloud, and it got further, up to this error:
I also tried to build it on my Mac (with the KVM accelerator disabled), which just didn't do anything, and on Google Cloud (as it supports nested VMs), this time with this error:
Unfortunately I don't have a local Linux machine available. I will still try a few things, but I think the best idea for me is to concentrate on the script first, at least until my next meeting with Björn.
The groupadd error is now fixed in #19.
It would be nice if we could check the image before Condor starts up and the node adds itself to the cluster.
A few ideas: