
Setting up JobMonitor Hanging #53

Open
cancan101 opened this issue Jul 21, 2015 · 5 comments

Comments

@cancan101

=====================================
========   Submit and Wait   ========
=====================================

sending function jobs to cluster

2015-07-21 18:15:13,077 - gridmap.job - INFO - Setting up JobMonitor on tcp://X.X.X.X:37905

Do I need to make sure to open certain ports?
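Since the JobMonitor listens on a plain TCP port (the one shown in the `Setting up JobMonitor on tcp://...` log line), one quick check is whether a worker node can open a TCP connection to that address at all. A minimal sketch of such a check (the host and port would be taken from your own log line):

```python
import socket

def can_reach(host, port, timeout=2.0):
    """Return True if a plain TCP connection to host:port succeeds.

    Run this from a worker node against the address in the
    'Setting up JobMonitor on tcp://...' log line to rule out
    firewall / security-group problems.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns False from a worker, the port is blocked between the worker and the submitting host.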

@cancan101
Author

With more logging:

sending function jobs to cluster

2015-07-22 07:16:39,356 - gridmap.job - INFO - Setting up JobMonitor on tcp://10.1.3.165:35470
2015-07-22 07:16:39,771 - gridmap.job - DEBUG - Starting local hearbeat
2015-07-22 07:16:39,773 - gridmap.job - DEBUG - Starting ZMQ event loop
2015-07-22 07:16:39,773 - gridmap.job - DEBUG - 0 out of 4 jobs completed
2015-07-22 07:16:39,773 - gridmap.job - DEBUG - Waiting for message
2015-07-22 07:16:39,776 - gridmap.runner - DEBUG - Connecting to JobMonitor (tcp://10.1.3.165:35470)
2015-07-22 07:16:39,777 - gridmap.runner - DEBUG - Sending message: {u'command': u'heart_beat', u'ip_address': '10.1.3.165', u'job_id': -1, u'data': {}, u'host_name': 'ip-10-1-3-165'}
2015-07-22 07:16:39,777 - gridmap.job - DEBUG - Received message: {u'host_name': 'ip-10-1-3-165', u'ip_address': '10.1.3.165', u'command': u'heart_beat', u'job_id': -1, u'data': {}}
2015-07-22 07:16:39,777 - gridmap.job - DEBUG - Checking if jobs are alive
2015-07-22 07:16:39,777 - gridmap.job - DEBUG - Sending reply: 
2015-07-22 07:16:39,778 - gridmap.job - DEBUG - 0 out of 4 jobs completed
2015-07-22 07:16:39,778 - gridmap.job - DEBUG - Waiting for message

@dan-blanchard
Contributor

It's not hanging then. You should check qstat to make sure that the jobs actually got started. If they start and disappear right away, that means there is probably an unpickling problem with the job (or a firewall issue where the workers can't talk to JobMonitor). You can investigate that by logging into those machines and looking in whatever directory you have specified for temp_dir (it defaults to /scratch/).
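To make that inspection quicker, a small helper can gather whatever files the jobs left under `temp_dir` on each worker (a sketch only; the `/scratch/` default is taken from the comment above):

```python
import os

def collect_job_files(temp_dir="/scratch/"):
    """Walk temp_dir and return the paths of all files found there,
    so worker-side output and tracebacks can be inspected after jobs
    start and then immediately disappear from qstat."""
    paths = []
    for root, _dirs, files in os.walk(temp_dir):
        for name in sorted(files):
            paths.append(os.path.join(root, name))
    return paths
```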

@cancan101
Author

It looks like the workers are trying to load drmaa and failing to do so:

Traceback (most recent call last):
  File "/home/ec2-user/.pyenv/versions/2.7.10/lib/python2.7/runpy.py", line 151, in _run_module_as_main
    mod_name, loader, code, fname = _get_module_details(mod_name)
  File "/home/ec2-user/.pyenv/versions/2.7.10/lib/python2.7/runpy.py", line 101, in _get_module_details
    loader = get_loader(mod_name)
  File "/home/ec2-user/.pyenv/versions/2.7.10/lib/python2.7/pkgutil.py", line 464, in get_loader
    return find_loader(fullname)
  File "/home/ec2-user/.pyenv/versions/2.7.10/lib/python2.7/pkgutil.py", line 474, in find_loader
    for importer in iter_importers(fullname):
  File "/home/ec2-user/.pyenv/versions/2.7.10/lib/python2.7/pkgutil.py", line 430, in iter_importers
    __import__(pkg)
  File "/home/ec2-user/.pyenv/versions/venv2710/lib/python2.7/site-packages/gridmap/__init__.py", line 69, in <module>
    from gridmap.conf import (CHECK_FREQUENCY, CREATE_PLOTS, DEFAULT_QUEUE,
  File "/home/ec2-user/.pyenv/versions/venv2710/lib/python2.7/site-packages/gridmap/conf.py", line 76, in <module>
    import drmaa
  File "/home/ec2-user/.pyenv/versions/venv2710/lib/python2.7/site-packages/drmaa/__init__.py", line 63, in <module>
    from .session import JobInfo, JobTemplate, Session
  File "/home/ec2-user/.pyenv/versions/venv2710/lib/python2.7/site-packages/drmaa/session.py", line 39, in <module>
    from drmaa.helpers import (adapt_rusage, Attribute, attribute_names_iterator,
  File "/home/ec2-user/.pyenv/versions/venv2710/lib/python2.7/site-packages/drmaa/helpers.py", line 36, in <module>
    from drmaa.wrappers import (drmaa_attr_names_t, drmaa_attr_values_t,
  File "/home/ec2-user/.pyenv/versions/venv2710/lib/python2.7/site-packages/drmaa/wrappers.py", line 56, in <module>
    _lib = CDLL(libpath, mode=RTLD_GLOBAL)
  File "/home/ec2-user/.pyenv/versions/2.7.10/lib/python2.7/ctypes/__init__.py", line 365, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /opt/sge/lib//libdrmaa.so: cannot open shared object file: No such file or directory

which is odd, since the environment has DRMAA_LIBRARY_PATH=/opt/sge/lib/lx-amd64/libdrmaa.so (and that file exists on the workers).
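Notably, the path in the traceback (`/opt/sge/lib//libdrmaa.so`) is not the one in DRMAA_LIBRARY_PATH, which suggests the worker process saw a different value of that variable (or a fallback) than the submitting shell did; batch jobs don't automatically inherit the submitting shell's environment. Roughly, the lookup behaves like this sketch (an illustration of the ordering, not drmaa-python's actual code):

```python
def resolve_libdrmaa(environ):
    """Sketch: prefer DRMAA_LIBRARY_PATH if set, otherwise fall back
    to a bare 'libdrmaa.so' left to the dynamic linker to resolve.

    Illustrative only -- not drmaa-python's actual implementation.
    """
    return environ.get("DRMAA_LIBRARY_PATH", "libdrmaa.so")

# Comparing the two environments shows how a mismatch can happen:
submit_env = {"DRMAA_LIBRARY_PATH": "/opt/sge/lib/lx-amd64/libdrmaa.so"}
worker_env = {}  # the variable was never exported to the job
```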

Why do the workers need to load drmaa?

@dan-blanchard
Contributor

Weird. I've never seen that raise an OSError when failing to import before. I've updated the code (5883a77) so it won't happen in the future.

As you point out, the workers shouldn't need drmaa.

@cancan101
Author

Okay, that explains the hanging.

That said, I'm now not seeing any warnings about the failed import in the log, because of this:

No handlers could be found for logger "gridmap.conf"
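That message is Python 2's logging module complaining that no handler is attached anywhere on the `gridmap.conf` logger's chain, so the warning about the failed drmaa import is silently dropped. Attaching any handler (e.g. via `logging.basicConfig()` in the driver script) makes it visible — a minimal sketch, using an in-memory stream and a stand-in message:

```python
import io
import logging

# Attach a handler so warnings from gridmap's loggers are actually emitted
# instead of triggering "No handlers could be found for logger ...".
stream = io.StringIO()
logger = logging.getLogger("gridmap.conf")
logger.addHandler(logging.StreamHandler(stream))

logger.warning("Could not import drmaa")  # stand-in for gridmap's warning
print(stream.getvalue().strip())  # → Could not import drmaa
```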
