
Jobs indicated as running never actually start. #42

Open

kbruegge opened this issue Apr 3, 2015 · 6 comments

kbruegge commented Apr 3, 2015

Hello!

I really like your project, but I'm having trouble running your example code in examples/manual.py.
When I run it, I get this promising output:

=====================================
========   Submit and Wait   ========
=====================================

sending function jobs to cluster.
2015-04-03 16:19:05,742 - gridmap.job - INFO - Setting up JobMonitor on tcp://10.194.168.53:52713

The output of qstat also looks fine:

$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 423383 0.56000 gridmap_jo <my_user_name>     r     04/03/2015 16:19:10 [email protected]     1
 423384 0.56000 gridmap_jo <my_user_name>     r     04/03/2015 16:19:10 [email protected]     1
 423385 0.56000 gridmap_jo <my_user_name>     r     04/03/2015 16:19:10 [email protected]     1
 423386 0.56000 gridmap_jo <my_user_name>     r     04/03/2015 16:19:10 [email protected]     1

As you can see, the jobs are indicated as (r)unning.

The problem, however, is that the jobs never actually seem to finish. That is odd, since the calculation only takes about 10 seconds when run locally, as expected given that the function sleep_walk(10) is being called.

I then modified your example to skip the sleep function and write out a file called test.txt instead (sketched below), but the file is never created.
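Just to illustrate, the modification looked roughly like this (a rough sketch of what I tried; the function name and contents here are only an example, not the exact code):

    def write_marker(path='test.txt'):
        # Instead of sleeping, write a small marker file so a successful run
        # is easy to verify on the shared filesystem.
        with open(path, 'w') as outfile:
            outfile.write('job ran\n')
        return path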

Which brings me to my second question: how do I use the JobMonitor feature? I didn't gather much information from your documentation, I'm afraid.

Any help is much appreciated. Also if there is any way I can contribute please let me know.

Kai

dan-blanchard (Contributor) commented

There is substantial overhead in starting jobs up on SGE (about a minute), so even when qstat says "running", that may not actually be true. GridMap is intended for tasks that take at least a few minutes to run; otherwise the overhead is not in any way worth it. The example is admittedly a bad one because the calculations are so fast, so all you'll notice is the overhead.

If you let it run for like 5 minutes and it still doesn't finish, then there's probably a real issue.

As for JobMonitor, if you want more info you can either set the logging level to DEBUG (which will give you a ton of information), or run gridmap_web, which will give you a web wrapper around JobMonitor. It isn't very feature-rich yet, so I usually just use JobMonitor with debug logging when things aren't working right.
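For example, something along these lines in the submitting script should be enough (a minimal sketch; gridmap just uses the standard logging module, as you can see from the logger names in your output):

    import logging

    # Show DEBUG messages from gridmap's loggers (gridmap.job, gridmap.runner, ...)
    # before calling process_jobs, so the JobMonitor traffic is visible.
    logging.basicConfig(level=logging.DEBUG)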

If you want to know more about how things work, check out this detailed rundown on the wiki.

I'm well aware that the documentation for GridMap could use some work (see #39), but I actually no longer actively use gridmap because I've changed jobs and now work at a company that doesn't use SGE (or any DRMAA-compatible grid). If you want to help out with documentation or by tackling any of the open issues, please make a PR. Thanks for offering!

kbruegge (Author) commented Apr 3, 2015

Thanks for your reply.
My jobs just hit the walltime limit, which was set to 2 hours, so there does seem to be something wrong :)
I also started some jobs at the DEBUG log level. The output looks okay as far as I can tell, though I don't know what job_id: -1 means. It just repeats the following lines over and over:

    .
    .
    .
    2015-04-03 17:23:26,986 - gridmap.runner - DEBUG - Connecting to JobMonitor (tcp://10.194.168.53:61096)
    2015-04-03 17:23:26,986 - gridmap.runner - DEBUG - Sending message: {'command': 'heart_beat', 'data': {}, 'ip_address': '10.194.168.53', 'host_name': 'the_host_name', 'job_id': -1}
    2015-04-03 17:23:26,987 - gridmap.job - DEBUG - Received message: {'command': 'heart_beat', 'data': {}, 'ip_address': '10.194.168.53', 'host_name': 'the_host_name', 'job_id': -1}
    2015-04-03 17:23:26,987 - gridmap.job - DEBUG - Checking if jobs are alive
    2015-04-03 17:23:26,987 - gridmap.job - DEBUG - Sending reply:
    2015-04-03 17:23:26,987 - gridmap.job - DEBUG - 0 out of 4 jobs completed
    2015-04-03 17:23:26,987 - gridmap.job - DEBUG - Waiting for message
    .
    .
    .

If I can get this to work on our clusters, I'll gladly contribute to the documentation as I figure things out. If it works for what I'm trying to do, a bunch of people from my group might use it as well.

dan-blanchard (Contributor) commented

The job_id: -1 means those messages are actually from the JobMonitor itself; that heartbeat is how it knows to check whether the jobs are alive and whether it has heard from them. If you don't see any messages from jobs with IDs other than -1, it implies you may have some sort of firewall issue preventing the workers from connecting to the JobMonitor.
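One quick way to check (just a sketch, assuming you can get a shell on a compute node) is to try opening a TCP connection from a worker host to the address the JobMonitor logged when it started up:

    import socket

    # Use the host and port from the "Setting up JobMonitor on tcp://..." line,
    # e.g. tcp://10.194.168.53:52713 in the original output above.
    with socket.create_connection(('10.194.168.53', 52713), timeout=5):
        print('this node can reach the JobMonitor')

If that times out or is refused, the workers can't report back and the jobs will appear to run forever.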

djoffe commented Feb 20, 2018

I am hitting the exact same issue. Was this ever fixed?
How would I see debug info from the worker jobs, to find out if these are firewall issues?

Thanks

djoffe commented Feb 20, 2018

Found the issue for my case, leaving some traces in case anyone else comes here:

I am using an SGE grid. Checking the job status after the jobs finished showed:

$ qacct -j 3555
==============================================================
...
failed       26  : opening input/output file
...

It turned out the default temp_dir (defined as /scratch/ in gridmap.conf) exists but is inaccessible in my case. This error is not caught by _append_job_to_session in job.py.
The default temp_dir can be overridden by passing temp_dir as an argument to process_jobs:

    job_outputs = process_jobs(
        functionJobs,
        max_processes=4,
        temp_dir='/path/to/tmp/',
    )

I am not sure what the intended way of overriding gridmap.conf default values is.
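Until then, a quick sanity check before submitting (sketch only; the path is just the example value from above, and the same directory also has to be reachable from the execution hosts) is to confirm the directory is actually writable:

    import tempfile

    temp_dir = '/path/to/tmp/'  # same value passed to process_jobs above
    # Fails immediately if the directory is missing or not writable,
    # rather than the jobs dying with "opening input/output file".
    with tempfile.NamedTemporaryFile(dir=temp_dir) as handle:
        handle.write(b'ok')
    print(temp_dir, 'is writable from this host')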

kalkairis commented

Running into the same issue as the people above me, with the following code:

    import gridmap


    def foo(x, y):
        return x * y


    if __name__ == "__main__":
        jobs = []

        for i in range(10):
            job = gridmap.Job(foo, [i, i + 1])
            jobs.append(job)
        job_outputs = gridmap.process_jobs(jobs, max_processes=4, quiet=False)
        print(job_outputs)

The code never reaches the print(job_outputs) line.
