
program hangs and does not exit #46

Open
youngblood opened this issue Apr 23, 2020 · 28 comments

@youngblood

youngblood commented Apr 23, 2020

Trying Examples 1 & 2 from the "How to - Save Single Webpage" section in readme.md, as well as method 3 from examples.py. Using Python 3.7, pywebcopy 6.3, and one of the example URLs from examples.py: 'https://codeburst.io/building-beautiful-command-line-interfaces-with-python-26c7e1bb54df'

Issues: Methods 1 & 2 hang every time. Method 3 appears to be deprecated. Nothing appears in my log_file with this approach, so it's difficult to troubleshoot further. The join_timeout setting doesn't appear to have any effect either.

Based on the other open issue (#35), I also included the thread-closing loop from examples.py.

Files are downloading, but when I try to open the main HTML file it never shows any of the images (perhaps it never got to the point of saving them?).

My code, modified from examples:

import time
import threading

import pywebcopy

preferred_clock = time.time

project_url = 'https://codeburst.io/building-beautiful-command-line-interfaces-with-python-26c7e1bb54df'
project_folder = '/Users/user/Downloads/scraped_content'
project_name = 'example_project'

pywebcopy.config.setup_config(
	project_url=project_url,
	project_folder=project_folder,
	project_name=project_name,
	over_write=True,
	bypass_robots=True,
	debug=False,
	log_file='/Users/user/Downloads/scraped_content/pwc_log.log',
	join_timeout=30
)

start = preferred_clock()

# method_1 - This one hangs every time (never finishes so I have to halt).
'''
pywebcopy.save_webpage(url=project_url,
					   project_folder=project_folder,
					   project_name=project_name)
'''

# method_2 - This one also hangs every time
wp = pywebcopy.WebPage()
wp.get(project_url)
wp.save_complete()
wp.shutdown()

# method_3_from_examples.py - this one is deprecated: 
# "Direct initialisation with url is not supported now."
'''
pywebcopy.WebPage(url=project_url,project_folder=project_folder).save_complete()
'''

for thread in threading.enumerate():
    if thread == threading.main_thread():
        continue
    else:
        thread.join()

print("Execution time : ", preferred_clock() - start)```
@rajatomar788
Owner

The value of join_timeout is applied to each thread, and you have set it to 30, so it's most likely waiting on each thread for up to 30 seconds and looking like it froze.
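
To illustrate (this is just a sketch, not pywebcopy internals): joining each worker with a per-thread timeout means the waits add up, so with many asset threads a join_timeout of 30 can look like a freeze:

import threading
import time

def join_all(threads, per_thread_timeout):
    # Sequential joins: if the workers stay busy, this can block for up to
    # len(threads) * per_thread_timeout seconds in total before giving up.
    for t in threads:
        t.join(per_thread_timeout)

# Five workers that are still "downloading" when we try to join them.
workers = [threading.Thread(target=time.sleep, args=(10,), daemon=True) for _ in range(5)]
for w in workers:
    w.start()

start = time.time()
join_all(workers, per_thread_timeout=1)  # ~5 s here; with join_timeout=30 and slow asset threads the waits add up the same way
print("waited", round(time.time() - start, 1), "seconds")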

@rajatomar788
Owner

The log_file parameter has been removed due to flushing errors. See #36.

@youngblood
Author

youngblood commented Apr 24, 2020

Well I let it run (Method 2 above) for hours overnight last night, with just that one URL to scrape, but it was still stuck at the same place this morning:
[screenshot]

Then I set join_timeout to 5 and tried again, still with just the one codeburst.io URL, and got the same result.

Then I tried with two other URLs:
https://owl.purdue.edu/owl/general_writing/academic_writing/establishing_arguments/rhetorical_strategies.html
and
http://www.history.com/topics/cold-war/hollywood-ten

So then I tried using Method 2 on a list of 3 URLs, just to see if it would get past the first one. It didn't. It still hangs on the first URL in the list:

# -*- coding: utf-8 -*-

import os
import time
import threading

import pywebcopy

preferred_clock = time.time

project_url = 'https://codeburst.io/building-beautiful-command-line-interfaces-with-python-26c7e1bb54df'
# project_url = 'https://owl.purdue.edu/owl/general_writing/academic_writing/establishing_arguments/rhetorical_strategies.html'
# project_url = 'http://www.history.com/topics/cold-war/hollywood-ten'
project_folder = '/Users/reed/Downloads/scraped_content'
project_name = 'example_project'

pywebcopy.config.setup_config(
	project_url=project_url,
	project_folder=project_folder,
	project_name=project_name,
	over_write=True,
	bypass_robots=True,
	debug=False,
	log_file='/Users/reed/Downloads/scraped_content/pwc_log.log',
	join_timeout=5
)

start = preferred_clock()

# method_1 - This one hangs every time (never finishes so I have to halt).
'''
pywebcopy.save_webpage(url=project_url,
					   project_folder=project_folder,
					   project_name=project_name)
'''

# method_2 - I made this one up based on what I pieced together.
urls = [
	'https://codeburst.io/building-beautiful-command-line-interfaces-with-python-26c7e1bb54df',
	'https://owl.purdue.edu/owl/general_writing/academic_writing/establishing_arguments/rhetorical_strategies.html',
	'http://www.history.com/topics/cold-war/hollywood-ten'
]

for url in urls:
	wp = pywebcopy.WebPage()
	wp.get(url)
	wp.save_complete()
	wp.shutdown()
	for thread in threading.enumerate():
	    if thread == threading.main_thread():
	        continue
	    else:
	        thread.join()

# method_3_from_examples.py - this one is deprecated: 
# "Direct initialisation with url is not supported now."
'''
pywebcopy.WebPage(url=project_url,project_folder=project_folder).save_complete()
'''

print("Execution time : ", preferred_clock() - start)```

@rajatomar788
Owner

The internal code has changed a lot without examples.py being updated, so I would do it this way:

from pywebcopy import WebPage
from pywebcopy import config


def scrape(url, folder, timeout=1):
  
    config.setup_config(url, folder)

    wp = WebPage()
    wp.get(url)

    # start the saving process
    wp.save_complete()

    # join the sub threads
    for t in wp._threads:
        if t.is_alive():
           t.join(timeout)

    # location of the html file written 
    return wp.file_path
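
For example, a call could then look like this (the URL and folder name are just placeholders):

saved_file = scrape('https://example.com', 'saved_pages', timeout=5)
print(saved_file)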

@youngblood
Author

I'll try that one next - thank you for your help!

@rajatomar788
Owner

The issue should be fixed, so I am closing it.

@junyango

junyango commented Jun 2, 2020

Yeah @rajatomar788, it hangs now and then!

I'm just looping through a list of websites and calling the scrape method you suggested above.

It usually hangs with this log: "Queueing download of asset files."


from pywebcopy import WebPage
from pywebcopy import config

def scrape(url, folder, timeout=1):
  
    config.setup_config(url, folder)

    wp = WebPage()
    wp.get(url)

    # start the saving process
    wp.save_complete()

    # join the sub threads
    for t in wp._threads:
        if t.is_alive():
           t.join(timeout)

    # location of the html file written 
    return wp.file_path

def main():
	output_folder = "jy_scrape"

	links = initialize_list("Location to txt file")

	for link in links:
		try:
			scrape(link, output_folder)
		except Exception:
			continue

if __name__ == '__main__':
	main()

Another question: why not use the shutdown() method after save_complete()? I don't think save_assets within save_complete() is blocking, right? Is it the same?

I temporarily stopped the hanging by changing download_file so it is not multi-threaded.

@rajatomar788 rajatomar788 reopened this Jun 4, 2020
@rajatomar788
Owner

@junyango the program hanging could be the result of many factors; a slow connection could be one of them.
If you have used the code as above then you should definitely check your ping.

Another question: why not use the shutdown() method after save_complete()? I don't think save_assets within save_complete() is blocking, right? Is it the same?

Yes you could use any implementation, whichever works for you.

I temporarily stopped the hanging by changing download_file so it is not multi-threaded.

How did you do it?

@junyango

junyango commented Jun 4, 2020

@rajatomar788 did it hang in your case? Under the save_assets method in webpage.py, I just changed

for elem in elms:
    elem.run()
    # with POOL_LIMIT:
    #     t = threading.Thread(name=repr(elem), target=elem.run)
    #     t.start()
    #     self._threads.append(t)

Yeah, this temporarily stops the problem from occurring, since elem.run() calls download_file(), which does its own session.get implementation.

I realized that in your implementation the joining of threads is actually done in the shutdown() method, but having tried that it still doesn't work. It gets stuck at "Queuing download of assets", so I tried to isolate the problem and found that it was the code above that was causing it. Now it runs, albeit somewhat slower, but works pretty reliably I think.

I hope my understanding of the multi-threaded downloading portion is correct. What I just did reverts it back to a single-threaded downloading implementation.

@rajatomar788
Owner

@junyango your implementation just undoes the entire parallel downloading capability of pywebcopy.

I hope my understanding of the multi-threaded downloading portion is correct. What I just did reverts it back to a single-threaded downloading implementation.

Of course, single threading is error-proof. If you are not heavy-lifting image-filled pages then it should be good for you.

@junyango

junyango commented Jun 4, 2020

@rajatomar788 Yup, I wanted to get a working version up and running. However, given that the threads are joined with a timeout, the threading implementation should never hang either. I wonder if the people who initially raised this issue are still facing this problem.

Anyway, would like to thank you for the prompt replies! :)

@rajatomar788
Owner

@junyango

However, given that the threads are joined with a timeout, the threading implementation should never hang either. I wonder if the people who initially raised this issue are still facing this problem.

If their use case is simple then there shouldn't be any problem. But for the special cases, this thread will help them.

Regards.

@deleuzer

I'm still seeing this behavior. I've tried using the newer approach, but to no avail. I have discovered a few things, though. When using the non-threading approach, I see a number of unrecognized response errors in the debug logs.

elements   - ERROR    - URL returned an unknown response: [http://webdowntest.local/static/path/to/second/nothing.png]

However, those errors do not appear when using threading. Instead, the join() method gets called and no error messages get written.

It seems like there's a problem with how threads deal with assets that do not exist, though that's not the full story. When the head only contains non-existent references, the script completes. I don't understand threading enough to make sense of this.

I've narrowed this down to the simplest possible failure. On a local server, if I try to save_complete() on a site with a single index.html page that contains the markup below (but without any of the assets), the script does not complete.

However, if I comment out any single line in the tag (or I add the missing assets), the script completes and exits.

<html>
  <head>
    <meta charset="utf-8">
    <link href="https://use.fontawesome.com/releases/v5.0.8/css/all.css" rel="stylesheet">
    <link href="https://fonts.googleapis.com/css?family=Open+Sans" rel="stylesheet">
    <link href="https://fonts.googleapis.com/css?family=Roboto+Slab:700" rel="stylesheet">
    <link name="first-static-path-to-nothing" href="/static/path/to/first/nothing.png" >
    <link name="second-static-path-to-nothing" href="/static/path/to/second/nothing.png" >
</head>
  <body>
    Here comes everybody.
  </body>
</html>

Here's the script I'm running, modified to show threads that get joined and all active threads when the function completes.

from pywebcopy import WebPage
from pywebcopy import config

url = "http://webdowntest.local"

def scrape(url, folder, timeout=1):

    config.setup_config(url, folder, debug=True)

    wp = WebPage()
 
    wp.get(url)

    # start the saving process
    wp.save_complete()

    # join the sub threads
    for t in wp._threads:
        if t.is_alive():
           t.join(timeout)
           print(f'AFTER JOIN: {t.name}')
    
    active_threads = []
    for i in wp._threads:
        if i.is_alive():
            active_threads.append(i)

    # location of the html file written
    return [wp.file_path, active_threads]

a = scrape(url, 'webdown2')
print(a)

@rajatomar788
Owner

@deleuzer

When using the non-threading approach, I see a number of unrecognized response errors in the debug logs.

elements   - ERROR    - URL returned an unknown response: [http://webdowntest.local/static/path/to/second/nothing.png]

As you mentioned, these assets do not exist, so this error message is the correct log. If you are seeing this then it is working as expected.

However, if I comment out any single line in the tag (or I add the missing assets), the script completes and exits.

You can try a few things like:

  1. check your internet connection's ping
  2. check if the files you have added as assets do exist through a browser.
  3. try setting the timeout to 0 or 0.001 seconds.

Here's the script I'm running, modified to show threads that get joined and all active threads when the function completes.

Threads are marked non-active when they are joined, so in your case you won't get any active thread back because they all have been joined and become inactive. So there isn't any point in doing that.

@rajatomar788 rajatomar788 reopened this Jun 14, 2020
@rajatomar788 rajatomar788 changed the title from "Examples hang or are deprecated" to "program hangs and does not exit" Jun 14, 2020
@deleuzer

@deleuzer

elements - ERROR - URL returned an unknown response: [http://webdowntest.local/static/path/to/second/nothing.png]

As you mentioned, these assets do not exist, so this error message is the correct log. If you are seeing this then it is working as expected.

Obviously, that's why I pointed out the difference between non-threading and threading.

However, if I comment out any single line in the tag (or I add the missing assets), the script completes and exits.

You can try a few things like:

You can try to reproduce the problem following the very easy steps I gave you. That's why I offered such a robust comment with the simplest implementation of the problem.

1. check your internet connection's ping

Like I said, I'm querying a local server. As for the remote files, I can wget them without a problem, use the Python requests module to get them just fine, and download them without threading with this very package. The problem arises when threading is used.

2. check if the files you have added as assets do exist through a browser.

The point is that it should not matter if the assets are accessible or not. The package should handle this correctly by ignoring anything that throws a 404 error. And it does, unless it uses threads. So clearly your implementation of threading has a bug in it.

3. try setting the timeout to 0 or 0.001 seconds.

This has no impact other than making it hang sooner.

Here's the script I'm running, modified to show threads that get joined and all active threads when the function completes.

Threads are marked non-active when they are joined, so in your case you won't get any active thread back because they all have been joined and become inactive. So there isn't any point in doing that.

Actually, that's not correct. When the function completes (which is not always the case), I get a list of threads that were still registered as is_alive() after being joined. Unfortunately, even when the function completes, the process does not end.

@rajatomar788
Owner

If your use case is working without threading then you should go with that for now.
I will unit-test it a bit more, or you can contribute through a PR.

@youngblood
Author

I eventually went the slower single-threaded route using something similar to what @junyango suggested above, and that dramatically reduced the frequency with which it hung. I still encountered a few problematic sites that wouldn't finish saving even after being left for hours/overnight - though I had some others that did eventually finish after several hours of running, so perhaps the remaining "hangs" would eventually finish if I gave them enough time.

But that made me wonder: would it be possible to implement an overall timeout at the save_assets level, in addition to the per-asset timeout that currently exists? It seems like the existence of an overall timeout (even if it resulted in a few more errors per thousand pages) would resolve a lot of the pain here. Believe me, if I knew how to do this myself I would post a PR - but all my attempts ended with segfaults.
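
For what it's worth, here is a rough sketch of the kind of overall timeout I mean, wrapping the join loop from the scrape() example above in a single wall-clock deadline (join_with_deadline and overall_timeout are just names I made up for illustration, not pywebcopy API):

import time

def join_with_deadline(threads, overall_timeout=60):
    # Join worker threads, but never wait more than overall_timeout seconds in total.
    deadline = time.time() + overall_timeout
    for t in threads:
        remaining = deadline - time.time()
        if remaining <= 0:
            break  # give up on the stragglers instead of hanging forever
        t.join(remaining)

# e.g. after wp.save_complete():
# join_with_deadline(wp._threads, overall_timeout=120)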

Regardless, thanks for all of your help on this @rajatomar788 !

@rajatomar788
Owner

@youngblood

some others that did eventually finish after several hours of running, so perhaps the remaining "hangs" would eventually finish if I gave them enough time.

Preliminary examination suggests this is caused by the requests library or the underlying urllib3 library timeouts. I am trying to figure out the exact issue; if it is fixable then I will do that.

would it be possible to implement an overall timeout at the save_assets level, in addition to the per-asset timeout that currently exists?

pywebcopy is designed to separate the downloads of the different resources to avoid pulling everything into RAM and causing the computer to hang. So only a complete rewrite could allow such a thing, if it is possible at all.

Believe me, if I knew how to do this myself I would post a PR - but all my attempts ended with segfaults.

No worries :) Have a great day.

@GalDayan

GalDayan commented Jul 21, 2020

I am also facing the same problem. The program doesn't exit. Is there anything that can be done about it? @rajatomar788

@rajatomar788
Owner

Did you try it using a single thread as mentioned above? @GalDayan

@GalDayan

GalDayan commented Jul 21, 2020

@rajatomar788

Did you try it using a single thread as mentioned above? @GalDayan

I've tried single threading with this code:

for elem in elms:
    elem.run()
    # with POOL_LIMIT:
    #     t = threading.Thread(name=repr(elem), target=elem.run)
    #     t.start()
    #     self._threads.append(t)

But when I commented out that code, it downloaded only the HTML, without CSS, fonts, etc.

@ghost

ghost commented Oct 7, 2020

@deleuzer when I try your code something weird happens. On the first run, it gets stuck at "Queuing download of 100 assets" as @junyango mentioned, and I had to halt it. But when I ran the same code a second time it completed and stopped on its own. To be sure, I repeated this, and the same thing happened: the first run hung, the second run worked.

What could be the reason for this? It's very interesting.

EDIT: When I use your code together with the correction @junyango suggested, it finishes downloading a web page with 180 assets in 2-3 minutes, and it works on the first run. I guess this is the solution.

@CutePotatoDev

CutePotatoDev commented Mar 21, 2021

Hello.
Looks like the combination of these two is generating some kind of semaphore deadlock. I haven't figured out exactly what kind.

for elem in elms:

with POOL_LIMIT:

As a solution I offer this:

with multiprocessing.pool.ThreadPool(processes=5) as tp:
    for _ in tp.imap(lambda e: e.run(), elms):
        pass

where run() is simply:

def run(self):
    self.download_file()

P.S. I also suggest removing this one and replacing it with a parameter that allows setting the number of workers:

POOL_LIMIT = threading.Semaphore(5)
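
For context, here is a minimal standalone version of that pattern; the Element class below is only a stand-in for pywebcopy's asset elements, not its real API:

from multiprocessing.pool import ThreadPool

class Element:
    # Stand-in for an asset element; a real one would fetch and write a file.
    def __init__(self, url):
        self.url = url

    def download_file(self):
        print("downloading", self.url)

    def run(self):
        self.download_file()

elms = [Element("https://example.com/a.css"), Element("https://example.com/b.png")]

# A bounded pool of worker threads instead of one thread per asset guarded by a semaphore.
with ThreadPool(processes=5) as tp:
    for _ in tp.imap(lambda e: e.run(), elms):
        pass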

@gravelcycles

If anyone is facing this issue and needs pywebcopy to work, check out this version where I removed all multithreading. It did not seem that much slower, but it definitely does not hang!! Here's the commit in case we want to support a single-threaded version of pywebcopy in the future. I would be happy to make pywebcopy single-threaded by default with the optional feature of using multithreading.

From my limited experience with pywebcopy, multithreading is currently broken and makes it impossible to copy websites. Supporting single threading would make this package 1000000% better in my opinion :)

Note: I did try setting

POOL_LIMIT = threading.Semaphore(1) 

but that didn't seem to work.

cc @rajatomar788

https://github.com/davidwgrossman/pywebcopy

@rajatomar788
Owner

Hey @davidwgrossman

I appreciate you making pywebcopy single-threaded.

Originally pywebcopy was made to run on a single thread, but large graphics-heavy websites force you to think about multithreading, and time limitations on my side have prevented a proper implementation.

I would love to see a single-threaded pywebcopy with an optional multithreading feature, if you are up for the task.

@darawaleep

Hello @rajatomar788

First of all, thank you for this great library!

I added an additional implementation on top of @davidwgrossman's code to run the save_assets method multi-threaded.

It seems to fix the hanging issue, and I think we can control single- vs. multi-threading via the size of the pool.
Here is the commit:

darawaleep@6d0af9d

@rajatomar788
Owner

@darawaleep

The concurrent.futures library is not available in Python 2. Hence it can't be the solution for the multithreading issue we are currently facing.
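
For reference, a bounded worker pool that runs on both Python 2 and Python 3 can be written with just threading and Queue. This is only an illustrative sketch, not pywebcopy code; it assumes the elements expose run(), as in the threading code above:

import threading

try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2

def download_all(elms, workers=5):
    # Run elem.run() for every element using a fixed pool of worker threads.
    jobs = queue.Queue()
    for elem in elms:
        jobs.put(elem)

    def worker():
        while True:
            try:
                elem = jobs.get_nowait()
            except queue.Empty:
                return
            try:
                elem.run()
            finally:
                jobs.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.daemon = True        # a wedged download can't block interpreter shutdown
        t.start()
    for t in threads:
        t.join(timeout=60)     # bounded wait, same spirit as join_timeout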

@gravelcycles

@rajatomar788 here is a PR I put up to disable multithreading by default. Comments and suggestions are welcome :) #78
