program hangs and does not exit #46
The internal code has changed a lot. Try something like this:

```python
from pywebcopy import WebPage
from pywebcopy import config

def scrape(url, folder, timeout=1):
    config.setup_config(url, folder)
    wp = WebPage()
    wp.get(url)
    # start the saving process
    wp.save_complete()
    # join the sub threads
    for t in wp._threads:
        if t.is_alive():
            t.join(timeout)
    # location of the html file written
    return wp.file_path
```
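A minimal usage sketch of the helper above, looping over a list of pages (the URLs and output folders are placeholders, not part of the original snippet):

```python
# Placeholder sites and folders, purely for illustration.
sites = [
    "https://example.com/",
    "https://example.org/",
]

for index, site in enumerate(sites):
    saved_path = scrape(site, "./saved_pages/site_%d" % index)
    print("saved:", saved_path)
```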
I'll try that one next - thank you for your help!
Issue should be fixed. I am closing it.
Yeah @rajatomar788, it hangs now and then! I'm just looping through a list of websites and calling the scrape method as you suggested above. Nevertheless, it usually hangs with this log: Queueing download of asset files.
Another question: why not use the shutdown() method after save_complete? I don't think save_assets within save_complete is blocking, right? Is it the same? I temporarily stopped the hanging by changing download_file to not be multi-threaded.
@junyango the program hanging could be the result of many factors; low ping could be one of them.
Yes you could use any implementation, whichever works for you.
How did you do it?
@rajatomar788 did it hang in your case? Under the save_assets method in webpage.py, I just changed
Yeah, this temporarily stops the problem from occurring, since elem.run calls download_file(), which does its own session.get implementation. I realized that in your implementation the joining of threads is actually done in the shutdown() method, but having tried that, it still doesn't work. It gets stuck at "Queuing download of assets", so I tried to isolate the problem and found that it was the code above that was causing it. Now it's running reliably, albeit somewhat slower. I hope my understanding of the multi-threaded downloading portion is correct; what I did just reverts it back to a single-threaded downloading implementation.
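A minimal sketch of the kind of single-threaded change being described, assuming an iterable of asset elements that each expose run() (the function name and the elements argument are illustrative, not the exact code edited in webpage.py):

```python
def save_assets_sequentially(elements):
    # Download each asset in the calling thread instead of spawning
    # one thread per asset; run() wraps download_file(), as noted above.
    for elem in elements:
        try:
            elem.run()
        except Exception as exc:
            # A single failed asset (e.g. a 404) should not abort the whole save.
            print("asset download failed:", exc)
```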
@junyango your implementation just undoes the entire parallel downloading capability of pywebcopy.
Of course single threading is error-proof. If you are not heavy-lifting image-filled pages then it should be good for you.
@rajatomar788 Yup, I wanted to get a working version up and running. However, given that threads.join() is called with a timeout, the threading implementation should in no case hang either. I wonder if the people who initially raised this issue are still facing this problem. Anyway, I would like to thank you for the prompt replies! :)
If their use case is simple then there shouldn't be any problem. But for the special cases, this thread will help them. Regards.
I'm still seeing this behavior. I've tried using the newer approach, but to no avail. I have discovered a few things, though. When using the non-threading approach, I see a number of unrecognized response errors in the debug logs.
However, those errors do not appear when using threading. Instead, the join() method gets called and no error messages get written. It seems like there's a problem with how threads deal with assets that do not exist, though that's not the full story. When the head only contains non-existent references, the script completes. I don't understand threading well enough to make sense of this. I've narrowed this down to the simplest possible failure. On a local server, if I try to save_complete() on a site with a single index.html page that contains the markup below (but without any of the assets), the script does not complete. However, if I comment out any single line in the head tag (or I add the missing assets), the script completes and exits.
Here's the script I'm running, modified to show threads that get joined and all active threads when the function completes.
As you mentioned, these assets do not exist, so this error message is the correct log. If you are seeing it, then it is working as expected.
You can try a few things like -
Threads are marked non-active when they are joined, so in your case you won't get any active threads back, because they have all been joined and become inactive. So there isn't any point in doing that.
Obviously, that's why I pointed out the difference between non-threading and threading.
Like I said, I'm querying a local server. As for the remote files, I can 'wget' them without a problem, use the Python requests module to get them just fine, and download them without threading with this very package. The problem arises when threading is used.
The point is that it should not matter if the assets are accessible or not. The package should handle this correctly by ignoring anything that throws a 404 error. And it does, unless it uses threads. So clearly your implementation of threading has a bug in it.
This has no impact other than speeding up the point of hanging.
Actually, that's not correct. When the function completes (which is not always the case), I get a list of threads that were registered as is_alive() after being joined. Unfortunately, even when the function completes, the process does not end.
If your use case is working out without threading then you should go for it for now.
I eventually went the slower single-threaded route using something similar to what @junyango suggested above, and that dramatically reduced the frequency with which it hung. I still encountered a few problematic sites that wouldn't finish saving even after being left for hours or overnight - though I had some others that did eventually finish after several hours of running, so perhaps the remaining "hangs" would eventually finish if I gave them enough time. But that made me wonder: would it be possible to implement an overall timeout at the save_assets level, in addition to the per-asset timeout that currently exists? It seems like the existence of an overall timeout (even if it resulted in a few more errors per thousand pages) would resolve a lot of the pain here. Believe me, if I knew how to do this myself I would post a PR - but all my attempts ended with segfaults. Regardless, thanks for all of your help on this @rajatomar788!
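One way to approximate the overall timeout being requested, without touching pywebcopy itself, is to run the save in a disposable worker thread and give up after a deadline. This is only a sketch built on the scrape() helper from earlier in the thread; scrape_with_deadline is a made-up name, and it abandons a hung save rather than killing it, so pywebcopy's own internal threads may still linger:

```python
import threading

def scrape_with_deadline(url, folder, deadline=300):
    """Give up waiting on a page after `deadline` seconds (sketch only)."""
    result = {}

    def worker():
        # scrape() is the helper suggested earlier in this thread.
        result["path"] = scrape(url, folder)

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(deadline)
    if t.is_alive():
        # Still hanging: report failure and let the caller move on.
        return None
    return result.get("path")
```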
Preliminary examination suggests this to be caused by
pywebcopy is designed to split up the download of different resources, to avoid pulling everything into RAM and causing the computer to hang. So only a complete rewrite could allow such a thing, if it is possible at all.
No worries :) Have a great day.
I am also facing the same problem. The program doesn't exit. Is there anything that can be done about it? @rajatomar788
Did you try it using a single thread as mentioned above? @GalDayan
I've tried to single-thread it with this code
But when I commented out this code, it downloaded only the HTML, without CSS, fonts, etc.
@deleuzer when I try your code something weird happens. On the first run, it gets stuck at "Queuing download of 100 assets" as @junyango mentioned. Then I had to halt it. But when I ran the same code a second time it completed and stopped on its own. To be sure, I did the same a second time. The same thing happened; the first run hung, the second run worked. What could be the reason for it? It's very interesting. EDIT: When I both try your code and make the correction @junyango suggested, it completes downloading a web page with 180 assets in 2-3 minutes. And it works on the first run. I guess this is the solution.
Hello.
pywebcopy/pywebcopy/webpage.py, line 230 in 2852d18
pywebcopy/pywebcopy/elements.py, line 56 in 2852d18
As a solution I suggest this:

```python
with multiprocessing.pool.ThreadPool(processes=5) as tp:
    for _ in tp.imap(lambda e: e.run(), elms):
        pass
```

where run() is:

```python
def run(self):
    self.download_file()
```

P.S. I also suggest removing this one and replacing it with a parameter that allows setting the number of worker processes: pywebcopy/pywebcopy/globals.py, line 97 in 2852d18
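Put together, the proposal above might look roughly like this as a standalone helper; the elements argument, the workers parameter, and the function name are assumptions for illustration, not pywebcopy's actual API:

```python
import multiprocessing.pool

def download_assets(elements, workers=5):
    # Each element is assumed to expose run(), which wraps download_file(),
    # as in the snippet above. The pool bounds concurrency to `workers`
    # threads and is torn down when the block exits.
    with multiprocessing.pool.ThreadPool(processes=workers) as tp:
        for _ in tp.imap(lambda e: e.run(), elements):
            pass
```

With workers=1 this degrades to roughly the single-threaded behaviour discussed earlier in the thread.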
If anyone is facing this issue and needs
From my limited experience with
Note: I did try setting
but that didn't seem to work.
Hey @davidwgrossman, I appreciate you making pywebcopy single-threaded. Originally pywebcopy was made to run on a single thread, but large graphics-filled websites would force you to think of multithreading; time limitations on my side have prevented a proper implementation. I would love to see a single-threaded pywebcopy with an optional multithreading feature, if you are up for the task.
Hello @rajatomar788. First of all, thank you for this great library! I added an additional implementation on top of @davidwgrossman's code. It seems to fix the hanging issue, and I think we can control single vs. multi-threading via the size of the pool.
@rajatomar788 here is a PR I put up to disable multithreading by default. Comments and suggestions are welcome :) #78 |
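The PR itself is not quoted here, but the general shape of such a default-off switch could look like the sketch below; the config key name, the ten-second join timeout, and the helper function are invented for illustration:

```python
import threading

# Hypothetical config flag: multithreading is off unless explicitly enabled.
config = {"use_threading": False}

def download_all(elements):
    if config["use_threading"]:
        threads = [threading.Thread(target=e.run) for e in elements]
        for t in threads:
            t.start()
        for t in threads:
            # Bounded wait so one stuck asset cannot hang the whole save.
            t.join(timeout=10)
    else:
        # Default path: sequential downloads, no threads left behind.
        for e in elements:
            e.run()
```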
Trying Examples 1 & 2 from the "How to - Save Single Webpage" section in readme.md, as well as method 3 from examples.py. Using Python 3.7, pywebcopy 6.3, and one of the example URLs from examples.py: 'https://codeburst.io/building-beautiful-command-line-interfaces-with-python-26c7e1bb54df'
Issues: Methods 1 & 2 hang every time. Method 3 appears to be deprecated. Nothing appears in my log_file with this approach, so it is difficult to troubleshoot further. And the join_timeout setting doesn't appear to have any effect.
Based on the other open issue (#35), I also included the thread-closing loop from examples.py.
Files are downloading, but when I try to open the main HTML file it never shows any of the images (perhaps it never got to the point of saving them?).
My code, modified from examples: