program hangs and does not exit #46
The internal code has changed a lot. Try something like this:

```python
from pywebcopy import WebPage
from pywebcopy import config

def scrape(url, folder, timeout=1):
    config.setup_config(url, folder)
    wp = WebPage()
    wp.get(url)
    # start the saving process
    wp.save_complete()
    # join the sub threads
    for t in wp._threads:
        if t.is_alive():
            t.join(timeout)
    # location of the html file written
    return wp.file_path
```
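A minimal usage sketch of the helper above, looping over a list of pages (the URLs and output folders are placeholders, not part of the original snippet):

```python
# Placeholder sites and folders, purely for illustration.
sites = [
    "https://example.com/",
    "https://example.org/",
]

for index, site in enumerate(sites):
    saved_path = scrape(site, "./saved_pages/site_%d" % index)
    print("saved:", saved_path)
```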
I'll try that one next - thank you for your help!
Issue should be fixed. I am closing it.
Yeah @rajatomar788, it hangs now and then! I'm just looping through a list of websites and calling the scrape method as you suggested above. Nevertheless, it usually hangs with this log: Queueing download of asset files.
Another question: why not use the shutdown() method after save_complete? I don't think save_assets within save_complete is blocking, right? Is it the same? I temporarily stopped the hanging by changing download_file to not be multi-threaded.
@junyango the program hanging could be the result of many factors; low ping could be one of them.
Yes you could use any implementation, whichever works for you.
How did you do it?
@rajatomar788 did it hang in your case? Under the save_assets method in webpage.py, I just changed
Yeah, this temporarily stops the problem from occurring, since elem.run calls download_file(), which does its own session.get implementation. I realized that in your implementation the joining of threads is actually done in the shutdown() method, but having tried that, it still doesn't work. It gets stuck at "Queuing download of assets", so I tried to isolate the problem and found that it was the code above that was causing it. Now it's running reliably, albeit somewhat slower. I hope my understanding of the multi-threaded downloading portion is correct; what I did just reverts it back to a single-threaded downloading implementation.
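A minimal sketch of the kind of single-threaded change being described, assuming an iterable of asset elements that each expose run() (the function name and the elements argument are illustrative, not the exact code edited in webpage.py):

```python
def save_assets_sequentially(elements):
    # Download each asset in the calling thread instead of spawning
    # one thread per asset; run() wraps download_file(), as noted above.
    for elem in elements:
        try:
            elem.run()
        except Exception as exc:
            # A single failed asset (e.g. a 404) should not abort the whole save.
            print("asset download failed:", exc)
```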
@junyango your implementation just undoes the entire parallel downloading capability of pywebcopy.
Of course single threading is error-proof. If you are not heavy-lifting image-filled pages then it should be good for you.
@rajatomar788 Yup, I wanted to get a working version up and running. However, given that threads.join() is called with a timeout, the threading implementation should in no case hang either. I wonder if the people who initially raised this issue are still facing this problem. Anyway, I would like to thank you for the prompt replies! :)
If their use case is simple then there shouldn't be any problem. But for the special cases, this thread will help them. Regards.
I'm still seeing this behavior. I've tried using the newer approach, but to no avail. I have discovered a few things, though. When using the non-threading approach, I see a number of unrecognized response errors in the debug logs.
However, those errors do not appear when using threading. Instead, the join() method gets called and no error messages get written. It seems like there's a problem with how threads deal with assets that do not exist, though that's not the full story. When the head only contains non-existent references, the script completes. I don't understand threading well enough to make sense of this. I've narrowed this down to the simplest possible failure. On a local server, if I try to save_complete() on a site with a single index.html page that contains the markup below (but without any of the assets), the script does not complete. However, if I comment out any single line in the head tag (or I add the missing assets), the script completes and exits.
Here's the script I'm running, modified to show threads that get joined and all active threads when the function completes.
As you mentioned, these assets do not exist, so this error message is the correct log. If you are seeing it, then it is working as expected.
You can try a few things like -
Threads are marked non-active when they are joined, so in your case you won't get any active threads back, because they have all been joined and become inactive. So there isn't any point in doing that.
Obviously, that's why I pointed out the difference between non-threading and threading.
Like I said, I'm querying a local server. As for the remote files, I can 'wget' them without a problem, use the Python requests module to get them just fine, and download them without threading with this very package. The problem arises when threading is used.
The point is that it should not matter if the assets are accessible or not. The package should handle this correctly by ignoring anything that throws a 404 error. And it does, unless it uses threads. So clearly your implementation of threading has a bug in it.
This has no impact other than speeding up the point of hanging.
Actually, that's not correct. When the function completes (which is not always the case), I get a list of threads that were registered as is_alive() after being joined. Unfortunately, even when the function completes, the process does not end.
If your use case is working out without threading then you should go for it for now.
I eventually went the slower single-threaded route using something similar to what @junyango suggested above, and that dramatically reduced the frequency with which it hung. I still encountered a few problematic sites that wouldn't finish saving even after being left for hours or overnight - though I had some others that did eventually finish after several hours of running, so perhaps the remaining "hangs" would eventually finish if I gave them enough time. But that made me wonder: would it be possible to implement an overall timeout at the save_assets level, in addition to the per-asset timeout that currently exists? It seems like the existence of an overall timeout (even if it resulted in a few more errors per thousand pages) would resolve a lot of the pain here. Believe me, if I knew how to do this myself I would post a PR - but all my attempts ended with segfaults. Regardless, thanks for all of your help on this @rajatomar788!
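One way to approximate the overall timeout being requested, without touching pywebcopy itself, is to run the save in a disposable worker thread and give up after a deadline. This is only a sketch built on the scrape() helper from earlier in the thread; scrape_with_deadline is a made-up name, and it abandons a hung save rather than killing it, so pywebcopy's own internal threads may still linger:

```python
import threading

def scrape_with_deadline(url, folder, deadline=300):
    """Give up waiting on a page after `deadline` seconds (sketch only)."""
    result = {}

    def worker():
        # scrape() is the helper suggested earlier in this thread.
        result["path"] = scrape(url, folder)

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(deadline)
    if t.is_alive():
        # Still hanging: report failure and let the caller move on.
        return None
    return result.get("path")
```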
Preliminary examination suggests this to be caused by
pywebcopy is designed to split up the download of different resources, to avoid pulling everything into RAM and causing the computer to hang. So only a complete rewrite could allow such a thing, if it is possible at all.
No worries :) Have a great day.
I am also facing the same problem. The program doesn't exit. Is there anything that can be done about it? @rajatomar788
Did you try it using a single thread as mentioned above? @GalDayan
I've tried to single-thread it with this code
But when I commented out this code, it downloaded only the HTML, without CSS, fonts, etc.
@deleuzer when I try your code something weird happens. On the first run, it gets stuck at "Queuing download of 100 assets" as @junyango mentioned. Then I had to halt it. But when I ran the same code a second time it completed and stopped on its own. To be sure, I did the same a second time. The same thing happened; the first run hung, the second run worked. What could be the reason for it? It's very interesting. EDIT: When I both try your code and make the correction @junyango suggested, it completes downloading a web page with 180 assets in 2-3 minutes. And it works on the first run. I guess this is the solution.
Hello.
pywebcopy/pywebcopy/webpage.py, line 230 in 2852d18
pywebcopy/pywebcopy/elements.py, line 56 in 2852d18
As a solution I suggest this:

```python
with multiprocessing.pool.ThreadPool(processes=5) as tp:
    for _ in tp.imap(lambda e: e.run(), elms):
        pass
```

where run() is:

```python
def run(self):
    self.download_file()
```

P.S. I also suggest removing this one and replacing it with a parameter that allows setting the number of worker processes: pywebcopy/pywebcopy/globals.py, line 97 in 2852d18
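Put together, the proposal above might look roughly like this as a standalone helper; the elements argument, the workers parameter, and the function name are assumptions for illustration, not pywebcopy's actual API:

```python
import multiprocessing.pool

def download_assets(elements, workers=5):
    # Each element is assumed to expose run(), which wraps download_file(),
    # as in the snippet above. The pool bounds concurrency to `workers`
    # threads and is torn down when the block exits.
    with multiprocessing.pool.ThreadPool(processes=workers) as tp:
        for _ in tp.imap(lambda e: e.run(), elements):
            pass
```

With workers=1 this degrades to roughly the single-threaded behaviour discussed earlier in the thread.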
If anyone is facing this issue and needs
From my limited experience with
Note: I did try setting
but that didn't seem to work.
Hey @davidwgrossman, I appreciate you making pywebcopy single-threaded. Originally pywebcopy was made to run on a single thread, but large graphics-filled websites would force you to think of multithreading; time limitations on my side have prevented a proper implementation. I would love to see a single-threaded pywebcopy with an optional multithreading feature, if you are up for the task.
Hello @rajatomar788. First of all, thank you for this great library! I added an additional implementation on top of @davidwgrossman's code. It seems to fix the hanging issue, and I think we can control single vs. multi-threading via the size of the pool.
@rajatomar788 here is a PR I put up to disable multithreading by default. Comments and suggestions are welcome :) #78 |
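The PR itself is not quoted here, but the general shape of such a default-off switch could look like the sketch below; the config key name, the ten-second join timeout, and the helper function are invented for illustration:

```python
import threading

# Hypothetical config flag: multithreading is off unless explicitly enabled.
config = {"use_threading": False}

def download_all(elements):
    if config["use_threading"]:
        threads = [threading.Thread(target=e.run) for e in elements]
        for t in threads:
            t.start()
        for t in threads:
            # Bounded wait so one stuck asset cannot hang the whole save.
            t.join(timeout=10)
    else:
        # Default path: sequential downloads, no threads left behind.
        for e in elements:
            e.run()
```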
Trying Examples 1 & 2 from the "How to - Save Single Webpage" section in readme.md, as well as method 3 from examples.py. Using Python 3.7, pywebcopy 6.3, and one of the example URLs from examples.py: 'https://codeburst.io/building-beautiful-command-line-interfaces-with-python-26c7e1bb54df'
Issues: Methods 1 & 2 hang every time. Method 3 appears to be deprecated. Nothing appears in my log_file with this approach, so it is difficult to troubleshoot further. And the join_timeout setting doesn't appear to have any effect.
Based on the other open issue (#35), I also included the thread-closing loop from examples.py.
Files are downloading, but when I try to open the main HTML file it never shows any of the images (perhaps it never got to the point of saving them?).
My code, modified from examples: