Incomplete Read #127
I can't copy a full website, it errors out. Something about IncompleteRead, see below. I'm using the given full website copy code with my target URL. I've tried running the code multiple times, same issue. How can I fix this? I'm a dumb dumb, so explain it like I'm a 5-year-old :)

Traceback (most recent call last):
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\http\client.py", line 597, in _read_chunked
    value.append(self._safe_read(amt))
                 ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\http\client.py", line 642, in _safe_read
    raise IncompleteRead(data, amt-len(data))
http.client.IncompleteRead: IncompleteRead(483 bytes read, 1053 more expected)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\response.py", line 737, in _error_catcher
    yield
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\response.py", line 862, in _raw_read
    data = self._fp_read(amt, read1=read1) if not fp_closed else b""
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\response.py", line 845, in _fp_read
    return self._fp.read(amt) if amt is not None else self._fp.read()
           ^^^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\http\client.py", line 473, in read
    return self._read_chunked(amt)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\http\client.py", line 607, in _read_chunked
    raise IncompleteRead(b''.join(value)) from exc
http.client.IncompleteRead: IncompleteRead(0 bytes read)
Hey, @I-dontcode, have you tried a different site? And is there a firewall or data rate limiter on your network that could be cutting the connection?
No, I haven't tried a different site, though I did attempt the "save any single page" code, and it worked. I don't believe there's a firewall or data rate limiter causing the issue, because when I run the "save full website" code, it operates smoothly until I hit that exception. It managed to copy around 40 GB of data before crashing. Additionally, ChatGPT explained, "This error occurs when the client expects more data to be received than what is actually received. Specifically, it's indicating that the HTTP response body is being read in chunks, and the last chunk received is incomplete." I attempted to run the code multiple times to no avail, suspecting it might be a connection or network issue. Is there a way for PyWebCopy to handle these exceptions? I stumbled upon a suggestion online to try a third-party library, 'recommended for a higher-level HTTP client interface... pip install requests.' However, I'm uncertain how it works or how it might interact with PyWebCopy or the "save full website" code, as I don't have much of a coding background.
The library you found, requests, is the very thing pywebcopy uses for the HTTP part, so the HTTP side is already handled by requests. I think after copying 40 GB there may have been a server-side blacklisting or load-handling action against your client.
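If you want to test the flaky download outside pywebcopy, here is a minimal sketch (plain requests, not pywebcopy code; the attempt count and delay are arbitrary) that retries when requests raises ChunkedEncodingError, which is how requests surfaces IncompleteRead:

import time
import requests
from requests.exceptions import ChunkedEncodingError

def fetch_with_retries(url, attempts=3, delay=5):
    # requests reports the underlying http.client.IncompleteRead
    # as a ChunkedEncodingError, so catch that and retry
    for attempt in range(attempts):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response.content
        except ChunkedEncodingError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)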
I can still access the page via a browser and download files. Does that mean anything with regard to blacklisting or load handling?
Yes, maybe. The user agent the library uses may be blocked. Or you keep restarting the script at the very page where it breaks. Try opening that page using only the requests library directly.
I ran this code and got a response from the page, using my target URL of course. I'm guessing the user agent is not blocked?

import requests

url = 'https://www.example.com'
response = requests.get(url)
if response.status_code == 200:
    print('Page is reachable')

I should note, I tried copying the page again and it failed at the same point, right after trying to "pywebcopy.elements:778" the same file and showing "already exists at:". Any ideas? I'm going to try deleting that file/folder and running it again.
Yes. Or you can just set overwrite=True.
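For example, a sketch of the full-site call with that flag (the folder and project name are placeholders; whether your pywebcopy version accepts overwrite exactly like this may vary):

from pywebcopy import save_website

# overwrite=True is meant to replace files that were already saved
# instead of skipping them with "already exists at:"
save_website(
    url='https://www.example.com',  # your target url
    project_folder='C:/saves',      # placeholder folder
    project_name='example_site',    # placeholder name
    bypass_robots=True,
    overwrite=True,
)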
I just remembered something, although I'm not sure if it's related. I initially got an error running the website copy code: ImportError: lxml.html.clean module is now a separate project lxml_html_clean. So I ended up installing lxml-html-clean directly. Could this somehow be related?
Maybe not, because there is no use of lxml's clean functionality in the save methods.
It failed with the same issue, although at a different file this time. Could this possibly be a character-limit issue with the directory/site path? Just spitballing.
Yes, it could be. The path limit on most systems is around 256 characters, and since you are going very deep into the site, it could create errors related to character limits.
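To check whether you are actually hitting that limit, you can scan the partially saved tree with plain Python (the root folder is a placeholder, and 255 is the usual conservative threshold):

import os

root = r'C:\saves\example_site'  # placeholder: wherever the copy landed
for dirpath, dirnames, filenames in os.walk(root):
    for name in filenames:
        path = os.path.join(dirpath, name)
        if len(path) > 255:
            # paths this long commonly fail on Windows unless
            # long-path support is enabled
            print(len(path), path)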
Does pywebcopy have the ability to check for this, and possibly truncate, before copying a file and its site/directory path? How can I avoid this other than shortening the destination directory?
There is a function called url2path in the urls.py file. This function is responsible for generating file paths from URLs. It currently doesn't truncate paths, but you can customise the repo.
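A rough idea of the kind of truncation you could bolt on (a hypothetical helper, not the actual url2path; the 150-character budget is arbitrary):

import hashlib
import os

def shorten_path(path, limit=150):
    # replace an over-long tail with a short hash of the original
    # path, keeping the file extension so the type survives
    if len(path) <= limit:
        return path
    stem, ext = os.path.splitext(path)
    digest = hashlib.md5(path.encode('utf-8')).hexdigest()[:10]
    return stem[:limit - len(ext) - 11] + '-' + digest + ext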