Incomplete Read #127

Open
I-dontcode opened this issue Apr 14, 2024 · 13 comments

Comments

@I-dontcode

I can't copy a full website; it errors out with something about IncompleteRead, see below. I'm using the given full-website copy code with my target URL. I've tried running the code multiple times, same issue. How can I fix this? I'm a dumb dumb, so explain it like I'm a 5-year-old :)

Traceback (most recent call last):
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\http\client.py", line 597, in _read_chunked
    value.append(self._safe_read(amt))
                 ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\http\client.py", line 642, in _safe_read
    raise IncompleteRead(data, amt-len(data))
http.client.IncompleteRead: IncompleteRead(483 bytes read, 1053 more expected)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\response.py", line 737, in _error_catcher
    yield
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\response.py", line 862, in _raw_read
    data = self._fp_read(amt, read1=read1) if not fp_closed else b""
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\response.py", line 845, in _fp_read
    return self._fp.read(amt) if amt is not None else self._fp.read()
           ^^^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\http\client.py", line 473, in read
    return self._read_chunked(amt)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\http\client.py", line 607, in _read_chunked
    raise IncompleteRead(b''.join(value)) from exc
http.client.IncompleteRead: IncompleteRead(0 bytes read)

@rajatomar788
Owner

Hey, @I-dontcode
The errors above originate from the Python standard library's http module and from urllib3.
So, did you try pywebcopy on a different website?
Is there a firewall or data rate limiter on your PC or router?

@I-dontcode
Author

I-dontcode commented Apr 14, 2024

No, I haven't tried a different site, though I did attempt the "save any single page" code, and it worked. I don't believe there's a firewall or data rate limiter causing the issue, because when I run the "save full website" code it runs smoothly until I hit that exception. It managed to copy around 40 GB of data before crashing.

Additionally, ChatGPT explained, "This error occurs when the client expects more data to be received than what is actually received. Specifically, it's indicating that the HTTP response body is being read in chunks, and the last chunk received is incomplete." I attempted to run the code multiple times to no avail, suspecting it might be a connection or network issue.

Is there a way for PyWebCopy to handle these exceptions? I stumbled upon a suggestion online to try a third-party library, 'recommended for a higher-level HTTP client interface... pip install requests.' However, I'm uncertain about its functionality or how it might interact with PyWebCopy or the 'save full website' code, as I don't have much of a coding background.

@rajatomar788
Owner

The library you found, 'requests', is the very thing pywebcopy already uses for its HTTP layer, so that part is being handled by requests itself. I think that after copying 40 GB there may have been server-side blacklisting or some load-handling action taken against your client.
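For anyone landing here later, a minimal sketch, not anything pywebcopy does internally, of how a bare requests session can be given retries and where this IncompleteRead surfaces at the requests level; the URL, timeout, and retry numbers are placeholders:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

url = 'https://www.example.com'  # placeholder target

session = requests.Session()
# Retry transient connection failures and common 5xx responses with back-off.
retries = Retry(total=5, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
session.mount('https://', HTTPAdapter(max_retries=retries))
session.mount('http://', HTTPAdapter(max_retries=retries))

try:
    response = session.get(url, timeout=30)
    response.raise_for_status()
    print(len(response.content), 'bytes downloaded')
except requests.exceptions.ChunkedEncodingError as exc:
    # requests wraps http.client.IncompleteRead in ChunkedEncodingError
    # when the server closes a chunked response early.
    print('Server closed the connection mid-response:', exc)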

@I-dontcode
Author

I can still access the page via a browser and download files. Does that mean anything with regard to blacklisting or load handling?

@rajatomar788
Owner

Yes, maybe. The user agent the library uses may be blocked. Or you could also try starting the script on just the page where it breaks.

Try opening the page directly with the requests library.

@I-dontcode
Author

I ran this code and got a response from the page, using my target URL of course. I'm guessing the user agent is not blocked?

import requests

url = 'https://www.example.com'
response = requests.get(url)

if response.status_code == 200:
    print(response.text)  # This will print the HTML content of the webpage
else:
    print('Failed to retrieve the webpage. Status code:', response.status_code)

I should note that I tried copying the page again and it failed at the same point, right after the "pywebcopy.elements:778" log line for the same file, which shows "already exists at:". Any ideas? I'm going to try deleting that file/folder and running it again.

@rajatomar788
Owner

Yes. Or you can just set overwrite=True.
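For reference, a minimal sketch of what a full-site copy with that flag might look like; the URL and folders are placeholders, and it assumes save_website passes overwrite through to the session config (check the version you have installed):

from pywebcopy import save_website

save_website(
    url='https://www.example.com',     # placeholder target URL
    project_folder='C:/saved_pages/',  # placeholder output folder
    project_name='my_site',
    bypass_robots=True,
    # Assumed keyword: overwrite files that already exist instead of skipping them.
    overwrite=True,
)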

@I-dontcode
Author

I just remembered something, although I'm not sure if it's related. I initially got an error running the website copy code.

ImportError: lxml.html.clean module is now a separate project lxml_html_clean.

So I ended up installing lxml-html-clean directly. Could this somehow be related?

@rajatomar788
Owner

Maybe not, because there is no use of lxml's clean functionality in the save methods.

@I-dontcode
Author

I-dontcode commented Apr 16, 2024

It failed with the same issue, although at a different file this time. Could this possibly be a character-limit issue with the directory/site path? Just spitballing.

@rajatomar788
Owner

Yes, it could be. The path limit on typical systems is around 256 characters, and since you are going very deep into the site, the generated paths could run into errors related to that limit.
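One quick way to check whether that is what's happening is to scan the partially saved project folder for paths near the limit. A minimal sketch, where the folder is a placeholder and 260 is the classic Windows MAX_PATH used as an assumption:

import os

project_folder = 'C:/saved_pages/my_site'  # placeholder: wherever pywebcopy was saving
MAX_PATH = 260  # classic Windows limit, counting the drive letter through the filename

for root, dirs, files in os.walk(project_folder):
    for name in files:
        full_path = os.path.join(root, name)
        if len(full_path) > MAX_PATH - 10:  # flag anything close to the limit
            print(len(full_path), full_path)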

@I-dontcode
Author

Does pywebcopy have the ability to check for this, and possibly truncate, before copying a file and its site/directory path? How can I avoid this, other than shortening the destination directory?

@rajatomar788
Owner

There is a function called url2path in the urls.py file. That function is responsible for generating file paths from URLs. It currently doesn't truncate paths, but you can customise the repo.
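As an illustration only (not pywebcopy's actual code, and both helper names are hypothetical), one common way to keep generated paths under the limit is to truncate long components and append a short hash so the names stay unique:

import hashlib
import os

MAX_COMPONENT = 100  # assumed per-component budget, not a pywebcopy setting

def shorten_component(name):
    # Truncate one path component, keeping the extension and adding a short hash.
    if len(name) <= MAX_COMPONENT:
        return name
    stem, ext = os.path.splitext(name)
    digest = hashlib.md5(name.encode('utf-8')).hexdigest()[:8]
    return stem[:MAX_COMPONENT - len(ext) - 9] + '-' + digest + ext

def shorten_path(path):
    # Apply the truncation to every component of a generated path.
    parts = path.replace('\\', '/').split('/')
    return os.path.join(*(shorten_component(p) for p in parts))

print(shorten_path('very/deep/nested/' + 'x' * 300 + '.html'))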
