Incomplete Read #127

Open
I-dontcode opened this issue Apr 14, 2024 · 13 comments

Comments

@I-dontcode

I can't copy a full website; it errors out with something about IncompleteRead, see below. I'm using the given full-website copy code with my target URL. I've tried running the code multiple times, same issue. How can I fix this? I'm a dumb dumb, so explain it like I'm a 5-year-old :)

Traceback (most recent call last):
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\http\client.py", line 597, in _read_chunked
    value.append(self._safe_read(amt))
                 ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\http\client.py", line 642, in _safe_read
    raise IncompleteRead(data, amt-len(data))
http.client.IncompleteRead: IncompleteRead(483 bytes read, 1053 more expected)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\response.py", line 737, in _error_catcher
    yield
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\response.py", line 862, in _raw_read
    data = self._fp_read(amt, read1=read1) if not fp_closed else b""
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\response.py", line 845, in _fp_read
    return self._fp.read(amt) if amt is not None else self._fp.read()
           ^^^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\http\client.py", line 473, in read
    return self._read_chunked(amt)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\http\client.py", line 607, in _read_chunked
    raise IncompleteRead(b''.join(value)) from exc
http.client.IncompleteRead: IncompleteRead(0 bytes read)

@rajatomar788
Owner

Hey, @I-dontcode
The errors above originate from the Python standard library's http module and from urllib3.
So, did you try pywebcopy on a different website?
Is there a firewall or data rate limiter on your PC or router?

@I-dontcode
Author

I-dontcode commented Apr 14, 2024

No, I haven't tried a different site, though I did attempt the "save any single page" code, and it worked. I don't believe there's a firewall or data rate limiter causing the issue, because when I run the "save full website" code it runs smoothly until I hit that exception. It managed to copy around 40 GB of data before crashing.

Additionally, ChatGPT explained, "This error occurs when the client expects more data to be received than what is actually received. Specifically, it's indicating that the HTTP response body is being read in chunks, and the last chunk received is incomplete." I attempted to run the code multiple times to no avail, suspecting it might be a connection or network issue.

Is there a way for PyWebCopy to handle these exceptions? I stumbled upon a suggestion online to try a third-party library, 'recommended for a higher-level HTTP client interface... pip install requests.' However, I'm uncertain about its functionality or how it might interact with PyWebCopy or the 'save full website' code, as I don't have much of a coding background.

@rajatomar788
Owner

The library you found, 'requests', is the very thing pywebcopy already uses for its HTTP layer, so that part is being handled by requests itself. I think that after copying 40 GB there may have been server-side blacklisting or some load-handling action taken against your client.
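For anyone landing here later, a minimal sketch, not anything pywebcopy does internally, of how a bare requests session can be given retries and where this IncompleteRead surfaces at the requests level; the URL, timeout, and retry numbers are placeholders:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

url = 'https://www.example.com'  # placeholder target

session = requests.Session()
# Retry transient connection failures and common 5xx responses with back-off.
retries = Retry(total=5, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
session.mount('https://', HTTPAdapter(max_retries=retries))
session.mount('http://', HTTPAdapter(max_retries=retries))

try:
    response = session.get(url, timeout=30)
    response.raise_for_status()
    print(len(response.content), 'bytes downloaded')
except requests.exceptions.ChunkedEncodingError as exc:
    # requests wraps http.client.IncompleteRead in ChunkedEncodingError
    # when the server closes a chunked response early.
    print('Server closed the connection mid-response:', exc)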

@I-dontcode
Author

I can still access the page via a browser and download files. Does that mean anything with regard to blacklisting or load handling?

@rajatomar788
Owner

Yes, maybe. The user agent the library uses may be blocked. Or you could also try starting the script on just the page where it breaks.

Try opening the page directly with the requests library.

@I-dontcode
Author

I ran this code and got a response from the page, using my target URL of course. I'm guessing the user agent is not blocked?

import requests

url = 'https://www.example.com'
response = requests.get(url)

if response.status_code == 200:
    print(response.text)  # This will print the HTML content of the webpage
else:
    print('Failed to retrieve the webpage. Status code:', response.status_code)

I should note that I tried copying the page again and it failed at the same point, right after the "pywebcopy.elements:778" log line for the same file, which shows "already exists at:". Any ideas? I'm going to try deleting that file/folder and running it again.

@rajatomar788
Owner

Yes. Or you can just set overwrite=True.
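For reference, a minimal sketch of what a full-site copy with that flag might look like; the URL and folders are placeholders, and it assumes save_website passes overwrite through to the session config (check the version you have installed):

from pywebcopy import save_website

save_website(
    url='https://www.example.com',     # placeholder target URL
    project_folder='C:/saved_pages/',  # placeholder output folder
    project_name='my_site',
    bypass_robots=True,
    # Assumed keyword: overwrite files that already exist instead of skipping them.
    overwrite=True,
)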

@I-dontcode
Author

I just remembered something, although I'm not sure if it's related. I initially got an error running the website copy code.

ImportError: lxml.html.clean module is now a separate project lxml_html_clean.

So I ended up installing lxml-html-clean directly. Could this somehow be related?

@rajatomar788
Owner

Maybe not, because there is no use of lxml's clean functionality in the save methods.

@I-dontcode
Author

I-dontcode commented Apr 16, 2024

It failed with the same issue, although at a different file this time. Could this possibly be a character-limit issue with the directory/site path? Just spitballing.

@rajatomar788
Owner

Yes, it could be. The path limit on typical systems is around 256 characters, and since you are going very deep into the site, the generated paths could run into errors related to that limit.
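One quick way to check whether that is what's happening is to scan the partially saved project folder for paths near the limit. A minimal sketch, where the folder is a placeholder and 260 is the classic Windows MAX_PATH used as an assumption:

import os

project_folder = 'C:/saved_pages/my_site'  # placeholder: wherever pywebcopy was saving
MAX_PATH = 260  # classic Windows limit, counting the drive letter through the filename

for root, dirs, files in os.walk(project_folder):
    for name in files:
        full_path = os.path.join(root, name)
        if len(full_path) > MAX_PATH - 10:  # flag anything close to the limit
            print(len(full_path), full_path)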

@I-dontcode
Author

Does pywebcopy have the ability to check for this, and possibly truncate, before copying a file and its site/directory path? How can I avoid this, other than shortening the destination directory?

@rajatomar788
Owner

There is a function called url2path in the urls.py file. That function is responsible for generating file paths from URLs. It currently doesn't truncate paths, but you can customise the repo.
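As an illustration only (not pywebcopy's actual code, and both helper names are hypothetical), one common way to keep generated paths under the limit is to truncate long components and append a short hash so the names stay unique:

import hashlib
import os

MAX_COMPONENT = 100  # assumed per-component budget, not a pywebcopy setting

def shorten_component(name):
    # Truncate one path component, keeping the extension and adding a short hash.
    if len(name) <= MAX_COMPONENT:
        return name
    stem, ext = os.path.splitext(name)
    digest = hashlib.md5(name.encode('utf-8')).hexdigest()[:8]
    return stem[:MAX_COMPONENT - len(ext) - 9] + '-' + digest + ext

def shorten_path(path):
    # Apply the truncation to every component of a generated path.
    parts = path.replace('\\', '/').split('/')
    return os.path.join(*(shorten_component(p) for p in parts))

print(shorten_path('very/deep/nested/' + 'x' * 300 + '.html'))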
