
parse_response refactor #87

Merged
merged 15 commits on May 30, 2024
19 changes: 19 additions & 0 deletions html_tag_collector/DataClassTags.py
@@ -0,0 +1,19 @@
from dataclasses import dataclass

./html_tag_collector/DataClassTags.py:1:1: D100 Missing docstring in public module


@dataclass
class Tags:

./html_tag_collector/DataClassTags.py:5:1: D101 Missing docstring in public class
index: int = None
url: str = ""
url_path: str = ""
html_title: str = ""
meta_description: str = ""
root_page_title: str = ""
http_response: int = -1
h1: str = ""
h2: str = ""
h3: str = ""
h4: str = ""
h5: str = ""
h6: str = ""
div_text: str = ""
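
Replacing the old plain-dict tags with this dataclass gives typed defaults and one-line dict conversion. A minimal sketch, trimmed to three of the fields above (note the diff declares `index: int = None`, which runs fine but is more precisely typed as `Optional[int]`):

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Tags:
    # Trimmed stand-in for the full dataclass in the diff
    index: Optional[int] = None
    url: str = ""
    http_response: int = -1


tags = Tags(index=0, url="example.com/police")
record = asdict(tags)  # plain dict, same shape the old code built by hand
```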
263 changes: 209 additions & 54 deletions html_tag_collector/collector.py
@@ -1,3 +1,17 @@
""" The tag collector is used to collect HTML tags and other relevant data from websites that is useful for training prediction models.

./html_tag_collector/collector.py:1:1: D205 1 blank line required between summary line and description
./html_tag_collector/collector.py:1:1: D208 Docstring is over-indented
./html_tag_collector/collector.py:1:1: D210 No whitespaces allowed surrounding docstring text
./html_tag_collector/collector.py:1:80: E501 line too long (135 > 79 characters)
Information being collected includes:
- The URL's path
- HTML title
- Meta description
- The root page's HTML title
- HTTP response code
- Contents of H1-H6 header tags
- Contents of div tags
"""


from dataclasses import asdict
Collaborator

Add a docstring describing the function of collector.py here at the top of the function. This will make it easier for people in the 🤖 future 🤖 to understand at a glance what this module is doing.

Contributor Author

Good description?

""" The tag collector is used to collect HTML tags and other relevant data from websites that is useful for training prediction models.
    Information being collected includes:
        - The URL's path
        - HTML title
        - Meta description
        - The root page's HTML title
        - HTTP response code
        - Contents of H1-H6 header tags
        - Contents of div tags
"""

Collaborator

An excellent description 😀

from collections import namedtuple
import json
import ssl
import urllib3
@@ -20,6 +34,8 @@

from RootURLCache import RootURLCache
from common import get_user_agent
from DataClassTags import Tags


# Define the list of header tags we want to extract
header_tags = ["h1", "h2", "h3", "h4", "h5", "h6"]
@@ -162,27 +178,46 @@
except (KeyError, AttributeError):
pass

# If the response size is greater than 10 MB
# or the response is an unreadable content type
# or the response code from the website is not in the 200s
if (
response is not None and len(response.content) > 10000000
or content_type is not None and any(
filtered_type in content_type
for filtered_type in ["pdf", "excel", "msword", "image", "rtf", "zip", "octet", "csv", "json"]
)
or response is not None and not response.ok
):
# Discard the response content to prevent out of memory errors
if DEBUG:
print("Large or unreadable content discarded:", len(response.content), url)
new_response = requests.Response()
new_response.status_code = response.status_code
response = new_response
response = response_valid(response, content_type, url)

return {"index": index, "response": response}


def response_valid(response, content_type, url):
"""Checks the response to see if content is too large, unreadable, or invalid response code. The response is discarded if it is invalid.

./html_tag_collector/collector.py:187:1: D401 First line should be in imperative mood
./html_tag_collector/collector.py:187:80: E501 line too long (140 > 79 characters)

Args:
response (Response): Response object to check.
content_type (str): The content type returned by the website.
url (str): URL that was requested.

Returns:
Response: The response object is returned either unmodified or discarded.

./html_tag_collector/collector.py:195:80: E501 line too long (81 > 79 characters)
"""

./html_tag_collector/collector.py:196:8: W291 trailing whitespace
# If the response size is greater than 10 MB
# or the response is an unreadable content type
# or the response code from the website is not in the 200s
if (
response is not None
and len(response.content) > 10000000
or content_type is not None
and any(
filtered_type in content_type
for filtered_type in ["pdf", "excel", "msword", "image", "rtf", "zip", "octet", "csv", "json"]

./html_tag_collector/collector.py:206:80: E501 line too long (106 > 79 characters)
)
or response is not None
and not response.ok
):
# Discard the response content to prevent out of memory errors
if DEBUG:
print("Large or unreadable content discarded:", len(response.content), url)

./html_tag_collector/collector.py:213:80: E501 line too long (87 > 79 characters)
new_response = requests.Response()
new_response.status_code = response.status_code
response = new_response

./html_tag_collector/collector.py:217:1: W293 blank line contains whitespace
return response
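
The extracted condition reads as `(too large) or (filtered content type) or (non-2xx)` because `and` binds tighter than `or`. A sketch of that predicate on its own (`is_unusable` is a hypothetical helper name, not part of the diff):

```python
FILTERED_TYPES = ["pdf", "excel", "msword", "image", "rtf", "zip", "octet", "csv", "json"]


def is_unusable(content_length, content_type, ok):
    # Mirrors the discard condition: >10 MB, unreadable type, or non-2xx status
    return (
        content_length > 10_000_000
        or (content_type is not None
            and any(t in content_type for t in FILTERED_TYPES))
        or not ok
    )
```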


async def render_js(urls_responses):
"""Renders JavaScript from a list of urls.

@@ -231,70 +266,182 @@
"""Parses relevant HTML tags from a Response object into a dictionary.

Args:
url_response (list[dict]): List of dictionaries containing urls and theeir responses.
url_response (list[dict]): List of dictionaries containing urls and their responses.

./html_tag_collector/collector.py:269:80: E501 line too long (92 > 79 characters)

Returns:
list[dict]: List of dictionaries containing urls and relevant HTML tags.
dict: Dictionary containing the url and relevant HTML tags.
"""
remove_excess_whitespace = lambda s: " ".join(s.split()).strip()

tags = {}
tags = Tags()
res = url_response["response"]
tags["index"] = url_response["index"]
tags.index = url_response["index"]

# Drop hostname from urls to reduce training bias
tags.url, tags.url_path = get_url(url_response)

tags.root_page_title = remove_excess_whitespace(root_url_cache.get_title(tags.url))

./html_tag_collector/collector.py:280:80: E501 line too long (87 > 79 characters)

verified, tags.http_response = verify_response(res)
if verified is False:
return asdict(tags)

parser = get_parser(res)
if parser is False:
return asdict(tags)

try:
soup = BeautifulSoup(res.html.html, parser)
except (bs4.builder.ParserRejectedMarkup, AssertionError, AttributeError):
return asdict(tags)

tags.html_title = get_html_title(soup)

tags.meta_description = get_meta_description(soup)

tags = get_header_tags(tags, soup)

tags.div_text = get_div_text(soup)

# Prevents most bs4 memory leaks
if soup.html:
soup.html.decompose()

return asdict(tags)


def get_url(url_response):
"""Returns the url and url_path.

./html_tag_collector/collector.py:311:1: D401 First line should be in imperative mood

Args:
url_response (list[dict]): List of dictionaries containing urls and their responses.

./html_tag_collector/collector.py:314:80: E501 line too long (92 > 79 characters)

Returns:
(str, str): Tuple with the url and url_path.
"""
url = url_response["url"][0]
tags["url"] = url
if not url.startswith("http"):
url = "https://" + url
tags["url_path"] = urlparse(url).path[1:]
new_url = url
if not new_url.startswith("http"):
new_url = "https://" + new_url

tags["html_title"] = ""
tags["meta_description"] = ""
tags["root_page_title"] = remove_excess_whitespace(root_url_cache.get_title(tags["url"]))
# Drop hostname from urls to reduce training bias
url_path = urlparse(new_url).path[1:]
# Remove trailing backslash
if url_path and url_path[-1] == "/":
url_path = url_path[:-1]

return url, url_path
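
The hostname-dropping logic above can be exercised on its own. A small sketch (`path_without_host` is a hypothetical name mirroring `get_url`'s path handling):

```python
from urllib.parse import urlparse


def path_without_host(url):
    # Add a scheme if missing so urlparse treats the hostname as a hostname
    if not url.startswith("http"):
        url = "https://" + url
    # Drop the hostname and leading slash, then any trailing slash
    path = urlparse(url).path[1:]
    if path.endswith("/"):
        path = path[:-1]
    return path
```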


def verify_response(res):
"""Verifies the webpage response is readable and ok.

./html_tag_collector/collector.py:334:1: D401 First line should be in imperative mood

Args:
res (HTMLResponse|Response): Response object to verify.

Returns:
VerifiedResponse(bool, int): A named tuple containing False if verification fails, True otherwise and the http response code.

./html_tag_collector/collector.py:340:80: E501 line too long (133 > 79 characters)
"""
VerifiedResponse = namedtuple("VerifiedResponse", "verified http_response")
# The response is None if there was an error during connection, meaning there is no content to read

./html_tag_collector/collector.py:343:80: E501 line too long (103 > 79 characters)
if res is None:
tags["http_response"] = -1
return tags
return VerifiedResponse(False, -1)

tags["http_response"] = res.status_code
# If the connection did not return a 200 code, we can assume there is no relevant content to read

./html_tag_collector/collector.py:347:80: E501 line too long (101 > 79 characters)
http_response = res.status_code
if not res.ok:
return tags
return VerifiedResponse(False, http_response)

return VerifiedResponse(True, http_response)
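
The named tuple keeps the two return values self-describing while still unpacking like a plain tuple, which is how `parse_response` consumes it:

```python
from collections import namedtuple

VerifiedResponse = namedtuple("VerifiedResponse", "verified http_response")

result = VerifiedResponse(False, 404)
verified, http_response = result   # positional unpacking, as in parse_response
code = result.http_response        # or access by field name
```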


def get_parser(res):
"""Retrieves the parser type to use with BeautifulSoup.

./html_tag_collector/collector.py:356:1: D401 First line should be in imperative mood

Args:
res (HTMLResponse|Response): Response object to read the content-type from.

./html_tag_collector/collector.py:359:80: E501 line too long (83 > 79 characters)

Returns:
str|bool: A string of the parser to use, or False if not readable.
"""
# Attempt to read the content-type, set the parser accordingly to avoid warning messages

./html_tag_collector/collector.py:364:80: E501 line too long (92 > 79 characters)
try:
content_type = res.headers["content-type"]
except KeyError:
return tags
return False

# If content type does not contain "html" or "xml" then we can assume that the content is unreadable

./html_tag_collector/collector.py:370:80: E501 line too long (104 > 79 characters)
if "html" in content_type:
parser = "lxml"
elif "xml" in content_type:
parser = "lxml-xml"
else:
return tags
return False

try:
soup = BeautifulSoup(res.html.html, parser)
except (bs4.builder.ParserRejectedMarkup, AssertionError, AttributeError):
return tags
return parser


def get_html_title(soup):
"""Retrieves the HTML title from a BeautifulSoup object.

./html_tag_collector/collector.py:382:1: D401 First line should be in imperative mood

Args:
soup (BeautifulSoup): BeautifulSoup object to pull the HTML title from.

Returns:
str: The HTML title.
"""
html_title = ""

if soup.title is not None and soup.title.string is not None:
tags["html_title"] = remove_excess_whitespace(soup.title.string)
else:
tags["html_title"] = ""
html_title = remove_excess_whitespace(soup.title.string)

return html_title


def get_meta_description(soup):
"""Retrieves the meta description from a BeautifulSoup object.

./html_tag_collector/collector.py:399:1: D401 First line should be in imperative mood

Args:
soup (BeautifulSoup): BeautifulSoup object to pull the meta description from.

./html_tag_collector/collector.py:402:80: E501 line too long (85 > 79 characters)

Returns:
str: The meta description.
"""
meta_tag = soup.find("meta", attrs={"name": "description"})
try:
tags["meta_description"] = remove_excess_whitespace(meta_tag["content"]) if meta_tag is not None else ""
meta_description = remove_excess_whitespace(meta_tag["content"]) if meta_tag is not None else ""

./html_tag_collector/collector.py:409:80: E501 line too long (104 > 79 characters)
except KeyError:
tags["meta_description"] = ""
return ""

return meta_description


def get_header_tags(tags, soup):
"""Updates the Tags DataClass with the header tags.

./html_tag_collector/collector.py:417:1: D401 First line should be in imperative mood

Args:
tags (Tags): DataClass for relevant HTML tags.
soup (BeautifulSoup): BeautifulSoup object to pull the header tags from.

./html_tag_collector/collector.py:421:80: E501 line too long (80 > 79 characters)

Returns:
Tags: DataClass with updated header tags.
"""
for header_tag in header_tags:
headers = soup.find_all(header_tag)
# Retreives and drops headers containing links to reduce training bias
# Retrieves and drops headers containing links to reduce training bias
header_content = [header.get_text(" ", strip=True) for header in headers if not header.a]
tags[header_tag] = json.dumps(header_content, ensure_ascii=False)
tag_content = json.dumps(header_content, ensure_ascii=False)
setattr(tags, header_tag, tag_content)

return tags
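
Because `h1`-`h6` are now dataclass fields rather than dict keys, the loop assigns them with `setattr` keyed by the tag name. A trimmed sketch (the two-field `HeaderTags` class is illustrative, standing in for the full `Tags` dataclass):

```python
import json
from dataclasses import dataclass


@dataclass
class HeaderTags:
    h1: str = ""
    h2: str = ""


tags = HeaderTags()
for header_tag, content in [("h1", ["Police Records"]), ("h2", [])]:
    # Store each header's text list as a JSON string, as the diff does
    setattr(tags, header_tag, json.dumps(content, ensure_ascii=False))
```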


def get_div_text(soup):
"""Retrieves the div text from a BeautifulSoup object.

./html_tag_collector/collector.py:437:1: D401 First line should be in imperative mood

Args:
soup (BeautifulSoup): BeautifulSoup object to pull the div text from.

Returns:
str: The div text.
"""
# Extract max 500 words of text from HTML <div>'s
div_text = ""
MAX_WORDS = 500
@@ -307,14 +454,22 @@
else:
break # Stop adding text if word limit is reached

# truncate to 5000 characters in case of run-on 'words'
tags["div_text"] = div_text[:MAX_WORDS * 10]
# Truncate to 5000 characters in case of run-on 'words'
div_text = div_text[: MAX_WORDS * 10]

# Prevents most bs4 memory leaks
if soup.html:
soup.html.decompose()
return div_text

return tags
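
The character cap complements the word cap: 500 words times a generous 10 characters per word gives the 5000-character ceiling. A sketch of just the truncation step (`truncate_div_text` is a hypothetical helper name):

```python
MAX_WORDS = 500


def truncate_div_text(div_text):
    # Guard against run-on "words" by capping total characters as well
    return div_text[: MAX_WORDS * 10]
```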

def remove_excess_whitespace(s):
"""Removes leading, trailing, and excess adjacent whitespace.

./html_tag_collector/collector.py:464:1: D401 First line should be in imperative mood

Args:
s (str): String to remove whitespace from.

Returns:
str: Clean string with excess whitespace stripped.
"""
return " ".join(s.split()).strip()
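
The module-level `remove_excess_whitespace` replaces the old lambda; `str.split()` with no argument splits on any run of whitespace, so a single join normalizes tabs, newlines, and repeated spaces:

```python
def remove_excess_whitespace(s):
    """Collapse runs of whitespace to single spaces and trim the ends."""
    # .strip() is redundant after the join but kept to match the diff
    return " ".join(s.split()).strip()


cleaned = remove_excess_whitespace("  Daily\t\tCrime   Log \n Report ")
```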


def collector_main(df, render_javascript=False):