-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
parse_response refactor #87
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good! Before I approve, I'd like to see the module docstring added and the big if-conditional extracted to a separate function. The namedtuple is a nice-to-have, but I won't block on that.
html_tag_collector/collector.py
Outdated
if res is None: | ||
tags["http_response"] = -1 | ||
return tags | ||
return False, -1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: I won't block on this, but recommend converting multi-value responses to named tuples. At a glance, I don't know what each of these values represent, and a named tuple will greatly clarify this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maxachis like this? I've never used a namedtuple before so correct me if this isn't what you had in mind
def verify_response(res):
"""Verifies the webpage response is readable and ok.
Args:
res (HTMLResponse|Response): Response object to verify.
Returns:
VerifiedResponse(bool, int): A named tuple containing False if verification fails, True otherwise and the http response code.
"""
VerifiedResponse = namedtuple("VerifiedResponse", "verified http_response")
# The response is None if there was an error during connection, meaning there is no content to read
if res is None:
return VerifiedResponse(False, -1)
# If the connection did not return a 200 code, we can assume there is no relevant content to read
http_response = res.status_code
if not res.ok:
return VerifiedResponse(False, http_response)
return VerifiedResponse(True, http_response)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep! Just make sure to have it outside of the function so that other parts of the code can reference it (and also don't forget that the fields are a comma-delimited list, so VerifiedResponse = namedtuple("VerifiedResponse", ["verified", "http_response"])
@@ -1,3 +1,4 @@ | |||
from dataclasses import asdict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a docstring describing the function of collector.py
here at the top of the function. This will make it easier for people in the 🤖 future 🤖 to understand at a glance what this module is doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good description?
""" The tag collector is used to collect HTML tags and other relevant data from websites that is useful for training prediction models.
Information being collected includes:
- The URL's path
- HTML title
- Meta description
- The root page's HTML title
- HTTP response code
- Contents of H1-H6 header tags
- Contents of div tags
"""
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An excellent description 😀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Approved!
Fixes
Description
tags
dictionary is converted to a DataClass specified inDataClassTags.py
Testing
requirements.txt
python html_tag_collector/collector.py [test_file.csv]
.test_file.csv
can be any csv with a column labeled 'url' containing a set of urls.