Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parse_response refactor #87

Merged
merged 15 commits into from
May 30, 2024
Merged

parse_response refactor #87

merged 15 commits into from
May 30, 2024

Conversation

EvilDrPurple
Copy link
Contributor

Fixes

Description

  • Changes made according to the original issue
  • The tags dictionary is converted to a DataClass specified in DataClassTags.py
  • A unique function is created for each attribute being pulled and for the verification code

Testing

  1. Clone the repository and checkout the branch
  2. Create a virtual environment in the root directory and install the requirements via requirements.txt
  3. Run python html_tag_collector/collector.py [test_file.csv].
    test_file.csv can be any csv with a column labeled 'url' containing a set of urls.

Copy link
Collaborator

@maxachis maxachis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good! Before I approve, I'd like to see the module docstring added and the big if-conditional extracted to a separate function. The namedtuple is a nice-to-have, but I won't block on that.

html_tag_collector/collector.py Outdated Show resolved Hide resolved
if res is None:
tags["http_response"] = -1
return tags
return False, -1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: I won't block on this, but recommend converting multi-value responses to named tuples. At a glance, I don't know what each of these values represent, and a named tuple will greatly clarify this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maxachis like this? I've never used a namedtuple before so correct me if this isn't what you had in mind

def verify_response(res):
    """Verifies the webpage response is readable and ok.

    Args:
        res (HTMLResponse|Response): Response object to verify.

    Returns:
        VerifiedResponse(bool, int): A named tuple containing False if verification fails, True otherwise and the http response code.
    """
    VerifiedResponse = namedtuple("VerifiedResponse", "verified http_response")
    # The response is None if there was an error during connection, meaning there is no content to read
    if res is None:
        return VerifiedResponse(False, -1)

    # If the connection did not return a 200 code, we can assume there is no relevant content to read
    http_response = res.status_code
    if not res.ok:
        return VerifiedResponse(False, http_response)
        
    return VerifiedResponse(True, http_response)

Copy link
Collaborator

@maxachis maxachis May 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! Just make sure to have it outside of the function so that other parts of the code can reference it (and also don't forget that the fields are a comma-delimited list, so VerifiedResponse = namedtuple("VerifiedResponse", ["verified", "http_response"])

@@ -1,3 +1,4 @@
from dataclasses import asdict
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a docstring describing the function of collector.py here at the top of the function. This will make it easier for people in the 🤖 future 🤖 to understand at a glance what this module is doing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good description?

""" The tag collector is used to collect HTML tags and other relevant data from websites that is useful for training prediction models.
    Information being collected includes:
        - The URL's path
        - HTML title
        - Meta description
        - The root page's HTML title
        - HTTP response code
        - Contents of H1-H6 header tags
        - Contents of div tags
"""

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An excellent description 😀

Copy link
Collaborator

@maxachis maxachis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Approved!

@maxachis maxachis merged commit ddbea3b into main May 30, 2024
3 checks passed
@maxachis maxachis deleted the parse-response-59 branch May 30, 2024 20:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants