parse_response refactor #87

EvilDrPurple · 2024-05-28T21:34:50Z

Fixes

Refactor parse_response() in collector.py #59

Description

Changes made according to the original issue
The tags dictionary is converted to a DataClass specified in DataClassTags.py
A unique function is created for each attribute being pulled and for the verification code
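
For readers who have not seen the diff, the shape of the refactor described above might look roughly like the sketch below. This is only an illustration: the `Tags` field names, the toy `get_html_title` helper, and the `collect_tags` wrapper are assumptions made for this example, not the actual contents of DataClassTags.py or collector.py.

```python
from dataclasses import asdict, dataclass


@dataclass
class Tags:
    """Hypothetical stand-in for the dataclass defined in DataClassTags.py."""

    url_path: str = ""
    html_title: str = ""
    meta_description: str = ""
    http_response: int = -1


def get_html_title(raw_html: str) -> str:
    """Toy extractor: return the text between <title> tags, or "" if absent."""
    start = raw_html.find("<title>")
    end = raw_html.find("</title>")
    if start == -1 or end == -1:
        return ""
    return raw_html[start + len("<title>"):end].strip()


def collect_tags(url_path: str, status_code: int, raw_html: str) -> dict:
    """Hypothetical wrapper: one small function per attribute fills the dataclass."""
    tags = Tags(url_path=url_path, http_response=status_code)
    tags.html_title = get_html_title(raw_html)
    # asdict() converts the dataclass back to a plain dict for callers that expect one
    return asdict(tags)
```

Each attribute gets its own small function, so a single extractor can be tested or replaced without touching the rest of the parsing.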

Testing

Clone the repository and checkout the branch
Create a virtual environment in the root directory and install the requirements via requirements.txt
Run python html_tag_collector/collector.py [test_file.csv].
test_file.csv can be any csv with a column labeled 'url' containing a set of urls.
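
If no suitable csv is at hand, a small one can be generated with a helper like the sketch below. The `write_test_csv` function and the example URLs are assumptions made for illustration and are not part of the repository.

```python
import csv


def write_test_csv(path, urls):
    """Write a csv with a single 'url' column, one URL per row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url"])
        writer.writeheader()
        for url in urls:
            writer.writerow({"url": url})


if __name__ == "__main__":
    # Placeholder URLs; substitute any pages you want to collect tags from
    write_test_csv("test_file.csv", ["https://example.com", "https://example.org"])
```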

maxachis

Looks pretty good! Before I approve, I'd like to see the module docstring added and the big if-conditional extracted to a separate function. The namedtuple is a nice-to-have, but I won't block on that.

maxachis · 2024-05-29T11:17:58Z

html_tag_collector/collector.py

    if res is None:
-        tags["http_response"] = -1
-        return tags
+        return False, -1


NIT: I won't block on this, but I recommend converting multi-value responses to named tuples. At a glance, I don't know what each of these values represents, and a named tuple will greatly clarify this.

@maxachis like this? I've never used a namedtuple before so correct me if this isn't what you had in mind

    def verify_response(res):
        """Verifies the webpage response is readable and ok.

        Args:
            res (HTMLResponse|Response): Response object to verify.

        Returns:
            VerifiedResponse(bool, int): A named tuple containing False if verification fails, True otherwise and the http response code.
        """
        VerifiedResponse = namedtuple("VerifiedResponse", "verified http_response")
        # The response is None if there was an error during connection, meaning there is no content to read
        if res is None:
            return VerifiedResponse(False, -1)

        # If the connection did not return a 200 code, we can assume there is no relevant content to read
        http_response = res.status_code
        if not res.ok:
            return VerifiedResponse(False, http_response)

        return VerifiedResponse(True, http_response)

Yep! Just make sure to have it outside of the function so that other parts of the code can reference it (and also don't forget that the fields are a comma-delimited list, so VerifiedResponse = namedtuple("VerifiedResponse", ["verified", "http_response"])).
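
A minimal sketch of the suggestion, with the named tuple moved to module level. `FakeResponse` is a hypothetical stand-in for a real response object and is used only to exercise the function here. Note that `namedtuple` accepts either a list of field names or a single string of names separated by whitespace or commas, so both spellings in this thread define the same fields.

```python
from collections import namedtuple

# Defined at module level so other parts of the code can import and reference it
VerifiedResponse = namedtuple("VerifiedResponse", ["verified", "http_response"])


def verify_response(res):
    """Return VerifiedResponse(verified, http_response) for a response object."""
    # The response is None if the connection failed, so there is nothing to read
    if res is None:
        return VerifiedResponse(False, -1)

    http_response = res.status_code
    if not res.ok:
        return VerifiedResponse(False, http_response)

    return VerifiedResponse(True, http_response)


# Hypothetical stand-in exposing only the two attributes the function reads
FakeResponse = namedtuple("FakeResponse", ["ok", "status_code"])

result = verify_response(FakeResponse(ok=False, status_code=404))
print(result.verified, result.http_response)  # prints: False 404
```

Callers can read `result.verified` and `result.http_response` by name, and can still unpack the result like a plain tuple.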

maxachis · 2024-05-29T11:19:48Z

html_tag_collector/collector.py

@@ -1,3 +1,4 @@
+from dataclasses import asdict


Add a docstring describing the purpose of collector.py here at the top of the module. This will make it easier for people in the 🤖 future 🤖 to understand at a glance what this module is doing.

Good description?

    """
    The tag collector is used to collect HTML tags and other relevant data from websites that is useful for training prediction models.
    Information being collected includes:
        - The URL's path
        - HTML title
        - Meta description
        - The root page's HTML title
        - HTTP response code
        - Contents of H1-H6 header tags
        - Contents of div tags
    """

An excellent description 😀

maxachis

Looks good! Approved!

EvilDrPurple added 12 commits May 23, 2024 12:56

Create dataclass

b88c681

Add dataclass parameters

6251475

Create get_url function

3c5f646

Add get_html_title function

7d1d5eb

Add get_meta_description function

f1bd4df

Add get_header_tags function

3b74673

Add get_div_text function

bb21b18

Add verify_response function

22c00e2

Add get_parser function

c4b3ce5

Restructure the functions

927cbb7

Cleanup comments

0c693c3

Format with black

12ad80a

EvilDrPurple requested a review from josh-chamberlain as a code owner May 28, 2024 21:34

EvilDrPurple requested a review from maxachis May 28, 2024 21:35

maxachis requested changes May 29, 2024

EvilDrPurple added 3 commits May 29, 2024 10:57

Add check_response

9afb1ee

Convert function return to namedtuple

0137a6f

Add docstring

e40e472

maxachis approved these changes May 29, 2024

josh-chamberlain approved these changes May 30, 2024

maxachis merged commit ddbea3b into main May 30, 2024
3 checks passed

maxachis deleted the parse-response-59 branch May 30, 2024 20:07
