Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parser cleanup #156

Merged
merged 2 commits into from
Oct 6, 2024
Merged

Parser cleanup #156

merged 2 commits into from
Oct 6, 2024

Conversation

aarosan
Copy link
Collaborator

@aarosan aarosan commented Sep 25, 2024

parser: break this module into smaller, more readable methods #106

parser: update logic to not require 'test' as a parameter and instead pass everything needed for a test within the unit test #111

parser: add type specification to methods #110

parser: add logging #109

parser: add error handling #108

logger.info("Logger configured")
return logger

def get_class_and_method(self, logger, county: str, test = False) -> Tuple[Optional[object], Optional[callable]]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a small thing and nothing to do on this, but I think we should note that get_class_and_method should return an object (the class) and callable (the method)--ie. not optional--or else I'd consider it an error.

return logger

def get_class_and_method(self, logger, county: str, test = False) -> Tuple[Optional[object], Optional[callable]]:
if test:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, nothing to do here, but a small note. I think in the next evolution of this code, instead of passing a test field, we'd instead pass all of the fields via unit test that the method might need. For example, if the test parser only parses a specific mock HTML stored in the test files folder, instead of putting the HTML folder path-designating logic within the code pivoting based on whether test is true, we'd instead pass the method the HTML folder path entirely. This reason being, when we test, we want to test the method logic itself rather than test only the test version. LMK if that doesn't make sense.

logger.info(f"Unexpected error: {e}")
return None, None

def get_directories(self, county: str, logger, parse_single_file: bool = False) -> Tuple[str, str]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like parse_single_file is meant for testing purposes. We may move away from using "test" as a field or "parse_single_file" as a second field and more towards passing the method the URL. I think overall, knowing which directories to use or not use may be the only piece the pivots based on whether we're using a mock for testing or in production.


parser_instance, parser_function = self.get_class_and_method(county=county, logger=logger, test=test)

if parser_instance is not None and parser_function is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is probably a carry over from the scraper structure, but Matt had a great point on my code there that ultimately I think that parser_instance and parser_function should be not None by the time it gets here and it probably isn't necessary to check this. Ultimately, this will error out anyway if you try to run None with the fields you're passing it.

for charge_name, severity in CHARGE_SEVERITY.items():
if charge_name in charge:
return severity
return float('inf')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is cool! I had to look this up.

bondsman_rows = bondsman_rows[::-1]
if bondsman_rows[0][0] != "Bondsman":
bondsman_rows = []
def get_case_metadata(self, county: str, case_number: str, case_soup: BeautifulSoup, logger) -> Dict[str, str]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that this was broken out and placed in this bucket. Awesome.

"location": "Unknown"
}

def parse_defendant_rows(self, defendant_rows: List[List[str]], logger) -> Dict[str, str]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a comment. This is the hays-specific parser, and I imagine that this data will be different from one county to the next. I imagine this would be very much county-specific, but as we get into different counties, we'll probably think of creating a helpers functions that works with other functions that don't vary much from county to county. TBD.

logger.info(f"Error getting disposition information: {e}")
return dispositions

def parser_hays(self, county: str, case_number: str, logger, case_soup: BeautifulSoup) -> Dict[str, Dict]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love the formatting of this method and how it's all come together section by section.

@@ -1,16 +1,20 @@
import unittest, sys, os, json, warnings, requests, logging
from unittest.mock import patch, MagicMock, mock_open
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think overall for the unit test additions, I will have to look deeper into using mock library and how it's used here. But otherwise, it's okay to push to main once we can figure out why that one unit test isn't passing.

Great work!

Copy link
Contributor

@nicolassaw nicolassaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work overall. Thanks so much for your work. I've left a few comments below. The only change I'd request is that we fix that one unit test from not passing (looks like there is some folder not being found?), then push this branch again to the repo, we'll confirm that the unit test is passing, then we can approve and merge this PR. Amazing! Thank you!

- Added type specification to methods
- Added logging functionality
- Added error handling
- Refactored into smaller, more readble methods
@newswim newswim added the parser Improving the parsing capabilities label Oct 6, 2024
@newswim newswim merged commit 1257d86 into main Oct 6, 2024
1 of 2 checks passed
@newswim newswim deleted the parser-cleanup branch October 6, 2024 18:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parser Improving the parsing capabilities
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants