
update Makefile and scrape_table.py to pull from updated data source #1

Closed
derekeder opened this issue Aug 28, 2024 · 5 comments · Fixed by #8
Labels
bug Something isn't working
Member

derekeder commented Aug 28, 2024

current Makefile produces the following error:

Traceback (most recent call last):
  File "/home/user/Documents/chicago-elections-archive/scripts/scrape_table.py", line 69, in <module>
    main()
  File "/home/user/Documents/chicago-elections-archive/scripts/scrape_table.py", line 56, in main
    candidates = get_candidates(soup)
  File "/home/user/Documents/chicago-elections-archive/scripts/scrape_table.py", line 9, in get_candidates
    headers = [v.get_text().strip() for v in soup.find("table").find_all("b")]
AttributeError: 'NoneType' object has no attribute 'find_all'

Looks like chicagoelections.gov has updated the pages that produce these results. https://chicagoelections.gov/en/election-results-specifics.asp now returns a 404. This is the new one: https://chicagoelections.gov/elections/results/
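For reference, the AttributeError in the traceback comes from soup.find("table") returning None when the page has no results table (e.g. the new 404 page). A minimal defensive sketch of get_candidates, mirroring the traceback rather than the actual repo code, that fails with a clearer message:

```python
from bs4 import BeautifulSoup

def get_candidates(soup):
    # soup.find() returns None when no <table> exists, e.g. when the
    # old URL now serves a 404 page; guard before calling find_all()
    table = soup.find("table")
    if table is None:
        raise ValueError(
            "no results table found; the page URL or layout may have changed"
        )
    return [b.get_text().strip() for b in table.find_all("b")]
```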

@yashBhosale yashBhosale added the bug Something isn't working label Aug 30, 2024
Collaborator

yashBhosale commented Aug 30, 2024

Some color here: the new flow is that https://chicagoelections.gov/elections/results/ provides a menu of election links. From there you go into an individual election at https://chicagoelections.gov/elections/results/{election_id}, which gives you a menu where you pick a race/ward/precinct, and it then pulls up a table of the election data plus a link to download that data as an Excel file. Incidentally, you can also get the Excel file directly through https://chicagoelections.gov/elections/results/{election_id}?contest={contest_id}&ward={ward_number}&precinct={precinct_number}
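That URL pattern is easy to capture in a small helper. A sketch (the IDs in the usage example are hypothetical placeholders; the query parameter names come from the URL pattern above):

```python
def results_url(election_id, contest=None, ward=None, precinct=None):
    # Base results page for a single election
    url = f"https://chicagoelections.gov/elections/results/{election_id}"
    # Append only the query parameters that were supplied
    params = [
        (name, value)
        for name, value in (("contest", contest), ("ward", ward), ("precinct", precinct))
        if value is not None
    ]
    if params:
        url += "?" + "&".join(f"{name}={value}" for name, value in params)
    return url
```

For example, results_url(251, contest=10, ward=5, precinct=3) would produce the fully parameterized download URL (251, 10, 5, and 3 are made-up values for illustration).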

This puts me between a rock and a hard place: either use Selenium to try to leverage the existing BeautifulSoup code, OR figure out how to fetch/process the Excel file as ergonomically as possible under the circumstances.

@yashBhosale yashBhosale self-assigned this Aug 30, 2024
@derekeder
Member Author

@yashBhosale thanks for digging into this. I think if we can get the data in Excel, let's go for it. Even if it requires a significant rewrite, the new code will be much less brittle and likely a lot simpler.

@yashBhosale
Collaborator

Yeah, that's what I was leaning towards as well. A secondary goal (or maybe an umbrella goal) will be to retire the Makefile altogether, since it's only used for building the data.

@derekeder
Member Author

Cool. The Makefile may still be useful if we want to chain some commands together, but we'll see.

For the Excel data, we can make use of csvkit.

Collaborator

nofurtherinformation commented Sep 11, 2024

Hey, I was digging into this a bit and found some insights. The Excel file the site outputs seems to be malformed in some way; neither pandas.read_excel nor xlrd can read it, whether from disk or from the bytes returned by the server.

An alternative approach could be to just use the HTML that gets returned from the Drupal endpoint, and we don't need Selenium for that. Here's a naive, minimal example:

import requests
from bs4 import BeautifulSoup

election_id = 205
# Drupal AJAX endpoint that returns the results table as HTML
url = f"https://chicagoelections.gov/elections/results/{election_id}?ajax_form=1&_wrapper_format=drupal_ajax"

# Form data for the POST request
form_data = {
    'election_id': election_id,
    'form_id': 'election_results_form',
}

# Send the POST request
response = requests.post(url, data=form_data)

# The response is a JSON list of objects; the last one holds the HTML table
html_table = response.json()[-1]['data']

# Parse the HTML table
table = BeautifulSoup(html_table, 'html.parser')

# TODO: clean up the table, turn it into a pandas DataFrame

This looks something like:

...
<thead>
<tr>
<th>Registered Voters</th>
<th>Ballots Cast</th>
<th>Turnout</th>
</tr>
</thead>
<tbody>
<tr>
<td>1,494,199</td>
<td>452,529</td>
<td>30.29%</td>
</tr>
...
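Picking up the TODO from the snippet, the returned table can be flattened into plain rows with BeautifulSoup alone. A sketch, assuming only the shape of the sample output above (thead with th cells, tbody with td cells):

```python
from bs4 import BeautifulSoup

def table_to_rows(html):
    # Parse a results table into (headers, rows) of plain strings
    soup = BeautifulSoup(html, "html.parser")
    headers = [th.get_text(strip=True) for th in soup.find_all("th")]
    rows = [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in soup.find_all("tr")
        if tr.find("td")  # skip header-only rows
    ]
    return headers, rows
```

The (headers, rows) pair could then be fed straight into csv.writer or pandas.DataFrame for the cleanup step.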
