
adding updated scripts to fetch the data and convert it into json #8

Merged
8 commits merged into main on Oct 1, 2024

Conversation

@yashBhosale (Collaborator) commented Sep 17, 2024

Working on a new script to fetch all of the election data. This is not the finished script, but it's about 90% there. Finishing touches needed. Will close #1 when complete.

@yashBhosale (Collaborator Author) commented:

This code could likely be pared down a bit - I was mostly putting the tracks down as I was driving over them. Ideally I can finish this in the morning and we can merge it.

pairs = [(contest, race) for contest, c_info in results_metadata.items() for race in c_info["races"]]
#print(len(pairs))
#async with ClientSession() as cs:
# async with cs.get("https://chicagoelections.gov/elections/results/156/download?contest=15&ward=&precinct=") as resp:
@yashBhosale (Collaborator Author) commented:

async def fetch(race: int, contest: int, cs: ClientSession):
    # fetch one contest's spreadsheet and hand the bytes to the parser
    resp = await cs.get(f"https://chicagoelections.gov/elections/results/{race}/download?contest={contest}&ward=&precinct=")
    return book_pandas(await resp.content.read(), race, contest)

We can `asyncio.gather` this, and it should reduce time wasted on I/O, which is likely the biggest performance killer.
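A sketch of that gather, using a stand-in coroutine in place of the real aiohttp call so it runs without a network (`build_url` is a hypothetical helper; in the actual script `fetch` would use the shared `ClientSession`):

```python
import asyncio

def build_url(race: int, contest: int) -> str:
    # hypothetical helper; mirrors the download URL used elsewhere in the script
    return (f"https://chicagoelections.gov/elections/results/{race}"
            f"/download?contest={contest}&ward=&precinct=")

async def fetch(race: int, contest: int) -> str:
    # stand-in for the real cs.get(...) + resp.content.read()
    await asyncio.sleep(0)  # simulates the network wait
    return build_url(race, contest)

async def fetch_all(pairs):
    # every request is in flight at once; results come back in input order
    return await asyncio.gather(*(fetch(race, contest) for contest, race in pairs))

results = asyncio.run(fetch_all([(15, 156), (16, 156)]))
```

Because `gather` overlaps all the waits, total wall time is roughly the slowest single request rather than the sum of all of them.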

@yashBhosale (Collaborator Author) commented:
also need to pool this
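One way to pool it, sketched with a semaphore capping concurrency (the limit of 10 is an assumption to tune against the server; `job` is a stand-in for the real fetch coroutine):

```python
import asyncio

MAX_IN_FLIGHT = 10  # assumed cap; tune to what the elections site tolerates

async def bounded(sem: asyncio.Semaphore, coro):
    # at most MAX_IN_FLIGHT coroutine bodies run at once
    async with sem:
        return await coro

async def run_pooled(coros):
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)
    return await asyncio.gather(*(bounded(sem, c) for c in coros))

async def job(n: int) -> int:
    # stand-in for a single fetch
    await asyncio.sleep(0)
    return n * 2

out = asyncio.run(run_pooled([job(i) for i in range(5)]))
```

aiohttp also limits connections per `ClientSession` on its own, so this explicit semaphore is mainly useful to set a politeness cap lower than the default.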

@derekeder (Member) commented:
We should also cache the request responses; these will likely never change.
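A minimal sketch of that cache, keyed on the URL and kept on disk (`CACHE_DIR` and the one-file-per-URL layout are assumptions):

```python
import hashlib
import os

CACHE_DIR = "request_cache"  # hypothetical location

def _cache_path(url: str) -> str:
    # name each file by a hash so any URL is a safe filename
    return os.path.join(CACHE_DIR, hashlib.sha256(url.encode()).hexdigest())

def get_cached(url: str):
    """Return the cached body for url, or None on a miss."""
    path = _cache_path(url)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()
    return None

def put_cached(url: str, body: bytes) -> None:
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(_cache_path(url), "wb") as f:
        f.write(body)
```

The fetch path would then check `get_cached(url)` first and only hit the network (and `put_cached`) on a miss, so re-running the scraper after the first pass costs no requests.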




def book_pandas(book: BytesIO):
@yashBhosale (Collaborator Author) commented:
def book_pandas(book, race, contest):

@derekeder (Member) commented:
is this a todo note for later?

@yashBhosale (Collaborator Author) replied:
Yeah

cols[i] = f"{cols[i-1]} %"
subtables[ward] = pd.DataFrame(sub_table[1:], columns=sub_table[0]).set_index('Precinct').to_dict(orient="index")
cur_row = next(rows, None)
dump(subtables, open("subtable.json", 'w'))
@yashBhosale (Collaborator Author) commented:
`f"{race}_{contest}.json"` or something. Oh, and probably pickle for prod.
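That naming scheme might look like the following sketch (`dump_results` and the flat file layout are hypothetical):

```python
import json

def dump_results(subtables: dict, race: int, contest: int) -> str:
    # one JSON file per (race, contest) pair instead of a single subtable.json
    fname = f"{race}_{contest}.json"
    with open(fname, "w") as f:
        json.dump(subtables, f)
    return fname

written = dump_results({"Ward 1": {"1": {"Votes": 120}}}, 156, 15)
```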

except StopIteration:
pass
cols = sub_table[0]
print(sub_table)
@yashBhosale (Collaborator Author) commented:
Remove

cols = sub_table[0]
print(sub_table)
print(cols)
for i in range(len(cols)):
@yashBhosale (Collaborator Author) commented:
cols = [cols[i-1] + " %" if col == '%' else col for i, col in enumerate(cols)]
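A runnable version of that rename (the sample headers are assumed; the real sheets alternate a candidate column and a bare `%` column):

```python
# raw spreadsheet headers alternate a name column and a bare '%' column
cols = ["Precinct", "Candidate A", "%", "Candidate B", "%"]

# give each bare '%' the name of the column before it
cols = [cols[i - 1] + " %" if col == "%" else col for i, col in enumerate(cols)]
```

Note the comprehension reads `cols[i - 1]` from the original list (the name isn't rebound until the comprehension finishes), which is fine here because two `%` columns never appear back to back.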

from json import load, dump
from asyncio import run


@yashBhosale (Collaborator Author) commented:

We need to check for edge cases.

#async with ClientSession() as cs:
# async with cs.get("https://chicagoelections.gov/elections/results/156/download?contest=15&ward=&precinct=") as resp:
# book_pandas(await resp.content.read())
book_pandas(open("/home/yash/Downloads/download.xls", "rb").read())
@yashBhosale (Collaborator Author) commented:

Oops. Remove.

@derekeder (Member) left a comment:

I did an initial review. I think for this to be ready to bring in we'll want to have it:

  • save the results in the same place as the old script
  • remove the unused scraper code that this replaces
  • update the readme as necessary

We can do it here, or in a future PR, but the elections.json file seems like something we could scrape and generate dynamically based on the HTML on this page: https://chicagoelections.gov/elections/results
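A rough sketch of that scrape using only the standard library. The `/elections/results/<id>` link pattern is an assumption inferred from the download URLs used above; the real markup on that page would need checking:

```python
from html.parser import HTMLParser

class ElectionLinkParser(HTMLParser):
    # collects (href, link text) for anchors that look like election result pages
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if "/elections/results/" in href:
                self._href = href

    def handle_data(self, data):
        if self._href is not None and data.strip():
            self.links.append((self._href, data.strip()))
            self._href = None

# sample markup standing in for https://chicagoelections.gov/elections/results
sample = '<a href="/elections/results/156">2023 Municipal General</a>'
parser = ElectionLinkParser()
parser.feed(sample)
```

The collected id/name pairs could then be used to regenerate elections.json instead of maintaining it by hand.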





@yashBhosale (Collaborator Author) commented Sep 24, 2024

So this code works and is largely performant (although I have some thoughts about how to improve performance further). I'm not really sure what the desired output format is; right now it just puts everything into one big JSON file. Once I figure that out, it should be trivial to convert.

@yashBhosale (Collaborator Author) commented:
Never mind, I did figure it out. It's trivial enough that I'll do it in the morning and hopefully this should be good enough to merge.

@derekeder (Member) commented:

@yashBhosale sounds good! we can plan to look at it tonight

netlify bot commented Sep 25, 2024

Deploy Preview for chicago-election-archive failed.

  • Latest commit: b6caed1
  • Latest deploy log: https://app.netlify.com/sites/chicago-election-archive/deploys/66fbe429d26094000929dbc0

@yashBhosale yashBhosale merged commit fd38ec8 into main Oct 1, 2024
0 of 4 checks passed
Development

Successfully merging this pull request may close these issues.

update Makefile and scrape_table.py to pull from updated data source
2 participants