Add a "clean" mode to combat duplicate entries in Legistar at time of scrape #294

Closed
reginafcompton opened this issue Sep 21, 2017 · 2 comments

reginafcompton commented Sep 21, 2017

Recently, LA Metro had "duplicate" events in Legistar (i.e., same name and time, but different EventId):

http://webapi.legistar.com/v1/metro/events/1265
http://webapi.legistar.com/v1/metro/events/1259 (now defunct)

The scrapers for Metro run multiple times per day, and at the time of a scrape, both events were present.

We use the EventId to create the unique instance of the Identifier class, so the importer had no way of knowing that these were the same event.
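
For illustration, here is roughly what the duplication looks like. The field names follow the Legistar web API, but the body name, date, and time are made up for this sketch:

```python
# Illustrative only: field names follow the Legistar web API, but the values
# (body name, date, time) are invented for this example.
event_a = {"EventId": 1259, "EventBodyName": "Board of Directors",
           "EventDate": "2017-09-28T00:00:00", "EventTime": "9:00 AM"}
event_b = {"EventId": 1265, "EventBodyName": "Board of Directors",
           "EventDate": "2017-09-28T00:00:00", "EventTime": "9:00 AM"}

# Keying on EventId, as the identifier effectively does, keeps both records:
by_id = {e["EventId"]: e for e in (event_a, event_b)}
assert len(by_id) == 2

# Keying on (body, date, time) would collapse them into one:
by_name_and_time = {(e["EventBodyName"], e["EventDate"], e["EventTime"]): e
                    for e in (event_a, event_b)}
assert len(by_name_and_time) == 1
```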

Let's add a "clean" scrape mode to pupa: a "clean" scrape would remove data that no longer appears in the Legistar web API.
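
To make the idea concrete, here is a very rough sketch of what a "clean" pass might do. This is not pupa's actual API: `clean_events()` and its arguments are hypothetical, and API paging is ignored for brevity.

```python
import requests

def clean_events(client, stored_event_ids):
    """Return stored EventIds that no longer appear in the Legistar web API."""
    url = "http://webapi.legistar.com/v1/{}/events".format(client)
    live_ids = {event["EventId"] for event in requests.get(url).json()}
    return stored_event_ids - live_ids

# Anything returned here would be a candidate for removal (or archiving):
stale = clean_events("metro", {1259, 1265})
```

Whether stale records should actually be deleted, or merely flagged, is part of what would need discussion.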

jamesturk (Member) commented

Hmm, right now pupa doesn't delete any top-level objects, so we'd need to think carefully about how this would be handled. I'm open to discussion/proposals, but I tend to think we should favor some other mechanism here.

reginafcompton (Author) commented

Thanks for the reply, @jamesturk, and I agree this is something we'd want to think about carefully before implementation. I've opened a new issue, which broadens our conversation: #295
