Find a mechanism to handle data deleted from the source system #295
Comments
The more I think about this, the more I am sure that I don't want a scraping run to delete things from the database. I think a flag that said something like "not found in source system as of such-and-such date" would be ideal. Thoughts, @jamesturk @jpmckinney?
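A minimal sketch of what such a flag could look like, for illustration only. The `bill` table, the `not_found_as_of` column, and the `flag_missing` helper are all hypothetical names, not pupa's or Open Civic Data's actual schema, and the example uses the standard-library `sqlite3` module instead of pupa's Django models so that it runs on its own.

```python
import sqlite3


def flag_missing(conn, seen_ids, as_of):
    """Flag rows absent from the latest scrape instead of deleting them.

    Hypothetical helper: `bill` and `not_found_as_of` are assumed names.
    Returns the number of rows newly flagged.
    """
    # Stage the IDs seen in this run in a temporary table.
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS seen (id TEXT PRIMARY KEY)")
    conn.execute("DELETE FROM seen")
    conn.executemany("INSERT INTO seen (id) VALUES (?)", [(i,) for i in seen_ids])

    # Flag rows that were not seen, keeping the first date they went missing.
    cur = conn.execute(
        "UPDATE bill SET not_found_as_of = ? "
        "WHERE not_found_as_of IS NULL AND id NOT IN (SELECT id FROM seen)",
        (as_of,),
    )
    newly_flagged = cur.rowcount

    # Clear the flag on rows that reappeared in the source system.
    conn.execute(
        "UPDATE bill SET not_found_as_of = NULL WHERE id IN (SELECT id FROM seen)"
    )
    return newly_flagged


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bill (id TEXT PRIMARY KEY, title TEXT, not_found_as_of TEXT)"
)
conn.executemany(
    "INSERT INTO bill (id, title) VALUES (?, ?)",
    [("b1", "A"), ("b2", "B"), ("b3", "C")],
)

# The source system returned b1 and b3 only; b2 is flagged, not deleted.
flagged = flag_missing(conn, ["b1", "b3"], "2024-01-15")
rows = dict(conn.execute("SELECT id, not_found_as_of FROM bill"))
total = conn.execute("SELECT COUNT(*) FROM bill").fetchone()[0]
```

Nothing is removed from the table, so deciding what to do with flagged rows can stay a separate, deliberate step.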
Sure, that's consistent with @reginafcompton's suggestion of "a means to somehow mark information as having been deleted from the source system."
I agree that a flag or error would be less hazardous than deleting things outright. @fgregg
I've been thinking about this a bit, and how it relates to #65. I think it'd be best to have a sort of two-step process:
I thought of a few options on how to adjust the import: A) add a
(A) sounds good, and if performance is an issue, we can just collect IDs to touch, and then set the field for all the retrieved but not-updated records in a single UPDATE statement. Instead of
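The "collect IDs, then one UPDATE" idea could be sketched roughly as follows. This is an assumption-laden illustration, not pupa's importer code: the `person` table, the `last_seen` column, and the `touch_seen` helper are hypothetical, and `sqlite3` stands in for the real database so the example is self-contained.

```python
import sqlite3


def touch_seen(conn, ids, run_time):
    """Set last_seen for every collected ID in one UPDATE statement.

    Hypothetical helper: `person` and `last_seen` are assumed names.
    Returns the number of rows touched.
    """
    ids = list(ids)
    if not ids:
        return 0
    # One "?" placeholder per ID; the values themselves are bound as
    # parameters and never interpolated into the SQL string.
    placeholders = ", ".join("?" for _ in ids)
    cur = conn.execute(
        f"UPDATE person SET last_seen = ? WHERE id IN ({placeholders})",
        [run_time, *ids],
    )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id TEXT PRIMARY KEY, name TEXT, last_seen TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [
        ("p1", "Ada", "2024-01-01"),
        ("p2", "Grace", "2024-01-01"),
        ("p3", "Edsger", "2024-01-01"),
    ],
)

# IDs retrieved during the import that needed no other update.
unchanged_ids = ["p1", "p2"]
touched = touch_seen(conn, unchanged_ids, "2024-02-01")

# Anything not touched in this run still carries the older timestamp.
stale = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM person WHERE last_seen < ? ORDER BY id", ("2024-02-01",)
    )
]
```

For very large runs the ID list would likely need to be sent in chunks, since databases limit how many bound parameters one statement may carry.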
This issue surfaced again in the context of memberships: Metro-Records/la-metro-councilmatic#746, Metro-Records/la-metro-councilmatic#758. I think adding the
@hancush I think the idea of a
There are some good opportunities to connect this to the scraping session reports. Line 69 in 8087e22
Y'all! @antidipyramid has implemented this solution:
We should probably cut new versions of both pupa and OCD to cleanly identify when this mechanism became available.
Looks like @fgregg and @derekeder are the maintainers on PyPI. https://pypi.org/project/pupa/ https://pypi.org/project/opencivicdata/
Occasionally, the systems we regularly scrape (e.g., the Legistar web API) delete information about an event, bill, or person. Data may be deleted because it is inaccurate, because it is test data, or because it is a duplicate (that is, an item may be deleted, then recreated later).
Currently, Pupa does not have an effective mechanism for (1) removing source-deleted data, or (2) indicating that the data has been deleted.
It would be useful to us (and probably others) to identify a way to accomplish one of these possibilities: a mechanism to eliminate source-deleted information, or a means to somehow mark information as having been deleted from the source system.
@jamesturk let's discuss the preferred behavior with @fgregg first, and then consider implementation.