
Find a mechanism to handle data deleted from the source system #295

Open
reginafcompton opened this issue Sep 22, 2017 · 10 comments
@reginafcompton

Occasionally, the systems we regularly scrape (e.g., the Legistar web API) delete information about an event, bill, or person. Data may be deleted because it is inaccurate, it is test data, or it duplicates another record - that is, an item may be deleted, then recreated later.

Currently, Pupa does not have an effective mechanism for (1) removing source-deleted data, or (2) indicating that the data has been deleted.

It would be useful to us (and probably others) to accomplish one of these possibilities: a mechanism to eliminate source-deleted information, or a means to mark information as having been deleted from the source system.

@jamesturk let's discuss the preferred behavior with @fgregg, and then we can consider implementation later.

@fgregg
Contributor

fgregg commented Oct 3, 2017

The more I think about this, the more I am sure that I don't want a scraping run to delete things from the database.

I think a flag that says something like "not found in source system as of such-and-such date" would be ideal.

Thoughts, @jamesturk @jpmckinney?

@jpmckinney
Member

Sure, that's consistent with @reginafcompton's suggestion of "a means to somehow mark information as having been deleted from the source system."

@reginafcompton
Author

I agree that a flag or error would be less hazardous than deleting things outright. @fgregg

@jamesturk
Member

I've been thinking about this a bit, and how it relates to #65

I think it'd be best to have a sort of two-step process:

  • at import time somehow mark items that are 'missing'
  • a command or admin view that cleans up the data after review (and it could also potentially retire legislators instead of deleting them)

I thought of a few options on how to adjust the import:

A) add a last_seen field (name TBD), similar to last_updated but updated even in the case where nothing changed

It'd then be possible to programmatically determine which things haven't been picked up in the most recent scrapes (a rough sketch follows the pros/cons below).

Pros:

  • hopefully a relatively simple implementation: roughly just introducing an update statement whenever an item is found to be equal
  • perhaps more flexible than a simple boolean flag

Cons:

  • probably a bit expensive, as it'd involve writing to every object that is seen (whereas right now we short-circuit and don't save any changes)
  • cleanup command would be a bit more complex, you'd need to specify things like 'delete all bills in 2017 session that haven't been seen in a scrape in December' instead of just 'delete all bills that are pending deletion'
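
A minimal sketch of what option A might look like, using a stripped-down stand-in for the real Django models (Bill's fields and the stale_bills helper are invented for illustration, not pupa's actual API):

from django.db import models
from django.utils import timezone


class Bill(models.Model):
    # Stand-in model; the real one lives in opencivicdata's Django models.
    identifier = models.CharField(max_length=100)
    session = models.CharField(max_length=100)
    # Touched on every import, even when nothing else changed;
    # updated_at, by contrast, only moves when the data differs.
    last_seen = models.DateTimeField(default=timezone.now)


def stale_bills(session, cutoff):
    # Bills in `session` that no scrape has touched since `cutoff`.
    return Bill.objects.filter(session=session, last_seen__lt=cutoff)

The cleanup command (or admin view) could then review whatever stale_bills returns, e.g. stale_bills("2017", december_scrape_start) - which is exactly the 'delete all 2017 bills not seen in a December scrape' query from the cons above.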

B) add a deleted_at_source flag that is set to True when the object goes missing

Pros:

  • simplest implementation of cleanup command, just include anything w/ the flag for review

Cons:

  • implementation might be fairly tricky, as we'd need to figure out if an item is 'expected' in a scrape, for example:

pupa update nc people # shouldn't mark bills as deleted
pupa update nc bills session=2017 # shouldn't mark 2016 bills deleted
pupa update nc votes chamber=upper # shouldn't mark house votes deleted

so we'd need a way for those parameters (currently handled entirely by the scraper) to also influence which objects are considered for deletion.
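
For concreteness, a hypothetical sketch of option B's bookkeeping; mark_missing, scope_filters, and deleted_at_source are invented names, and getting the scraper's CLI parameters into scope_filters is precisely the tricky part described above:

def mark_missing(model, scope_filters, seen_ids):
    # Flag everything inside the scrape's scope that this run did not see.
    model.objects.filter(**scope_filters).exclude(id__in=seen_ids).update(
        deleted_at_source=True
    )

# e.g. after `pupa update nc bills session=2017`:
# mark_missing(Bill, {"session": "2017"}, ids_seen_this_run)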

--

Personally, I'm leaning towards A, but I'm open to hearing arguments & other options. Also, if someone wants to take a swing at implementing either, that'd be a big factor in the decision, but I might give an implementation of last_seen a shot at some point soon.

@jpmckinney
Member

(A) sounds good, and if performance is an issue, we can just collect IDs to touch, and then set the field for all the retrieved but not-updated records in a single UPDATE statement.

Instead of last_seen, maybe last_imported, which is more explicitly what happened.
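
A sketch of that bulk-update idea, reusing the stand-in Bill model from the earlier sketch and the last_imported name (QuerySet.update() issues a single UPDATE statement):

from django.utils import timezone

# Accumulated during import wherever we currently short-circuit on "no change".
unchanged_ids = []

Bill.objects.filter(id__in=unchanged_ids).update(last_imported=timezone.now())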

@hancush
Contributor

hancush commented Sep 14, 2021

This issue surfaced again in the context of memberships: Metro-Records/la-metro-councilmatic#746, Metro-Records/la-metro-councilmatic#758. I think adding the last_imported timestamp presents a better (and more broadly useful) solution than alternatives such as revising import to (optionally) clear existing data before bringing in new data. If membership deletion is sufficiently prevalent, we'll pick this up.

@antidipyramid
Collaborator

@hancush I think the idea of a last_imported timestamp makes a lot of sense. It seems like we'd be looking at creating a new command to clear out objects that haven't been updated (a sketch of such a command follows). The schedule (periodically, or following every scrape) could be determined by the specific deployment (for LA Metro, in the Dashboard). It makes sense to try this out with Persons, as those changes have caused many problems in the past, and then perhaps expand to other models from there.
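
A hypothetical Django management command (not something pupa ships) sketching that cleanup step, assuming a last_imported timestamp exists on the opencivicdata Person model:

from datetime import timedelta

from django.core.management.base import BaseCommand
from django.utils import timezone

from opencivicdata.core.models import Person  # last_imported on Person is assumed


class Command(BaseCommand):
    help = "Review or delete people missed by recent imports"

    def add_arguments(self, parser):
        parser.add_argument("--days", type=int, default=7)
        parser.add_argument("--delete", action="store_true")

    def handle(self, *args, **options):
        cutoff = timezone.now() - timedelta(days=options["days"])
        stale = Person.objects.filter(last_imported__lt=cutoff)
        self.stdout.write("%d people not imported since %s" % (stale.count(), cutoff))
        if options["delete"]:
            stale.delete()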

@fgregg
Contributor

fgregg commented Nov 2, 2022

There are some good opportunities to connect this to the scraping session reports.

class SessionDataQualityReport(models.Model):
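
One hypothetical way to make that connection would be a per-session count of records that went missing, added to the report model referenced above (the field name here is invented):

from django.db import models


class SessionDataQualityReport(models.Model):
    # ...existing quality-metric fields elided...
    bills_missing_from_source = models.PositiveIntegerField(default=0)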

@hancush
Contributor

hancush commented Mar 16, 2023

Y'all! @antidipyramid has implemented this solution:

We should probably cut new versions of both pupa and OCD to cleanly identify when this mechanism became available.

@jpmckinney
Member

Looks like @fgregg and @derekeder are the maintainers on PyPI. https://pypi.org/project/pupa/ https://pypi.org/project/opencivicdata/
