
Find a mechanism to handle data deleted from the source system #295

Open
reginafcompton opened this issue Sep 22, 2017 · 10 comments
@reginafcompton

Occasionally, the systems we regularly scrape (e.g., the Legistar web API) delete information about an event, bill, or person. Data may be deleted because it is inaccurate, it is test data, or it duplicates another record - that is, an item may be deleted, then recreated later.

Currently, Pupa does not have an effective mechanism for (1) removing source-deleted data, or (2) indicating that the data has been deleted.

It would be useful to us (and probably others) to accomplish one of these possibilities: a mechanism to eliminate source-deleted information, or a means to mark information as having been deleted from the source system.

@jamesturk let's discuss the preferred behavior with @fgregg, and then we can consider implementation later.

@fgregg
Contributor

fgregg commented Oct 3, 2017

The more I think about this, the more I am sure that I don't want a scraping run to delete things from the database.

I think a flag that says something like "not found in source system as of such-and-such date" would be ideal.

Thoughts, @jamesturk @jpmckinney?

@jpmckinney
Member

Sure, that's consistent with @reginafcompton's suggestion of "a means to somehow mark information as having been deleted from the source system."

@reginafcompton
Author

I agree that a flag or error would be less hazardous than deleting things outright. @fgregg

@jamesturk
Member

I've been thinking about this a bit, and how it relates to #65

I think it'd be best to have a sort of two-step process:

  • at import time somehow mark items that are 'missing'
  • a command or admin view that cleans up the data after review (and it could also potentially retire legislators instead of deleting them)

I thought of a few options on how to adjust the import:

A) add a last_seen field (name TBD), similar to last_updated but updated even in the case where nothing changed

It'd then be possible to programmatically determine which things haven't been picked up in the most recent scrapes (a rough sketch follows the pros/cons below).

Pros:

  • hopefully a relatively simple implementation: roughly just introducing an update statement whenever an item is found to be equal
  • perhaps more flexible than a simple boolean flag

Cons:

  • probably a bit expensive, as it'd involve writing to every object that is seen (whereas right now we short-circuit and don't save any changes)
  • cleanup command would be a bit more complex, you'd need to specify things like 'delete all bills in 2017 session that haven't been seen in a scrape in December' instead of just 'delete all bills that are pending deletion'
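
A minimal sketch of what option A might look like, using a stripped-down stand-in for the real Django models (Bill's fields and the stale_bills helper are invented for illustration, not pupa's actual API):

from django.db import models
from django.utils import timezone


class Bill(models.Model):
    # Stand-in model; the real one lives in opencivicdata's Django models.
    identifier = models.CharField(max_length=100)
    session = models.CharField(max_length=100)
    # Touched on every import, even when nothing else changed;
    # updated_at, by contrast, only moves when the data differs.
    last_seen = models.DateTimeField(default=timezone.now)


def stale_bills(session, cutoff):
    # Bills in `session` that no scrape has touched since `cutoff`.
    return Bill.objects.filter(session=session, last_seen__lt=cutoff)

The cleanup command (or admin view) could then review whatever stale_bills returns, e.g. stale_bills("2017", december_scrape_start) - which is exactly the 'delete all 2017 bills not seen in a December scrape' query from the cons above.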

B) add a deleted_at_source flag that is set to True when the object goes missing

Pros:

  • simplest implementation of cleanup command, just include anything w/ the flag for review

Cons:

  • implementation might be fairly tricky, as we'd need to figure out if an item is 'expected' in a scrape, for example:

pupa update nc people # shouldn't mark bills as deleted
pupa update nc bills session=2017 # shouldn't mark 2016 bills deleted
pupa update nc votes chamber=upper # shouldn't mark house votes deleted

so we'd need a way for those parameters (currently handled entirely by the scraper) to also influence which objects are considered for deletion.
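
For concreteness, a hypothetical sketch of option B's bookkeeping; mark_missing, scope_filters, and deleted_at_source are invented names, and getting the scraper's CLI parameters into scope_filters is precisely the tricky part described above:

def mark_missing(model, scope_filters, seen_ids):
    # Flag everything inside the scrape's scope that this run did not see.
    model.objects.filter(**scope_filters).exclude(id__in=seen_ids).update(
        deleted_at_source=True
    )

# e.g. after `pupa update nc bills session=2017`:
# mark_missing(Bill, {"session": "2017"}, ids_seen_this_run)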

--

Personally, I'm leaning towards A, but I'm open to hearing arguments & other options. Also, if someone wants to take a swing at implementing either, that'd be a big factor in the decision, but I might give an implementation of last_seen a shot at some point soon.

@jpmckinney
Member

(A) sounds good, and if performance is an issue, we can just collect IDs to touch, and then set the field for all the retrieved but not-updated records in a single UPDATE statement.

Instead of last_seen, maybe last_imported, which is more explicitly what happened.
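
A sketch of that bulk-update idea, reusing the stand-in Bill model from the earlier sketch and the last_imported name (QuerySet.update() issues a single UPDATE statement):

from django.utils import timezone

# Accumulated during import wherever we currently short-circuit on "no change".
unchanged_ids = []

Bill.objects.filter(id__in=unchanged_ids).update(last_imported=timezone.now())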

@hancush
Contributor

hancush commented Sep 14, 2021

This issue surfaced again in the context of memberships: Metro-Records/la-metro-councilmatic#746, Metro-Records/la-metro-councilmatic#758. I think adding the last_imported timestamp presents a better (and more broadly useful) solution than alternatives such as revising import to (optionally) clear existing data before bringing in new data. If membership deletion is sufficiently prevalent, we'll pick this up.

@antidipyramid
Collaborator

@hancush I think the idea of a last_imported timestamp makes a lot of sense. It seems like we'd be looking at creating a new command to clear out objects that haven't been updated (a sketch of such a command follows). The schedule (periodically, or following every scrape) could be determined by the specific deployment (for LA Metro, in the Dashboard). It makes sense to try this out with Persons, as those changes have caused many problems in the past, and then perhaps expand to other models from there.
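
A hypothetical Django management command (not something pupa ships) sketching that cleanup step, assuming a last_imported timestamp exists on the opencivicdata Person model:

from datetime import timedelta

from django.core.management.base import BaseCommand
from django.utils import timezone

from opencivicdata.core.models import Person  # last_imported on Person is assumed


class Command(BaseCommand):
    help = "Review or delete people missed by recent imports"

    def add_arguments(self, parser):
        parser.add_argument("--days", type=int, default=7)
        parser.add_argument("--delete", action="store_true")

    def handle(self, *args, **options):
        cutoff = timezone.now() - timedelta(days=options["days"])
        stale = Person.objects.filter(last_imported__lt=cutoff)
        self.stdout.write("%d people not imported since %s" % (stale.count(), cutoff))
        if options["delete"]:
            stale.delete()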

@fgregg
Contributor

fgregg commented Nov 2, 2022

There are some good opportunities to connect this to the scraping session reports.

class SessionDataQualityReport(models.Model):
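
One hypothetical way to make that connection would be a per-session count of records that went missing, added to the report model referenced above (the field name here is invented):

from django.db import models


class SessionDataQualityReport(models.Model):
    # ...existing quality-metric fields elided...
    bills_missing_from_source = models.PositiveIntegerField(default=0)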

@hancush
Contributor

hancush commented Mar 16, 2023

Y'all! @antidipyramid has implemented this solution:

We should probably cut new versions of both pupa and OCD to cleanly identify when this mechanism became available.

@jpmckinney
Member

Looks like @fgregg and @derekeder are the maintainers on PyPI. https://pypi.org/project/pupa/ https://pypi.org/project/opencivicdata/
