Investigate making uuids deterministic/reproducible #238

Closed
patcon opened this issue May 25, 2016 · 4 comments
Comments

@patcon
Contributor

patcon commented May 25, 2016

Context: https://opencivicdata.slack.com/archives/pupa/p1464187522000022

@jpmckinney
Member

The UUID, right now, is a random way of identifying a record. Pupa also has a separate, deterministic way of identifying a record, using get_object. The deterministic way is used to resolve a scraped record to a stored record. The random way is simply used to refer to the record.

However, sometimes, the resolution is incorrect. Two records sometimes need to be merged or unmerged. We can add new properties (like birth date) to disambiguate two people, for example.

The nice thing about the random way is that it is guaranteed to be unique. If we make UUIDs deterministic, then we introduce a new problem: we can have the UUIDs be the same when the records aren't the same. Now we need to disambiguate both the UUID and the full record.

I don't see how UUIDs can be made deterministic without introducing a new way for incorrect collisions to occur.
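
To make the collision concern concrete, here is a minimal sketch using only Python's standard `uuid` module. It is not how Pupa generates or resolves IDs; the namespace, the `deterministic_id` helper, and the two sample people are assumptions made up for illustration.

```python
import uuid

# Hypothetical namespace for this illustration only; Pupa does not define it.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


def random_id():
    """Random (version 4) UUID: unrelated to the record's contents."""
    return str(uuid.uuid4())


def deterministic_id(*fields):
    """Name-based (version 5) UUID: the same fields always give the same ID."""
    return str(uuid.uuid5(NAMESPACE, "|".join(fields)))


# Two different people who happen to share a name (invented sample data).
person_a = {"name": "Alice Smith", "birth_date": "1960-01-01"}
person_b = {"name": "Alice Smith", "birth_date": "1985-06-15"}

# Derived from the name alone, the two distinct people get the same ID.
name_only_collides = (
    deterministic_id(person_a["name"]) == deterministic_id(person_b["name"])
)

# Adding a disambiguating property (birth date) separates them again,
# but only if that property is known and stays stable between scrapes.
disambiguated = (
    deterministic_id(person_a["name"], person_a["birth_date"])
    != deterministic_id(person_b["name"], person_b["birth_date"])
)

print(name_only_collides, disambiguated)
```

Whatever fields feed the deterministic ID have the same ambiguity as the fields used to resolve the record, so the ID inherits every incorrect match instead of avoiding it.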

@jamesturk
Member

Agree with James M. This seems like a mistake and not what the UUIDs are intended for, and I don't think it'd buy much (as the scraped UUIDs are not intended to persist in any way).


@patcon
Contributor Author

patcon commented May 25, 2016

Ah, ok, gotcha. I misunderstood the short-term nature of the UUIDs. I knew the cdn scrapers were treating them as very transient, but I didn't realize that was so purposeful. Thanks all

@patcon patcon closed this as completed May 25, 2016

@jpmckinney
Member

jpmckinney commented May 25, 2016

I think the pain point from the Slack conversation is best resolved by not clearing the DB before each scrape. However, if we stop clearing the DB in the scrapers for Represent, that creates new issues - namely, that Pupa has no way of automatically setting an end date on the memberships of representatives who were in a past scrape but not the current scrape.

Since Represent doesn't care about UUIDs, clearing the DB is fine for its use case.

For a single jurisdiction project (like a Councilmatic instance), Represent's pain point is not really felt, since it's easy to manually set an end date on a past membership within a single jurisdiction, which will occur rarely. Represent deals with 100 jurisdictions, so anything manual is a major maintenance burden, as things that are rare in one jurisdiction become common when you are managing 100.
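
For illustration, the automatic step that is missing could look roughly like the sketch below. This is not Pupa's API: `close_missing_memberships`, the dictionary layout, and the sample names are all hypothetical, and it assumes the current scrape is complete (a partial or failed scrape would wrongly end memberships).

```python
from datetime import date


def close_missing_memberships(stored, scraped_names, today):
    """Set end_date on open memberships whose person is absent from the scrape.

    `stored` is a list of dicts with "person" and "end_date" keys (hypothetical
    layout); `scraped_names` is the set of names seen in the current scrape.
    Returns the names whose memberships were closed.
    """
    closed = []
    for membership in stored:
        is_open = membership["end_date"] is None
        if is_open and membership["person"] not in scraped_names:
            membership["end_date"] = today.isoformat()
            closed.append(membership["person"])
    return closed


# Invented sample data: two open memberships from a past scrape.
stored = [
    {"person": "Alice Smith", "end_date": None},
    {"person": "Bob Jones", "end_date": None},
]

# Only Alice appears in the current scrape, so Bob's membership is ended.
closed = close_missing_memberships(stored, {"Alice Smith"}, date(2016, 5, 25))
print(closed)
```

Doing this by hand is manageable for one jurisdiction; across 100 it is the kind of rule that would need to run on every scrape.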
