You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
A good prior distribution on person names (first names, last name, etc.) -- but many other types of names including place names -- seems important for cases when it is useful to model the possibility of typos occurring in names. If we model an observed name field using a typo model, without an accurate name prior it is easy for the model to infer that a correctly spelled name is actually a version of another name but with typos introduced. I encountered this when writing a simple model of first names. Here is a minimal example:
PClean.@model CustomerModel begin@class FirstNames begin
name ~StringPrior(1, 60, all_given_names)
end@class Person begin
given_name ~ FirstNames
end;
@class Obs beginbegin
person ~ Person
given_name ~AddTypos(person.given_name.name)
endend;
end;
query =@query CustomerModel.Obs [
given_name person.given_name.name given_name
];
observations = [ObservedDataset(query, df)]
config = PClean.InferenceConfig(5, 2; use_mh_instead_of_pg=true)
@timebegin
tr =initialize_trace(observations, config);
run_inference!(tr, config)
end
Coming up with a good name prior seems like a very nontrivial task. Intuitively, if a human were performing this task, they would rely on their prior experience with names, including common spelling and translation / transliterations and knowledge of the variety closely related names with common phonetic origins, etc. A name expert would have a much more accurate name prior than a random person. Also, the statistics of names (frequency distributions, etc.) might vary widely based on the population or sub-population. One longer-term goal could be to develop an accurate name prior that represents the knowledge of a "global name expert".
Intermediate steps could be to
Train a more accurate n-gram text model that is trained on a data set of names.
Train or find an existing deep generative model for names.
Other steps that don't involve coming up with a name prior, but might mitigate the issue mentioned above might be:
Come up with a more precise typo model, or an approximate typo model that somehow alleviates the issue (e.g. by upper-bounding the number of typos in a name). (This should be a separate issue).
Use a large data set of names a directly-observed table in the model. This is equivalent to using a name prior that is a frequency-weighted distribution over these names. (A likely issue with that approach is that if a name is not observed at least once within this data set, then it might be likely to be corrected to name that is).
Change the Pitman-Yor parameters for the underlying name table to better match statistics of real names, and more generally admit more rare names.
Also, a review of the potential consequences of a biased name prior, and approaches to reduce bias in the name priors, and/or mitigate downstream consequences of this bias, could be valuable.
The text was updated successfully, but these errors were encountered:
A good prior distribution on person names (first names, last name, etc.) -- but many other types of names including place names -- seems important for cases when it is useful to model the possibility of typos occurring in names. If we model an observed name field using a typo model, without an accurate name prior it is easy for the model to infer that a correctly spelled name is actually a version of another name but with typos introduced. I encountered this when writing a simple model of first names. Here is a minimal example:
Coming up with a good name prior seems like a very nontrivial task. Intuitively, if a human were performing this task, they would rely on their prior experience with names, including common spelling and translation / transliterations and knowledge of the variety closely related names with common phonetic origins, etc. A name expert would have a much more accurate name prior than a random person. Also, the statistics of names (frequency distributions, etc.) might vary widely based on the population or sub-population. One longer-term goal could be to develop an accurate name prior that represents the knowledge of a "global name expert".
Intermediate steps could be to
Train a more accurate n-gram text model that is trained on a data set of names.
Train or find an existing deep generative model for names.
Other steps that don't involve coming up with a name prior, but might mitigate the issue mentioned above might be:
Come up with a more precise typo model, or an approximate typo model that somehow alleviates the issue (e.g. by upper-bounding the number of typos in a name). (This should be a separate issue).
Use a large data set of names a directly-observed table in the model. This is equivalent to using a name prior that is a frequency-weighted distribution over these names. (A likely issue with that approach is that if a name is not observed at least once within this data set, then it might be likely to be corrected to name that is).
Change the Pitman-Yor parameters for the underlying name table to better match statistics of real names, and more generally admit more rare names.
Also, a review of the potential consequences of a biased name prior, and approaches to reduce bias in the name priors, and/or mitigate downstream consequences of this bias, could be valuable.
The text was updated successfully, but these errors were encountered: