Removing duplicate words with trailing apostrophe #18
I do recommend going one step further and also removing the trailing apostrophes from words ending in in', without deleting those entries from the dictionary: just strip the trailing apostrophe.
Hey Antoine, thanks for the patch. Do you really think it is needed? I'm sorry, why not just keep all those words in the dictionary?
Examples of missing possessive plural forms:
We have about 20000 entries ending with
What we could do instead is clean only the exact duplicates and extract the remaining non-duplicates for analysis.
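To make the "exact duplicates only" idea concrete, here is a minimal sketch of how the apostrophe entries could be split into the two groups. It is not the script used for this PR; the names `Entry`, `parse` and `partitionApostropheEntries` are hypothetical, and the sample lines are illustrative rather than quoted from cmudict. It assumes the `word phones` / `word(2) phones` line layout.

```swift
/// One dictionary line: a headword and its phones (hypothetical helper type).
struct Entry: Equatable {
    let word: String    // headword without the "(n)" variant suffix
    let phones: String  // pronunciation, e.g. "K AH1 M IH0 N"
}

/// Parses "word phones" or "word(2) phones"; returns nil for lines without a space.
func parse(_ line: String) -> Entry? {
    guard let space = line.firstIndex(of: " ") else { return nil }
    var head = line[..<space]
    if let paren = head.firstIndex(of: "(") { head = head[..<paren] }
    let phones = line[line.index(after: space)...]
    return Entry(word: String(head), phones: String(phones))
}

/// Splits entries whose headword ends in an apostrophe into exact duplicates
/// (same phones also listed under the apostrophe-less headword) and the rest.
func partitionApostropheEntries(_ lines: [String]) -> (duplicates: [String], unique: [String]) {
    // Map each headword to the set of its pronunciations.
    var phonesByWord: [String: Set<String>] = [:]
    for entry in lines.compactMap(parse) {
        phonesByWord[entry.word, default: []].insert(entry.phones)
    }

    var duplicates: [String] = []
    var unique: [String] = []
    for line in lines {
        guard let entry = parse(line), entry.word.hasSuffix("'") else { continue }
        let base = String(entry.word.dropLast())
        if phonesByWord[base]?.contains(entry.phones) == true {
            duplicates.append(line)   // safe to delete: nothing is lost
        } else {
            unique.append(line)       // keep and review by hand
        }
    }
    return (duplicates, unique)
}

// Illustrative input only (not copied from the dictionary).
let sample = [
    "dogs D AO1 G Z",
    "dogs' D AO1 G Z",
    "comin' K AH1 M IH0 N",
    "coming K AH1 M IH0 NG",
]
let result = partitionApostropheEntries(sample)
print(result.duplicates)
print(result.unique)
```

Only the first group would be removed; the second group is what would be extracted for analysis.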
b382914 to 6a2d57c
@nshmyrev I redid the operation to only remove exact duplicates. Here is the new script that I used (so that you can apply it yourself if you wish), on macOS:
6a2d57c to cc71253
I've reindexed the non-deleted entries with this script (based on my older stripStress script):
#!/usr/bin/swift
//: dict re-indexer
/// usage: `swift reindex.swift`
/// Author: Antoine Cœur
import Foundation

let origin = "cmudict.dict"
let destination = "cmudict-reindexed.dict"

/// Dictionary that remembers key insertion order.
/// https://stackoverflow.com/a/46046008/1033581
class MutableOrderedDictionary: NSDictionary {
    let _values: NSMutableArray = []
    let _keys: NSMutableOrderedSet = []

    override var count: Int {
        return _keys.count
    }

    override func keyEnumerator() -> NSEnumerator {
        return _keys.objectEnumerator()
    }

    override func object(forKey aKey: Any) -> Any? {
        let index = _keys.index(of: aKey)
        if index != NSNotFound {
            return _values[index]
        }
        return nil
    }

    func setObject(_ anObject: Any, forKey aKey: String) {
        let index = _keys.index(of: aKey)
        if index != NSNotFound {
            _values[index] = anObject
        } else {
            _keys.add(aKey)
            _values.add(anObject)
        }
    }
}

let reindex: Void = {
    let content = try! String(contentsOf: URL(fileURLWithPath: origin), encoding: .utf8)
    let dict = MutableOrderedDictionary()
    // Group 1: headword without any "(n)" suffix. Group 2: the rest of the line.
    let regexp = try! NSRegularExpression(pattern: "^([^ \\(]+)[^ ]* (.*)$", options: .anchorsMatchLines)
    // NSRange counts UTF-16 units, so build it from the string rather than from `content.count`.
    let fullRange = NSRange(content.startIndex..., in: content)
    regexp.enumerateMatches(in: content, options: [], range: fullRange, using: { (result, _, _) in
        let match1 = String(content[Range(result!.range(at: 1), in: content)!])
        let match2 = content[Range(result!.range(at: 2), in: content)!]
        let stripped = String(match2)
        // The ordered set drops repeated pronunciations and keeps first-seen order.
        if let pronunciations = dict[match1] as? NSMutableOrderedSet {
            pronunciations.add(stripped)
        } else {
            dict.setObject(NSMutableOrderedSet(object: stripped), forKey: match1)
        }
    })
    var result = ""
    for (word, phonesList) in dict {
        let (word, phonesList) = (word as! String, phonesList as! NSMutableOrderedSet)
        for (i, phones) in phonesList.enumerated() {
            let phones = phones as! String
            // First pronunciation has no suffix; later ones get "(2)", "(3)", ...
            result.append(word + (i == 0 ? "" : "(\(i + 1))") + " " + phones + "\n")
        }
    }
    try! result.write(to: URL(fileURLWithPath: destination), atomically: true, encoding: .utf8)
}()
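For readers who want to see the effect of the re-indexing without running it on the full file, here is a small in-memory sketch of the same idea as I read it from the script: group pronunciations by headword in first-seen order, drop exact repeats, and renumber the variants. The function name `reindexed` and the sample lines are hypothetical, and it uses plain Swift collections instead of the `NSDictionary` subclass.

```swift
/// Renumbers variant indices so each headword's pronunciations run
/// "word", "word(2)", "word(3)", ... with exact repeats dropped.
func reindexed(_ lines: [String]) -> [String] {
    var order: [String] = []                    // headwords in first-seen order
    var phonesByWord: [String: [String]] = [:]  // pronunciations in first-seen order
    for line in lines {
        guard let space = line.firstIndex(of: " ") else { continue }
        var head = line[..<space]
        // Strip an existing "(n)" suffix, if any.
        if let paren = head.firstIndex(of: "(") { head = head[..<paren] }
        let word = String(head)
        let phones = String(line[line.index(after: space)...])
        if phonesByWord[word] == nil { order.append(word) }
        if phonesByWord[word]?.contains(phones) != true {
            phonesByWord[word, default: []].append(phones)
        }
    }
    var output: [String] = []
    for word in order {
        for (i, phones) in (phonesByWord[word] ?? []).enumerated() {
            output.append(word + (i == 0 ? "" : "(\(i + 1))") + " " + phones)
        }
    }
    return output
}

// Illustrative input: a gap in the numbering and one repeated pronunciation.
let reindexedSample = reindexed([
    "a AH0",
    "a(3) EY1",
    "a(4) AH0",
    "b B IY1",
])
print(reindexedSample)
// prints ["a AH0", "a(2) EY1", "b B IY1"]
```

The gap left by a deleted variant is closed, and the repeated `AH0` line is collapsed into the first one.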
The apostrophes were there for a reason. cmudict was used in ASR to get pronunciations for building a lexicon. One can start dropping things, but it's a bad idea: tidying things up just loses information. A better way is to post-process. That way you can tidy for your needs. The lextool, http://www.speech.cs.cmu.edu/tools/lextool.html, does that. I don't know offhand if it does the '.'s. But it does do stuff like plurals (-S, -ES) that's usually safe.
Command used:
Reasoning:
borrowers': My corrections #17 (comment)
comin', cookin', etc. because they had no equivalent pronunciation without the apostrophe
ol' because it had no equivalent pronunciation without the apostrophe