Removing duplicate words with trailing apostrophe #18

Open
wants to merge 2 commits into
base: master

@Coeur Coeur commented Oct 30, 2018

Command used:

sed -i '' "/[^ln]'[ (]/d" ./cmudict.dict

Reasoning:

  • words ending with a trailing apostrophe should be pronounced the same as the word without it, so those entries are duplicate information
  • if we were to keep plural possessive forms in the dictionary, then there would be a LOT of redundant entries to add
  • some of those entries had incorrect data, like borrowers' (see My corrections #17 (comment))
  • I kept words ending in in', like comin', cookin', etc. because they had no equivalent pronunciation without the apostrophe
  • I kept the word ol' because it had no equivalent pronunciation without the apostrophe
  • In practice, the regex essentially removed words ending in s', x', z', h'
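
To make the effect of the pattern easier to check, here is a minimal Python sketch that applies the same regular expression to a handful of sample lines. The sample entries are illustrative only and have not been verified against cmudict.dict; the names `PATTERN`, `sample`, `kept` and `deleted` are hypothetical.

```python
import re

# Same pattern as in the sed command: an apostrophe that is not preceded
# by "l" or "n" and is followed by a space or an opening parenthesis.
PATTERN = re.compile(r"[^ln]'[ (]")

# Illustrative "word PHONES" lines, not verified against cmudict.dict.
sample = [
    "cats K AE1 T S",
    "cats' K AE1 T S",
    "cats'(2) K AE1 T S",
    "comin' K AH1 M IH0 N",
    "ol' OW1 L",
    "cat's K AE1 T S",
]

# sed's /.../d deletes every line on which the pattern matches.
deleted = [line for line in sample if PATTERN.search(line)]
kept = [line for line in sample if not PATTERN.search(line)]

for line in deleted:
    print("deleted:", line)
```

Only the two `cats'` lines are deleted here: `comin'` and `ol'` are protected by `[^ln]`, and `cat's` does not match because its apostrophe is not followed by a space or a parenthesis.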

Coeur commented Oct 30, 2018

I'd recommend going one step further and also removing the trailing apostrophe from words ending in in'. Those entries would stay in the dictionary; only the apostrophe would be dropped.
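
As a rough illustration of that suggestion (not something that was applied in this pull request), the rename could look like the Python sketch below. The helper name `strip_in_apostrophe` and the sample lines are assumptions made for the example.

```python
import re

# Hypothetical helper: matches a headword ending in "in'" when the apostrophe
# is followed by a space or by a "(n)" variant index.
IN_APOSTROPHE = re.compile(r"^([^ (]*in)'(?=[ (])")

def strip_in_apostrophe(line: str) -> str:
    """Drop the trailing apostrophe of an in' headword, keeping the entry itself."""
    return IN_APOSTROPHE.sub(r"\1", line, count=1)

# Illustrative lines, not verified against cmudict.dict.
for line in ["comin' K AH1 M IH0 N", "cookin'(2) K UH1 K IH0 N", "ol' OW1 L"]:
    print(strip_in_apostrophe(line))
```

A real run would also need to check that the renamed headword does not collide with an existing entry.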

@nshmyrev
Contributor

Hey Antoine, thanks for the patch. Do you really think it is needed? I'm sorry, why not just keep all those words in the dictionary?

Coeur commented Oct 31, 2018

@nshmyrev

if we were to keep plural possessive forms in the dictionary, then there would be a LOT of redundant entries to add

Examples of missing possessive plural forms:

  • abandons'
  • abatements'
  • abductees'
  • ...
  • zoos'
  • zucchinis'
  • zulus'

We have about 20000 entries ending with s or z, but only 765 entries ending with s' or z', and almost all of them duplicate the entry without the trailing apostrophe.

What we could do instead is remove only the exact duplicates and extract the remaining non-duplicates for analysis.
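
For anyone who wants to reproduce this kind of count, a minimal Python sketch follows. It counts distinct headwords rather than lines, so its numbers would not necessarily match the figures quoted above; `headword` and `possessive_coverage` are hypothetical helper names, and the sample lines are illustrative.

```python
def headword(line: str) -> str:
    """Return the word of a 'word PHONES' line without any '(n)' variant index."""
    return line.split(" ", 1)[0].split("(", 1)[0]

def possessive_coverage(lines):
    """Return (plain, with_apostrophe, missing) counts over distinct headwords.

    plain: headwords ending in s or z
    with_apostrophe: headwords ending in s' or z'
    missing: plain headwords that have no counterpart with a trailing apostrophe
    """
    words = {headword(line) for line in lines if line.strip()}
    plain = {w for w in words if w.endswith(("s", "z"))}
    with_apostrophe = {w for w in words if w.endswith(("s'", "z'"))}
    covered = {w for w in plain if w + "'" in words}
    return len(plain), len(with_apostrophe), len(plain - covered)

# Illustrative lines, not verified against cmudict.dict.
sample = [
    "zoos Z UW1 Z",
    "zulus Z UW1 L UW2 Z",
    "cats K AE1 T S",
    "cats' K AE1 T S",
]
print(possessive_coverage(sample))
```

To run it on the real file, one could pass `open("cmudict.dict", encoding="utf-8")` instead of `sample`.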

Coeur commented Oct 31, 2018

@nshmyrev I redid the operation to only remove exact duplicates. Here is the new script that I used (so that you can apply it yourself if you wish), on macOS:

#removing trailing apostrophes and duplicates
perl -i -lne "s/'([ \(])/\1/; print if ! \$x{\$_}++" ./cmudict.dict

#converting to patch
git diff ./cmudict.dict > patch.diff
git checkout ./cmudict.dict

#re-adding trailing apostrophes for non-duplicates
sed -i '' -E "s/\+([a-z-]+)/+\1'/g" ./patch.diff

#applying the patch
git apply patch.diff
rm patch.diff
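
The net effect of these steps can be approximated in a single pass. The Python sketch below is not the script used for this pull request: it removes an apostrophe entry only when the same line without the apostrophe exists verbatim, and, unlike the perl one-liner, it does not collapse other identical lines. The names `TRAILING_APOSTROPHE` and `drop_exact_duplicates` are hypothetical.

```python
import re

# First apostrophe that is followed by a space or "(", i.e. a trailing
# apostrophe on the headword (the same spot the perl substitution targets).
TRAILING_APOSTROPHE = re.compile(r"'(?=[ (])")

def drop_exact_duplicates(lines):
    """Split lines into (kept, removed); removed holds apostrophe entries
    whose apostrophe-free form is already present as an identical line."""
    existing = set(lines)
    kept, removed = [], []
    for line in lines:
        stripped = TRAILING_APOSTROPHE.sub("", line, count=1)
        if stripped != line and stripped in existing:
            removed.append(line)
        else:
            kept.append(line)
    return kept, removed

# Illustrative lines, not verified against cmudict.dict.
sample = [
    "cats K AE1 T S",
    "cats' K AE1 T S",
    "bosses B AA1 S IH0 Z",
    "bosses' B AO1 S IH0 Z",
    "comin' K AH1 M IH0 N",
]
kept, removed = drop_exact_duplicates(sample)
print(removed)
```

The `kept` list still contains the non-duplicate apostrophe entries (here `bosses'` and `comin'`), which is the set one would want to review by hand.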

@Coeur changed the title from "Removing most words with trailing apostrophe" to "Removing duplicate words with trailing apostrophe" on Oct 31, 2018

Coeur commented Nov 1, 2018

I've reindexed the non-deleted entries with this script (based on my older stripStress script):

#!/usr/bin/swift
//: dict re-indexer
/// usage: `swift reindex.swift`
/// Author: Antoine Cœur

import Foundation

let origin = "cmudict.dict"
let destination = "cmudict-reindexed.dict"

/// https://stackoverflow.com/a/46046008/1033581
class MutableOrderedDictionary: NSDictionary {
    let _values: NSMutableArray = []
    let _keys: NSMutableOrderedSet = []
    
    override var count: Int {
        return _keys.count
    }
    override func keyEnumerator() -> NSEnumerator {
        return _keys.objectEnumerator()
    }
    override func object(forKey aKey: Any) -> Any? {
        let index = _keys.index(of: aKey)
        if index != NSNotFound {
            return _values[index]
        }
        return nil
    }
    func setObject(_ anObject: Any, forKey aKey: String) {
        let index = _keys.index(of: aKey)
        if index != NSNotFound {
            _values[index] = anObject
        } else {
            _keys.add(aKey)
            _values.add(anObject)
        }
    }
}

let reindex: Void = {
    let content = try! String(contentsOf: URL(fileURLWithPath: origin), encoding: .utf8)
    let dict = MutableOrderedDictionary()
    let regexp = try! NSRegularExpression(pattern: "^([^ \\(]+)[^ ]* (.*)$", options: .anchorsMatchLines)
    // NSRange counts UTF-16 code units, so use utf16.count rather than the Character count.
    regexp.enumerateMatches(in: content, options: [], range: NSRange(location: 0, length: content.utf16.count), using: { (result, _, _) in
        let match1 = String(content[Range(result!.range(at: 1), in: content)!])
        let match2 = content[Range(result!.range(at: 2), in: content)!]
        let stripped = String(match2)
        if let pronunciations = dict[match1] as? NSMutableOrderedSet {
            pronunciations.add(stripped)
        } else {
            dict.setObject(NSMutableOrderedSet(object: stripped), forKey: match1)
        }
    })
    var result = ""
    for (word, phonesList) in dict {
        let (word, phonesList) = (word as! String, phonesList as! NSMutableOrderedSet)
        for (i, phones) in phonesList.enumerated() {
            let phones = phones as! String
            result.append(word + (i == 0 ? "" : "(\(i + 1))") + " " + phones + "\n")
        }
    }
    try! result.write(to: URL(fileURLWithPath: destination), atomically: true, encoding: .utf8)
}()
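
For readers who do not have Swift at hand, the same re-indexing idea can be sketched in Python. This is an illustrative port under the assumption that every entry has the form `word PHONES` or `word(n) PHONES`; it is not the script used for this pull request, and the function name `reindex` is simply reused for the example.

```python
import re

# Same shape as the Swift pattern: headword, optional "(n)" index, then phones.
ENTRY = re.compile(r"^([^ (]+)[^ ]* (.*)$")

def reindex(lines):
    """Group pronunciations by headword and renumber variants as (2), (3), ..."""
    grouped = {}  # plain dicts preserve insertion order (Python 3.7+)
    for line in lines:
        match = ENTRY.match(line.rstrip("\n"))
        if not match:
            continue  # skip lines that are not dictionary entries
        word, phones = match.groups()
        variants = grouped.setdefault(word, [])
        if phones not in variants:  # ordered-set behaviour: drop repeats
            variants.append(phones)
    out = []
    for word, variants in grouped.items():
        for i, phones in enumerate(variants):
            suffix = "" if i == 0 else f"({i + 1})"
            out.append(f"{word}{suffix} {phones}")
    return out

# Illustrative lines: "b(2)" is missing, as if it had just been deleted.
sample = ["a AH0", "a(2) EY1", "b B IY1", "b(3) B EY1", "b(4) B IY1"]
print("\n".join(reindex(sample)))
```

In this sample the gap left by the missing `b(2)` is closed, and the repeated pronunciation of `b` is dropped, which mirrors what the ordered set does in the Swift version.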

Alexir commented Jun 1, 2023

The apostrophes were there for a reason. cmudict was used in ASR to get pronunciations for building a lexicon. One can start dropping things, but it's a bad idea: tidying things up just loses information. A better way is to post-process. That way you can tidy for your needs. The lextool, http://www.speech.cs.cmu.edu/tools/lextool.html, does that. I don't know offhand if it does the '.'s, but it does do stuff like plurals (-S, -ES) that are usually safe.
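
One way to read the post-processing suggestion is sketched below in Python: the dictionary file stays untouched, and a tidied view is derived in memory for a particular use. This is only an illustration of the idea and says nothing about how lextool works; `load_lexicon` and `without_redundant_apostrophes` are hypothetical names.

```python
def load_lexicon(lines):
    """Build {headword: [pronunciations]} from 'word PHONES' lines, unmodified."""
    lexicon = {}
    for line in lines:
        line = line.strip()
        if not line or " " not in line:
            continue  # skip blank or malformed lines
        head, phones = line.split(" ", 1)
        word = head.split("(", 1)[0]  # drop any "(n)" variant index
        lexicon.setdefault(word, []).append(phones)
    return lexicon

def without_redundant_apostrophes(lexicon):
    """Return a copy that omits apostrophe forms whose pronunciations are
    identical to those of the same word without the trailing apostrophe."""
    return {
        word: prons
        for word, prons in lexicon.items()
        if not (word.endswith("'") and lexicon.get(word[:-1]) == prons)
    }

# Illustrative lines, not verified against cmudict.dict.
full = load_lexicon(["cats K AE1 T S", "cats' K AE1 T S", "ol' OW1 L"])
tidy = without_redundant_apostrophes(full)
print(sorted(tidy))
```

The full lexicon still answers lookups for `cats'`, while the tidied copy is available to tools that prefer not to see the redundant form.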


3 participants