Removing duplicate words with trailing apostrophe #18

Coeur · 2018-10-30T06:12:43Z

Command used:

sed -i '' "/[^ln]'[ (]/d" ./cmudict.dict

Reasoning:

Words with a trailing apostrophe should have the same pronunciation as the same word without it, so those entries are duplicate information.
If we were to keep plural possessive forms in the dictionary, there would be a LOT of redundant entries to add.
Some of those entries had incorrect data, like borrowers': My corrections #17 (comment)
I kept words ending in in' (comin', cookin', etc.) because they had no equivalent pronunciation without the apostrophe.
I kept the word ol' for the same reason.
In practice, the regex essentially removed words ending in s', x', z', h'.
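
To make the effect of that pattern concrete, here is a minimal Python sketch of the same filter. It is an illustration only, not the command used for this PR; the function name and the sample entries (including their phones) are made up for the example.

```python
import re

# Same pattern as the sed command: an apostrophe that is NOT preceded by
# "l" or "n" and is followed by a space or an opening parenthesis.
TRAILING_APOSTROPHE = re.compile(r"[^ln]'[ (]")


def drop_trailing_apostrophe_entries(lines):
    """Return the lines that the sed command would keep."""
    return [line for line in lines if not TRAILING_APOSTROPHE.search(line)]


# Hypothetical entries in cmudict format: headword, then phones.
sample = [
    "borrowers B AA1 R OW0 ER0 Z",
    "borrowers' B AA1 R OW0 ER0 Z",  # s' followed by a space: removed
    "bosses'(2) B AO1 S IH0 Z",      # s' followed by "(": removed
    "comin' K AH1 M IH0 N",          # n' is excluded by [^ln]: kept
    "ol' OW1 L",                     # l' is excluded by [^ln]: kept
    "don't D OW1 N T",               # apostrophe is not trailing: kept
]

kept = drop_trailing_apostrophe_entries(sample)
for line in kept:
    print(line)
```

Note that the filter looks only at the character before the apostrophe, so it never checks whether an entry without the apostrophe actually exists.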

Coeur · 2018-10-30T06:16:29Z

I'd recommend going one step further for words ending in in': strip the trailing apostrophe but keep the entry in the dictionary.
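
As a rough sketch of that suggestion (the helper name and sample entries are hypothetical, and this is not part of the PR), the rewrite could look like this in Python. It only touches the headword and does not check whether the stripped word collides with an existing entry.

```python
import re

# Match a headword ending in "in'" at the start of the line, where the
# apostrophe is followed by a space or by a variant marker such as "(2)".
IN_APOSTROPHE = re.compile(r"^([^ (]*in)'(?=[ (])")


def strip_in_apostrophe(line):
    """Rewrite "comin' ..." as "comin ..." and leave other lines alone."""
    return IN_APOSTROPHE.sub(r"\1", line)


# Hypothetical entries in cmudict format.
print(strip_in_apostrophe("comin' K AH1 M IH0 N"))
print(strip_in_apostrophe("cookin'(2) K UH1 K IH0 N"))
print(strip_in_apostrophe("ol' OW1 L"))
```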

nshmyrev · 2018-10-30T13:53:39Z

Hey Antoine, thanks for the patch. Do you really think it is needed? I'm sorry, why not just keep all those words in the dictionary?

Coeur · 2018-10-31T02:03:34Z

@nshmyrev

if we were to keep plural possessive forms in the dictionary, then there would be a LOT of redundant entries to add

Examples of missing possessive plural forms:

abandons'
abatements'
abductees'
...
zoos'
zucchinis'
zulus'

We have about 20000 entries ending with s or z, but only 765 entries ending with s' or z', and almost all of those duplicate the entry without the trailing apostrophe.

What we could do instead is remove only the exact duplicates and extract the remaining non-duplicates for analysis.
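
The split proposed here could be sketched as follows. This is a minimal Python illustration under the assumption that "exact duplicate" means the line is identical to an existing line once the trailing apostrophe is removed from the headword; the function name and the sample phones are invented.

```python
def split_apostrophe_entries(lines):
    """Split entries whose headword ends in an apostrophe into
    (exact_duplicates, non_duplicates)."""
    existing = set(lines)
    duplicates, non_duplicates = [], []
    for line in lines:
        head, _, phones = line.partition(" ")
        # Separate an optional variant marker such as "(2)" from the word.
        base, paren, variant = head.partition("(")
        if not base.endswith("'"):
            continue
        stripped = base[:-1] + paren + variant + " " + phones
        if stripped in existing:
            duplicates.append(line)
        else:
            non_duplicates.append(line)
    return duplicates, non_duplicates


# Hypothetical entries; the phones are made up for the example.
entries = [
    "bosses B AO1 S IH0 Z",
    "bosses' B AO1 S IH0 Z",
    "borrowers B AA1 R OW0 ER0 Z",
    "borrowers' B AO1 R OW0 ER0 Z",
    "comin' K AH1 M IH0 N",
]
dups, rest = split_apostrophe_entries(entries)
print("safe to remove:", dups)
print("needs review:", rest)
```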

Coeur · 2018-10-31T06:37:50Z

@nshmyrev I redid the operation to only remove exact duplicates. Here is the new script that I used (so that you can apply it yourself if you wish), on macOS:

# Strip trailing apostrophes in place, then drop any line identical to an earlier one
perl -i -lne "s/'([ \(])/\1/; print if ! \$x{\$_}++" ./cmudict.dict

# Capture the result as a patch and restore the original file
git diff ./cmudict.dict > patch.diff
git checkout ./cmudict.dict

# Re-add the trailing apostrophe on added lines, i.e. the non-duplicates
sed -i '' -E "s/\+([a-z-]+)/+\1'/g" ./patch.diff

# Apply the patch, which now only deletes exact duplicates
git apply patch.diff
rm patch.diff
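
For readers less familiar with the perl idiom `print if ! $x{$_}++`, here is a hedged Python sketch of what that first step does: strip the first apostrophe that precedes a space or "(", then keep only the first occurrence of each resulting line. The function name and sample entries are hypothetical, and the git and sed steps that restore the non-duplicates are not reproduced.

```python
import re

# Same substitution as the perl command, applied once per line.
APOSTROPHE_BEFORE_SEPARATOR = re.compile(r"'([ (])")


def strip_and_dedupe(lines):
    """Strip trailing apostrophes, then drop lines already seen,
    preserving the original order."""
    seen = set()
    result = []
    for line in lines:
        line = APOSTROPHE_BEFORE_SEPARATOR.sub(r"\1", line, count=1)
        if line not in seen:
            seen.add(line)
            result.append(line)
    return result


# Hypothetical entries in cmudict format.
cleaned = strip_and_dedupe([
    "bosses B AO1 S IH0 Z",
    "bosses' B AO1 S IH0 Z",     # becomes a repeat of the line above
    "ol' OW1 L",                 # stripped here, restored later via the patch
    "bosses'(2) B AO1 S AH0 Z",  # stripped but not a repeat, so it stays
])
for line in cleaned:
    print(line)
```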

Coeur · 2018-11-01T02:54:29Z

I've reindexed the non-deleted entries with this script (based on my older stripStress script):

#!/usr/bin/swift
//: dict re-indexer
/// usage: `swift reindex.swift`
/// Author: Antoine Cœur

import Foundation

let origin = "cmudict.dict"
let destination = "cmudict-reindexed.dict"

/// https://stackoverflow.com/a/46046008/1033581
class MutableOrderedDictionary: NSDictionary {
    let _values: NSMutableArray = []
    let _keys: NSMutableOrderedSet = []
    
    override var count: Int {
        return _keys.count
    }
    override func keyEnumerator() -> NSEnumerator {
        return _keys.objectEnumerator()
    }
    override func object(forKey aKey: Any) -> Any? {
        let index = _keys.index(of: aKey)
        if index != NSNotFound {
            return _values[index]
        }
        return nil
    }
    func setObject(_ anObject: Any, forKey aKey: String) {
        let index = _keys.index(of: aKey)
        if index != NSNotFound {
            _values[index] = anObject
        } else {
            _keys.add(aKey)
            _values.add(anObject)
        }
    }
}

let reindex: Void = {
    let content = try! String(contentsOf: URL(fileURLWithPath: origin), encoding: .utf8)
    let dict = MutableOrderedDictionary()
    let regexp = try! NSRegularExpression(pattern: "^([^ \\(]+)[^ ]* (.*)$", options: .anchorsMatchLines)
    // NSRange counts UTF-16 code units, so use utf16.count rather than count.
    regexp.enumerateMatches(in: content, options: [], range: NSRange(location: 0, length: content.utf16.count), using: { (result, _, _) in
        let match1 = String(content[Range(result!.range(at: 1), in: content)!])
        let match2 = content[Range(result!.range(at: 2), in: content)!]
        let stripped = String(match2)
        if let pronunciations = dict[match1] as? NSMutableOrderedSet {
            pronunciations.add(stripped)
        } else {
            dict.setObject(NSMutableOrderedSet(object: stripped), forKey: match1)
        }
    })
    var result = ""
    for (word, phonesList) in dict {
        let (word, phonesList) = (word as! String, phonesList as! NSMutableOrderedSet)
        for (i, phones) in phonesList.enumerated() {
            let phones = phones as! String
            result.append(word + (i == 0 ? "" : "(\(i + 1))") + " " + phones + "\n")
        }
    }
    try! result.write(to: URL(fileURLWithPath: destination), atomically: true, encoding: .utf8)
}()
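
For comparison, the same re-indexing idea can be sketched in a few lines of Python. This is an approximation of the Swift script above, not a drop-in replacement: it takes and returns lists of lines instead of reading and writing files, the function name is hypothetical, and it relies on Python dicts preserving insertion order (3.7+).

```python
import re

# Same pattern as the Swift script: the headword up to the first space or
# "(", an optional variant marker, then the phones after the first space.
ENTRY = re.compile(r"^([^ (]+)[^ ]* (.*)$")


def reindex(lines):
    """Group pronunciations by headword and renumber variants 1..n."""
    grouped = {}
    for line in lines:
        match = ENTRY.match(line)
        if match is None:
            continue
        word, phones = match.group(1), match.group(2)
        # An inner dict is used as an ordered set of pronunciations.
        grouped.setdefault(word, {})[phones] = None
    result = []
    for word, pronunciations in grouped.items():
        for i, phones in enumerate(pronunciations):
            suffix = "" if i == 0 else f"({i + 1})"
            result.append(f"{word}{suffix} {phones}")
    return result


# Hypothetical input with a gap in the numbering after a deletion.
print(reindex(["a AH0", "a(3) EY1", "b B IY1"]))
```

As in the Swift version, identical pronunciations for the same headword collapse into one entry.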

Alexir · 2023-06-01T12:41:54Z

The apostrophes were there for a reason. cmudict was used in ASR to get pronunciations for building a lexicon. One can start dropping things, but it's a bad idea: tidying things up just loses information. A better way is to post-process. That way you can tidy for your needs. The lextool, http://www.speech.cs.cmu.edu/tools/lextool.html, does that. I don't know offhand if it does the '.'s. But it does do stuff like plurals (-S, -ES) that are usually safe.
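
One way to read the post-processing suggestion is to leave cmudict.dict untouched and tidy when loading it. The Python sketch below illustrates that idea only; it says nothing about how lextool works, and the function names, the `keep` parameter, and the sample entries are all assumptions made for the example.

```python
def load_lexicon(lines, keep=lambda word: True):
    """Build {word: [pronunciations]} from cmudict-style lines, keeping
    only the headwords accepted by the `keep` predicate."""
    lexicon = {}
    for line in lines:
        head, _, phones = line.partition(" ")
        word = head.split("(", 1)[0]  # drop a variant marker such as "(2)"
        if not keep(word):
            continue
        variants = lexicon.setdefault(word, [])
        if phones not in variants:
            variants.append(phones)
    return lexicon


def no_trailing_apostrophe(word):
    """Example filter: skip headwords that end in an apostrophe."""
    return not word.endswith("'")


# Hypothetical entries; the source data stays intact either way.
source = [
    "bosses B AO1 S IH0 Z",
    "bosses' B AO1 S IH0 Z",
    "ol' OW1 L",
]
full = load_lexicon(source)
tidy = load_lexicon(source, keep=no_trailing_apostrophe)
print(sorted(full))
print(sorted(tidy))
```

Each consumer can then choose its own filter without information being removed from the shared dictionary.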

Coeur force-pushed the clean-trailing-apostrophes branch from b382914 to 6a2d57c Compare October 31, 2018 06:33

removing duplicate entries with a trailing apostrophe

cc71253

Coeur force-pushed the clean-trailing-apostrophes branch from 6a2d57c to cc71253 Compare October 31, 2018 06:52

Coeur changed the title ~~Removing most words with trailing apostrophe~~ Removing duplicate words with trailing apostrophe Oct 31, 2018

reindexing

3b0cc62

Coeur mentioned this pull request Nov 1, 2018

List of words with apostrophe and inconsistent pronunciations #19

Open
