Want a Model trained on a PDF subset of the web #6978
gitclem started this conversation in New Features & Project Ideas
Replies: 1 comment
-
I agree that it would be nice. The issue is where the data comes from. It's easy to scrape Wikipedia, Reddit, or whatever, but to train a model on legal contracts or other more specific document types you need to find the thousands or millions of them to use for training and annotate them.
-
The problem with including all web pages when building a language model is that a lot of web writing is informal, with bad, inconsistent, or misleading punctuation, capitalization, spelling, etc.
E.g., a PDF paper about databases will consistently write acronyms such as ACID (Atomicity, Consistency, Isolation, Durability) or CRUD (Create, Read, Update, and Delete) in all uppercase, whereas a casual post on Stack Overflow might not be so rigorous about the casing of those words.
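To make the casing point concrete, here is a minimal sketch (assuming spaCy v3 with en_core_web_sm installed; both sentences are made up for illustration) that shows how the same acronyms get tokenized and tagged when cased consistently versus casually:

```python
# Minimal sketch: compare how the same acronyms are tagged when cased
# consistently (PDF-style prose) vs. written casually. Assumes spaCy v3
# and the en_core_web_sm model; both sentences are invented examples.
import spacy

nlp = spacy.load("en_core_web_sm")

formal = nlp("The engine guarantees ACID transactions and exposes CRUD operations.")
casual = nlp("yeah my app does acid transactions lol, just basic crud stuff")

for doc in (formal, casual):
    print([(t.text, t.pos_, t.ent_type_ or "-") for t in doc])
```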
Another situation: a new word never before seen by spaCy, but with an initial capital letter, a capital letter in the middle of the word, or even in all uppercase, should be treated as more likely to be a proper noun when found in a PDF than on some random web page where the normal rules of English have gone out the window (or the text was at least never proofread). On such a page, all uppercase might just be random emphasis that the writer felt like INSERTING FOR NO GOOD REASON!
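A rough sketch of the kind of casing heuristic this implies, using spaCy's token shape and casing attributes. This is not something spaCy does differently for PDFs today, and "Xylotran" is an invented out-of-vocabulary word:

```python
# Rough sketch of a casing heuristic for formal text: treat unseen words with
# an unusual shape (title case, all caps) as likely proper nouns.
# Not spaCy's built-in behaviour; "Xylotran" is a made-up unseen word.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The Xylotran system stores contract metadata in a relational database.")

for token in doc:
    # token.shape_ encodes casing, e.g. "Xxxxx" for title case, "XXXX" for all caps
    looks_proper = (token.is_title or token.is_upper) and token.is_alpha and not token.is_sent_start
    print(f"{token.text:12} shape={token.shape_:8} pos={token.pos_:6} "
          f"{'likely proper noun in formal text' if looks_proper else ''}")
```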
As a practical matter, a colleague of mine is working with legal contracts, where the text is carefully written. spaCy does a relatively poor job at NER on his documents compared to naive rules using case and other tricks with Lex.
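As a sketch of what such naive case rules could look like in spaCy itself (not my colleague's actual Lex grammar; the contract sentence is invented), the Matcher can pull out runs of capitalized tokens as candidate names:

```python
# Sketch of naive case-based candidate extraction with spaCy's Matcher:
# grab runs of title-case tokens, optionally joined by "&"/"and" and
# followed by an all-caps suffix such as "LLP". Assumes spaCy v3.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

pattern = [
    {"IS_TITLE": True},
    {"TEXT": {"IN": ["&", "and"]}, "OP": "?"},
    {"IS_TITLE": True, "OP": "+"},
    {"IS_UPPER": True, "OP": "?"},  # e.g. a trailing "LLP"
]
matcher.add("CANDIDATE_NAME", [pattern], greedy="LONGEST")

doc = nlp("This Agreement is made between Dewey, Cheatem & Howe LLP and the Client.")
for _, start, end in matcher(doc):
    print(doc[start:end].text)
```

The false positives such rules produce (e.g. a sentence-initial "This Agreement") are exactly why "naive" is the right word, but on carefully cased legal text they still compare surprisingly well with the statistical model.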
As a hypothetical example,
Dewey, Cheatem & Howe LLP
would be more likely to be seen as a law firm instead of three separate names.
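A hedged sketch of how a rule could enforce that today (assuming spaCy v3 and en_core_web_sm; the firm name is the hypothetical one above): an EntityRuler pattern that labels a capitalized span ending in "LLP" as a single ORG before the statistical NER runs.

```python
# Hedged sketch: force a capitalized span ending in "LLP" to be one ORG entity
# by adding an EntityRuler before the statistical NER. Assumes spaCy v3 and
# en_core_web_sm; "Dewey, Cheatem & Howe LLP" is the hypothetical firm above.
import spacy

nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {
        "label": "ORG",
        "pattern": [
            {"IS_TITLE": True, "OP": "+"},
            {"TEXT": ",", "OP": "?"},
            {"IS_TITLE": True, "OP": "*"},
            {"TEXT": "&", "OP": "?"},
            {"IS_TITLE": True, "OP": "*"},
            {"TEXT": "LLP"},
        ],
    }
])

doc = nlp("Dewey, Cheatem & Howe LLP represents the plaintiff in this matter.")
print([(ent.text, ent.label_) for ent in doc.ents])
```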