Want a Model trained on a PDF subset of the web #6978
gitclem started this conversation in New Features & Project Ideas
Replies: 1 comment
-
I agree that it would be nice. The issue is where the data comes from. It's easy to scrape Wikipedia, Reddit, or whatever, but to train a model on legal contracts or other more specific document types you need to find the thousands or millions of them to use for training and annotate them.
-
The problem with including all web pages when building a language model is that a lot of web writing is informal, with bad, inconsistent, or misleading punctuation, capitalization, spelling, etc.
E.g., a PDF paper about databases will consistently write acronyms such as ACID (Atomicity, Consistency, Isolation, Durability) or CRUD (Create, Read, Update, and Delete) in all uppercase, whereas a casual post on Stack Overflow might not be so rigorous about the casing of those words.
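To make the casing point concrete, here is a minimal sketch (assuming spaCy v3 with en_core_web_sm installed; both sentences are made up for illustration) that shows how the same acronyms get tokenized and tagged when cased consistently versus casually:

```python
# Minimal sketch: compare how the same acronyms are tagged when cased
# consistently (PDF-style prose) vs. written casually. Assumes spaCy v3
# and the en_core_web_sm model; both sentences are invented examples.
import spacy

nlp = spacy.load("en_core_web_sm")

formal = nlp("The engine guarantees ACID transactions and exposes CRUD operations.")
casual = nlp("yeah my app does acid transactions lol, just basic crud stuff")

for doc in (formal, casual):
    print([(t.text, t.pos_, t.ent_type_ or "-") for t in doc])
```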
Another situation: a new word never before seen by spaCy, but with an initial capital letter, a capital letter in the middle of the word, or even in all uppercase, should be treated as more likely to be a proper noun when found in a PDF than on some random web page where the normal rules of English have gone out the window (or the text was at least never proofread). On such a page, all uppercase might just be random emphasis that the writer felt like INSERTING FOR NO GOOD REASON!
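A rough sketch of the kind of casing heuristic this implies, using spaCy's token shape and casing attributes. This is not something spaCy does differently for PDFs today, and "Xylotran" is an invented out-of-vocabulary word:

```python
# Rough sketch of a casing heuristic for formal text: treat unseen words with
# an unusual shape (title case, all caps) as likely proper nouns.
# Not spaCy's built-in behaviour; "Xylotran" is a made-up unseen word.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The Xylotran system stores contract metadata in a relational database.")

for token in doc:
    # token.shape_ encodes casing, e.g. "Xxxxx" for title case, "XXXX" for all caps
    looks_proper = (token.is_title or token.is_upper) and token.is_alpha and not token.is_sent_start
    print(f"{token.text:12} shape={token.shape_:8} pos={token.pos_:6} "
          f"{'likely proper noun in formal text' if looks_proper else ''}")
```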
As a practical matter, a colleague of mine is working with legal contracts, where the text is carefully written. spaCy does a relatively poor job at NER on his documents compared to naive rules using case and other tricks with Lex.
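As a sketch of what such naive case rules could look like in spaCy itself (not my colleague's actual Lex grammar; the contract sentence is invented), the Matcher can pull out runs of capitalized tokens as candidate names:

```python
# Sketch of naive case-based candidate extraction with spaCy's Matcher:
# grab runs of title-case tokens, optionally joined by "&"/"and" and
# followed by an all-caps suffix such as "LLP". Assumes spaCy v3.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

pattern = [
    {"IS_TITLE": True},
    {"TEXT": {"IN": ["&", "and"]}, "OP": "?"},
    {"IS_TITLE": True, "OP": "+"},
    {"IS_UPPER": True, "OP": "?"},  # e.g. a trailing "LLP"
]
matcher.add("CANDIDATE_NAME", [pattern], greedy="LONGEST")

doc = nlp("This Agreement is made between Dewey, Cheatem & Howe LLP and the Client.")
for _, start, end in matcher(doc):
    print(doc[start:end].text)
```

The false positives such rules produce (e.g. a sentence-initial "This Agreement") are exactly why "naive" is the right word, but on carefully cased legal text they still compare surprisingly well with the statistical model.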
As a hypothetical example,
Dewey, Cheatem & Howe LLP
would be more likely to be seen as a law firm instead of three separate names.
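A hedged sketch of how a rule could enforce that today (assuming spaCy v3 and en_core_web_sm; the firm name is the hypothetical one above): an EntityRuler pattern that labels a capitalized span ending in "LLP" as a single ORG before the statistical NER runs.

```python
# Hedged sketch: force a capitalized span ending in "LLP" to be one ORG entity
# by adding an EntityRuler before the statistical NER. Assumes spaCy v3 and
# en_core_web_sm; "Dewey, Cheatem & Howe LLP" is the hypothetical firm above.
import spacy

nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {
        "label": "ORG",
        "pattern": [
            {"IS_TITLE": True, "OP": "+"},
            {"TEXT": ",", "OP": "?"},
            {"IS_TITLE": True, "OP": "*"},
            {"TEXT": "&", "OP": "?"},
            {"IS_TITLE": True, "OP": "*"},
            {"TEXT": "LLP"},
        ],
    }
])

doc = nlp("Dewey, Cheatem & Howe LLP represents the plaintiff in this matter.")
print([(ent.text, ent.label_) for ent in doc.ents])
```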