How to encode keywords in spacy/scispacy? #5243
Replies: 1 comment
-
The vectors in these models are restricted to the most common set of words from the data they were trained on. The token "proteinA" is not such a common word, so it's expected that it gets a zero vector. But if you try actual gene/protein names, like "BRCA1" or "ESR", you should get a non-zero vector.
That said, I'm not even sure you'd want to encode the actual names of the genes and proteins. Back in the days when I was doing BioNLP myself, I would actually blind the protein & gene names in the text, to create more generic features. You don't want the NLP algorithm to learn that two specific proteins bind with each other; you want it to learn the grammatical/lexical structures used to describe that binding, without being biased towards the actual names. You want it to start recognizing patterns like "PROT interacts with PROT", without caring what the actual PROT is, because then you'll also be able to pick up novel interactions from the literature.
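To make both points concrete, here is a minimal sketch (assuming en_core_sci_lg is installed; PROTEIN_NAMES is a hypothetical lexicon standing in for your own list or for entity spans from an NER model):
import spacy

nlp = spacy.load("en_core_sci_lg")

# is_oov is True for tokens outside the model's vector vocabulary,
# and those tokens get an all-zero .vector
for text in ["proteinA", "BRCA1", "ESR"]:
    token = nlp(text)[0]
    print(text, token.is_oov, token.has_vector, token.vector_norm)

# Blinding: replace known protein/gene mentions with a generic placeholder
# before encoding, so the model learns the surrounding structure instead.
PROTEIN_NAMES = {"proteinA", "proteinB", "BRCA1", "ESR"}  # hypothetical lexicon

def blind_proteins(text):
    return " ".join("PROT" if t.text in PROTEIN_NAMES else t.text for t in nlp(text))

print(blind_proteins("proteinA interacts with proteinB"))  # PROT interacts with PROT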
-
Hello,
I'm currently working on some biomedical text mining.
The idea is: I have a set of abstracts, 50 words each, and each one contains the same two keywords, e.g. proteinA and proteinB.
The output is whether or not the two proteins are connected in that abstract, i.e. a 0 or 1 label.
I have attempted to encode each word of the abstract with spacy or scispacy and feed the resulting (50, word_vector_length) matrix to my neural network.
The trouble is, neither spacy nor scispacy seems able to encode the keywords.
import spacy
spa = spacy.load("en_core_web_lg")
sci = spacy.load("en_core_sci_lg")
print(spa("proteinA").vector)  # results in a zero vector
print(spa("proteinB").vector)  # results in a zero vector
print(sci("proteinA").vector)  # results in a zero vector
print(sci("proteinB").vector)  # results in a zero vector
What can I do about this situation?
Are there special keywords that either of these embeddings will latch onto?
Can I somehow train these embeddings to suit my needs?
Perhaps there's a way to plug spacy vectors into the Keras Embedding layer?
Thanks!