How to encode keywords in spacy/scispacy? #5243
Replies: 1 comment
-
The vectors in these models are restricted to the most common set of words from the data they were trained on. The token "proteinA" is not such a common word, so it's expected that it gets a zero vector. But if you try actual gene/protein names, like "BRCA1" or "ESR", you should get a non-zero vector.
That said, I'm not even sure you'd want to encode the actual names of the genes and proteins. Back in the days when I was doing BioNLP myself, I would actually blind the protein & gene names in the text, to create more generic features. You don't want the NLP algorithm to learn that two specific proteins bind with each other; you want it to learn the grammatical/lexical structures used to describe that binding, without being biased towards the actual names. You want it to start recognizing patterns like "PROT interacts with PROT", without caring what the actual PROT is, because then you'll also be able to pick up novel interactions from the literature.
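To make both points concrete, here is a minimal sketch (assuming en_core_sci_lg is installed; PROTEIN_NAMES is a hypothetical lexicon standing in for your own list or for entity spans from an NER model):
import spacy

nlp = spacy.load("en_core_sci_lg")

# is_oov is True for tokens outside the model's vector vocabulary,
# and those tokens get an all-zero .vector
for text in ["proteinA", "BRCA1", "ESR"]:
    token = nlp(text)[0]
    print(text, token.is_oov, token.has_vector, token.vector_norm)

# Blinding: replace known protein/gene mentions with a generic placeholder
# before encoding, so the model learns the surrounding structure instead.
PROTEIN_NAMES = {"proteinA", "proteinB", "BRCA1", "ESR"}  # hypothetical lexicon

def blind_proteins(text):
    return " ".join("PROT" if t.text in PROTEIN_NAMES else t.text for t in nlp(text))

print(blind_proteins("proteinA interacts with proteinB"))  # PROT interacts with PROT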
-
Hello,
I'm currently working on some biomedical text mining.
The idea is: I have a set of abstracts, 50 words each, and each one contains the same two keywords, e.g. proteinA and proteinB.
The output is whether or not the two proteins are connected in that abstract, i.e. a 0 or 1 label.
I have attempted to encode each word of the abstract with spacy or scispacy and feed the resulting (50, word_vector_length) matrix to my neural network.
The trouble is, neither spacy nor scispacy seems able to encode the keywords.
import spacy
spa = spacy.load("en_core_web_lg")
sci = spacy.load("en_core_sci_lg")
print(spa("proteinA").vector)  # results in a zero vector
print(spa("proteinB").vector)  # results in a zero vector
print(sci("proteinA").vector)  # results in a zero vector
print(sci("proteinB").vector)  # results in a zero vector
What can I do about this situation?
Are there special keywords that either of these embeddings will latch onto?
Can I somehow train these embeddings to suit my needs?
Perhaps there's a way to plug spacy vectors into the Keras Embedding layer?
Thanks!