token attribute to return containing ent #4898

jack-rory-staunton · 2020-01-10T21:02:37Z

Feature description

Converting things back and forth between tokens, spans and entities is often laborious. A function like

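def get_ent_from_token(token):
    # Linear scan over the entities of the token's own doc (end_char is exclusive).
    # Raises IndexError if the token is not part of any entity.
    return [ent for ent in token.doc.ents
            if ent.start_char <= token.idx < ent.end_char][0]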

simply returns the entity of which the supplied token is a part. I submit that Token should have a Token.ent attribute that supplies the entity's span if one exists, or returns [] if tok.ent_type == 0.

svlandeg (Maintainer) · 2020-01-12T20:10:30Z

I'm not sure we'd want to have this as part of the core library, because the function would be pretty inefficient: it needs to loop through all of doc.ents to find the correct one every time you query this type of information for a specific token. A user wouldn't necessarily know that it's an inefficient function, might assume there's some sort of mapping or cache behind it, and could end up with code that slows down depending on the length of the docs and the number of entities in them.

However, what I would suggest is that if you want this kind of functionality for your specific use case, you could loop through all entities ONCE and set a custom attribute that keeps track of the information you want. That should keep things efficient, instead of looping through them each time.
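
A minimal sketch of the single-pass idea, written in plain Python so it runs without spaCy. The names Ent and labels_per_token are hypothetical and introduced only for illustration; Ent merely mimics the start, end (exclusive) and label_ fields of a spaCy entity span.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy entity span: token offsets (end exclusive) plus a label.
Ent = namedtuple("Ent", ["start", "end", "label_"])


def labels_per_token(n_tokens, ents):
    """Visit every entity once and record its label for each token it covers."""
    labels = [None] * n_tokens
    for ent in ents:
        for i in range(ent.start, ent.end):
            labels[i] = ent.label_
    return labels


# "Apple is opening an office in San Francisco" has 8 tokens.
ents = [Ent(0, 1, "ORG"), Ent(6, 8, "GPE")]
labels = labels_per_token(8, ents)
```

With real spaCy objects, the inner loop could instead assign to a custom token attribute registered through Token.set_extension, so that each later query is a plain attribute read rather than a scan of doc.ents.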

jack-rory-staunton (Author) · 2020-01-12T20:44:39Z

Thanks Sofie,

You're right about the inefficiency of the function I suggested, of course. I was just trying to show the desired functionality. If I'm reading your suggestion correctly, I would loop through the entities, and for each entity, loop through the tokens in doc[ent.start : ent.end] and set my custom attribute on each of those tokens. But what exactly would the attribute be? I'd like it to be analogous to Span.ents, which gives a list of the ents (i.e., the Span objects associated with the ents) contained by the span in question. If I store the span on the token, then I cannot serialize my docs to disk (and I'm not sure how to write something like doc.to_disk(path, exclude=[<token._.ent>])).
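
One possible way around the serialization concern, sketched under assumptions: store the entity's token offsets on each token instead of the Span object itself. The names Ent and ent_bounds_per_token are hypothetical, and JSON is used here only to show that the stored values are plain data. It is not meant to represent how spaCy writes docs to disk.

```python
import json
from collections import namedtuple

# Hypothetical stand-in for a spaCy entity span (token offsets, end exclusive).
Ent = namedtuple("Ent", ["start", "end", "label_"])


def ent_bounds_per_token(n_tokens, ents):
    """For each token, store [start, end] of its entity, or None if it has none."""
    bounds = [None] * n_tokens
    for ent in ents:
        for i in range(ent.start, ent.end):
            bounds[i] = [ent.start, ent.end]
    return bounds


ents = [Ent(0, 1, "ORG"), Ent(6, 8, "GPE")]
bounds = ent_bounds_per_token(8, ents)

# Plain integers survive a serialization round trip unchanged.
restored = json.loads(json.dumps(bounds))
```

With spaCy, the span could then be rebuilt on demand by slicing the doc with the stored offsets, as in doc[start:end].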

It does seem like this would be a good use case for the .ent_id attribute, which is (for some reason?) unfortunately not writable from the span. I hope to be able to use ent.id eventually.

svlandeg (Maintainer) · 2020-01-13T07:11:59Z

Hi @jack-rory-staunton, I was assuming that you want to access some sort of property from the entities on the token level, such as the NER type or so, and that you wanted to do something like get_ent_from_token(token).label_, in which case you could store the label as token._.label instead. But it depends on your application, of course.

If you really want to be able to access the actual Span object from the token, perhaps the best workaround is to just store a mapping in your code. That will prevent copying over too much information or inefficient looping.
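
A rough sketch of what such a mapping could look like, again in plain Python with a hypothetical Ent stand-in rather than real spaCy spans. The helper names ent_by_scan and build_tok_to_ent are made up for illustration. The point is only that a dictionary built once answers the same question as a scan that is repeated on every call.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy entity span (token offsets, end exclusive).
Ent = namedtuple("Ent", ["start", "end", "label_"])


def ent_by_scan(ents, token_i):
    """Linear scan: the cost of every call grows with the number of entities."""
    for ent in ents:
        if ent.start <= token_i < ent.end:
            return ent
    return None


def build_tok_to_ent(ents):
    """Built once: maps each covered token index to its entity."""
    return {i: ent for ent in ents for i in range(ent.start, ent.end)}


ents = [Ent(0, 1, "ORG"), Ent(6, 8, "GPE")]
tok_to_ent = build_tok_to_ent(ents)

# Dictionary lookups agree with the scan for every token position.
agree = all(tok_to_ent.get(i) == ent_by_scan(ents, i) for i in range(8))
```

Because the dictionary lives in your own code rather than on the doc, it does not get in the way of serializing the doc, but it does have to be rebuilt whenever the entities change.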

jack-rory-staunton (Author) · 2020-01-13T14:19:52Z

Thanks again

So I think this should work:

from spacy.tokens import Doc

Doc.set_extension('_tok_to_ent', default={}, force=True)

def update_tok_to_ent(doc):
    doc._._tok_to_ent = {tok:ent for ent in doc.ents 
                                for tok in ent}

Doc.set_extension('update_tok_to_ent', method=update_tok_to_ent, force=True)

def get_tok_to_ent(doc):
    return doc._._tok_to_ent

Doc.set_extension('tok_to_ent', getter=get_tok_to_ent, force=True)

# must call to initialize and/or reset after changes to entities or tokens 
mydoc._.update_tok_to_ent()
# to access the mapping for whatever
x = mydoc._.tok_to_ent
# such as
ent_of_mytoken = mydoc._.tok_to_ent[mytoken]

It seems like a fair enough solution. I'll still need to rebuild the mapping whenever I change an entity (e.g. when retokenizing or over-writing a span).

Perhaps it's not directly related, but why are Token.ent_id and Span.ent_id not writable? It would seem that the functionality I want is already somewhere inside spaCy lying dormant.
