informagi/mmead

MMEAD: MS MARCO Entity Annotations and Disambiguations

What is MMEAD?

MMEAD, or MS MARCO Entity Annotations and Disambiguations, is a resource of entity links for the MS MARCO dataset. It proposes a JSON specification for sharing entity links so that they are easier to reuse. Entity links produced by the Radboud Entity Linker (REL) are provided, together with code for working with the data.

See also the MMEAD documentation.

How to use

MMEAD provides an API to easily use the data we provide.

Entity Links

If you load a class that uses the entity links, the data is automatically downloaded the first time you use it. The following code will load the entity links for the MSMARCO v1 passage collection:

>>> from mmead import get_links
>>> links = get_links('v1', 'passage', verbose=False, linker='rel')

The first call may take some time because the data has to be downloaded. After that the data is kept in a local cache, so later lookups are fast:

>>> print(links.load_links_from_docid(123))
{"passage":[{"entity_id":"7954681","start_pos":"126","end_pos":"134","entity":"Montreal"}],"pid":"123"}
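The record printed above looks like a JSON string in which identifiers and offsets are encoded as strings. The following is a minimal sketch, assuming the return value is a JSON string (a dict is also accepted), that parses the record and casts the numeric fields to integers. The helper name `parse_passage_links` is hypothetical and not part of the MMEAD API:

```python
import json

# Sample record copied from the output above.
raw = '{"passage":[{"entity_id":"7954681","start_pos":"126","end_pos":"134","entity":"Montreal"}],"pid":"123"}'


def parse_passage_links(record):
    """Parse a passage record and cast ids and offsets to int.

    Hypothetical helper, not part of the MMEAD API.
    """
    # Accept either a JSON string or an already-decoded dict.
    data = json.loads(record) if isinstance(record, str) else record
    entities = []
    for ent in data["passage"]:
        entities.append({
            **ent,
            "entity_id": int(ent["entity_id"]),
            "start_pos": int(ent["start_pos"]),
            "end_pos": int(ent["end_pos"]),
        })
    return {"pid": data["pid"], "passage": entities}


parsed = parse_passage_links(raw)
for ent in parsed["passage"]:
    print(ent["entity"], ent["entity_id"], ent["start_pos"], ent["end_pos"])
```

Casting once at load time keeps later arithmetic on offsets and comparisons of identifiers free of string handling.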

Embeddings

We also provide the Wikipedia2Vec embeddings of the Wikipedia dump that we linked to. Wikipedia2Vec contains both word and entity embeddings, and both can be retrieved:

>>> from mmead import get_embeddings
>>> e = get_embeddings(300, verbose=False)
>>> montreal_word = e.load_word_embedding("Montreal")
>>> montreal_word[:5]
[-0.1258 -0.5049 -0.0563  0.4908  0.3244]

The dot product can be used to measure similarity:

>>> montreal_word = e.load_word_embedding("Montreal")
>>> montreal_entity = e.load_entity_embedding("Montreal")
>>> green_word = e.load_word_embedding("green")

>>> montreal_word @ montreal_entity
31.83191792
>>> montreal_word @ green_word
5.55568354
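Raw dot products grow with vector length, so scores for different pairs are not always directly comparable. One common alternative is cosine similarity, which divides the dot product by both norms. The sketch below uses made-up three-dimensional vectors, not real Wikipedia2Vec values, and the function `cosine_similarity` is defined here rather than provided by MMEAD. It works on any sequence of floats, so the arrays returned by `load_word_embedding` and `load_entity_embedding` could presumably be passed in directly:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors for illustration only.
a = [1.0, 2.0, 2.0]
b = [2.0, 4.0, 4.0]   # same direction as a, twice the length
c = [2.0, -1.0, 0.0]  # orthogonal to a

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```

The result always lies between -1 and 1, which makes it easier to compare a word-entity pair against a word-word pair.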

Mappings

A mapping between entity text and entity id is also available, in both directions:

>>> from mmead import get_mappings
>>> m = get_mappings(verbose=False)
>>> m.get_id_from_entity('Montreal')
7954681
>>> m.get_entity_from_id(7954681)
'Montreal'

Available data:

The following data is available through MMEAD:

Data using REL:

  • Passage Links
  • Query Links
  • Mappings
  • Embeddings

Data using BLINK:

  • Passage Links

MMEAD provides code that automatically downloads the data and provides it through a database, so you do not have to download it manually.

Specifications:

Format for document links:

{
    "title": [],
    "headings": [],
    "body": 
    [
        {
            "entity_id": 3434750,
            "start_pos": 807,
            "end_pos": 820,
            "entity": "United States", 
            "details": 
            {
                "tag": "LOC",
                "md_score": 0.9995014071464539
            }
        },       
        {
            "entity_id": 3434750,
            "start_pos": 1206,
            "end_pos": 1219,
            "entity": "United States",
            "details": 
            {
                "tag": "LOC",
                "md_score": 0.9995985925197601
            }
        }
    ], 
    "docid": "msmarco_doc_00_0"
}

where:

  • title: Entities found in the title field (none in this example)
  • headings: Entities found in the headings field (none in this example)
  • body: Entities found in the body field (two are shown here; the dataset contains more for this document)
  • docid: Document identifier in the collection

  • An entity is represented as:
    • entity_id: Unique entity identifier, corresponding to the internal Wikipedia identifier
    • start_pos: Start position of the entity mention
    • end_pos: End position of the entity mention
    • entity: Entity label
    • details: Linker-specific information
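To illustrate how a record in this format might be consumed, the sketch below walks over the three text fields of a document record and keeps the entities whose `md_score` reaches a threshold. It uses only the standard library and the example record from this section. The names `iter_entities` and `FIELDS` are hypothetical, and keeping entities that have no `md_score` is a choice made for this sketch, since `details` is linker specific:

```python
# The document record shown in this section.
doc = {
    "title": [],
    "headings": [],
    "body": [
        {"entity_id": 3434750, "start_pos": 807, "end_pos": 820,
         "entity": "United States",
         "details": {"tag": "LOC", "md_score": 0.9995014071464539}},
        {"entity_id": 3434750, "start_pos": 1206, "end_pos": 1219,
         "entity": "United States",
         "details": {"tag": "LOC", "md_score": 0.9995985925197601}},
    ],
    "docid": "msmarco_doc_00_0",
}

FIELDS = ("title", "headings", "body")


def iter_entities(record, min_score=0.0):
    """Yield (field, entity) pairs whose md_score is at least min_score.

    Entities without an md_score in "details" are kept.
    """
    for field in FIELDS:
        for ent in record.get(field, []):
            score = ent.get("details", {}).get("md_score")
            if score is None or score >= min_score:
                yield field, ent


mentions = list(iter_entities(doc, min_score=0.9))
unique_ids = {ent["entity_id"] for _, ent in mentions}
print(len(mentions), unique_ids)
```

Both mentions in the example resolve to the same `entity_id`, so collecting ids into a set gives the distinct entities of the document.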

Format for passage links:

{
    "passage": 
    [
        {
            "entity_id": 965751,
            "start_pos": 181,
            "end_pos": 187,
            "entity": "BMW M3",
            "details": 
            {
                "tag": "MISC",
                "md_score": 0.6411977410316467
            }
        },
        {
            "entity_id": 221005,
            "start_pos": 241,
            "end_pos": 253,
            "entity": "Chevrolet Corvette",
            "details": 
            {
                "tag": "MISC",
                "md_score": 0.8472966551780701
            }
        }
    ],
    "pid": "msmarco_passage_00_587"
}

where:

  • pid: passage identifier corresponding to the passage id in the passage collection
  • passage: list of entities found in the passage

Entities use the same fields as in the document format above.
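The offsets can be used to recover the mention text from a passage. The sketch below assumes that `start_pos` and `end_pos` are character offsets with an exclusive end, which is consistent with the "BMW M3" example (187 - 181 = 6 characters) but is not stated by the specification. The passage text, the `pid` value and the helper `mention_spans` are made up for illustration:

```python
# Toy passage and record; not taken from the MS MARCO collection.
passage_text = "The festival moved to Montreal in 1980."
record = {
    "passage": [
        {"entity_id": 7954681, "start_pos": 22, "end_pos": 30,
         "entity": "Montreal"},
    ],
    "pid": "toy_0",
}


def mention_spans(text, rec):
    """Return (mention_text, entity_label) pairs for a passage record.

    Assumes character offsets with an exclusive end position.
    """
    pairs = []
    for ent in rec["passage"]:
        # int() also covers records whose offsets are stored as strings.
        start, end = int(ent["start_pos"]), int(ent["end_pos"])
        pairs.append((text[start:end], ent["entity"]))
    return pairs


print(mention_spans(passage_text, record))
```

The mention text and the entity label need not be identical: in the passage example above, the span linked to "Chevrolet Corvette" is 12 characters long, shorter than the label itself.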