Mathematical formulae are a significant part of scientific documents (books, articles, web pages, etc.) in science, technology, engineering, and mathematics (STEM). Most current information retrieval approaches do not consider mathematical formulae, even though they are very common in STEM texts. Because formulae carry a lot of important information, they should not be ignored when analyzing and comparing documents. Currently, no large labeled dataset of mathematical formulae annotated with their semantics is available for training machine learning models. AnnoMathTeX offers a first approach to facilitating the annotation of mathematical formulae in STEM documents. It recommends names for formulae and their constituent identifiers (characters/symbols, e.g. constants and variables) to the user who is annotating the document, and thus enables the creation of a labeled dataset.
Identifiers are the symbols contained in a mathematical formula, each of which carries a meaning. For example, the identifier E means "energy" in the formula E=mc^2.
The concept of a formula is the name or meaning (semantics) that can be associated with it. For example, a possible concept name annotation for the formula E=mc^2 would be "mass-energy equivalence".
AnnoMathTeX is a standalone, web-based LaTeX text and formula annotation recommendation tool for STEM documents, implemented with the Python framework Django. It allows users to annotate the identifiers contained in mathematical formulae, as well as entire formulae contained in a document, with concept names selected from a list of recommendations.
The recommendations for formula and identifier concept names are taken from the following sources:
- arXiv: A list of names for all lower- and upper-case Latin and Greek letter identifiers, extracted from the text surrounding the identifiers in the arXiv corpus and ranked by frequency of appearance.
- Wikipedia: A list of identifier names for all letters, extracted from the surrounding text in mathematical Wikipedia articles and ranked by frequency of appearance.
- Wikidata: A SPARQL query to the Wikidata Query Service API retrieves a list of matching Wikidata items.
- Word Window: Nouns and proper nouns from the text of the annotated document surrounding the formula. The idea is that the text surrounding a formula often explains the formula and its parts. Consider this example from the Wikipedia article on mass-energy equivalence:
"In physics, mass–energy equivalence states that anything having mass has an equivalent amount of energy and vice versa, with these fundamental quantities directly relating to one another by Albert Einstein's famous formula:
E=mc^2"
The sentence directly preceding the formula contains the word "mass", which corresponds to the identifier "m", and the word "energy", which corresponds to the identifier "E". Furthermore, "mass–energy equivalence" describes the meaning of the entire formula.
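The word-window idea can be illustrated with a minimal Python sketch. This is not the AnnoMathTeX implementation: the function name `word_window_candidates`, the `window_size` parameter, and the small `STOP_WORDS` set are hypothetical, and a stop-word filter stands in for the noun/proper-noun detection described above, which would normally need a part-of-speech tagger.

```python
import re

# Hypothetical stop-word list standing in for real noun detection (assumption).
STOP_WORDS = {
    "in", "that", "has", "an", "of", "and", "with", "these", "to", "one",
    "by", "the", "a", "is", "having", "anything", "another", "vice",
    "versa", "directly",
}


def word_window_candidates(preceding_text, window_size=40):
    """Return candidate names from the words preceding a formula.

    Candidates are ordered by proximity to the formula (closest first)
    and de-duplicated.
    """
    tokens = re.findall(r"[A-Za-z]+", preceding_text.lower())
    window = tokens[-window_size:]
    candidates = []
    for token in reversed(window):  # walk backwards from the formula
        if len(token) > 1 and token not in STOP_WORDS and token not in candidates:
            candidates.append(token)
    return candidates


sentence = (
    "In physics, mass–energy equivalence states that anything having mass "
    "has an equivalent amount of energy and vice versa, with these "
    "fundamental quantities directly relating to one another by "
    "Albert Einstein's famous formula:"
)
candidates = word_window_candidates(sentence)
print(candidates)
```

For the example sentence, the resulting list contains both "mass" and "energy", so either could be offered as a recommendation for "m" or "E". Ordering by distance to the formula is one plausible ranking; the actual tool may rank differently.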
The system is hosted by Wikimedia at http://annomathtex.wmflabs.org/.
If you want to run the system locally, these instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
Python version >=3.6 is recommended.
Clone or download the repository. In your shell navigate to the folder AnnoMathTeX and create & activate a new virtual environment. Then run the command
pip install -r requirements.txt
In a terminal navigate to the folder where the manage.py file is located (AnnoMathTeX/annomathtex) and run the command
python manage.py runserver
Open a browser window and navigate to localhost:8000.
Select the file that you would like to annotate with the file browser.
After selecting and uploading the file you will see the processed and rendered document in your browser window. You can now start annotating. Mathematical environments are enclosed with highlighted dollar signs, and the identifiers are highlighted. All other characters that are not to be annotated in the mathematical environment are coloured in grey.
To annotate an identifier, simply click on the highlighted character (e.g. "E") in the document and you will see a pop-up with a table of recommendations. To select one of the suggested recommendations, click on the matching cell, and it will be highlighted (along with all other matching cells from different sources). The annotated identifier will be highlighted in green, and a table holding all the annotations that have been made is constructed at the top of the document. If you unselect/cancel annotations. If none of the recommendations match, you can manually enter a name.
Two types of annotation are possible: global and local.
By default, the annotation mode is set to global. This means that if you annotate, e.g., the identifier E with "energy", all occurrences of this identifier in the document will automatically receive this annotation.
To annotate an identifier locally (meaning that only this occurrence of the identifier will be annotated), select the "local" option at the top of the table.
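The difference between the two modes can be sketched as a small data structure. The class `AnnotationStore` and its methods are hypothetical and only illustrate the behaviour described above; the rule that a local annotation takes precedence over a global one for the same occurrence is an assumption, not something this README specifies.

```python
class AnnotationStore:
    """Hypothetical in-memory store for global and local annotations."""

    def __init__(self):
        self.global_names = {}  # symbol -> name
        self.local_names = {}   # (symbol, occurrence index) -> name

    def annotate(self, symbol, name, occurrence=None):
        """Annotate globally by default, or locally if an occurrence is given."""
        if occurrence is None:
            self.global_names[symbol] = name
        else:
            self.local_names[(symbol, occurrence)] = name

    def lookup(self, symbol, occurrence):
        """Return the name for one occurrence of a symbol, or None."""
        # Assumption: a local annotation overrides a global one.
        local = self.local_names.get((symbol, occurrence))
        if local is not None:
            return local
        return self.global_names.get(symbol)


store = AnnotationStore()
store.annotate("E", "energy")                   # global: applies to every E
store.annotate("f", "function", occurrence=2)   # local: only the occurrence with index 2
```

With this sketch, every occurrence of "E" resolves to "energy", while only the occurrence of "f" with index 2 resolves to "function".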
To save the annotations, simply click the "save" button at the top left of the page. This writes the annotations to a JSON file and creates a CSV file containing an evaluation table that compares the performance of the different sources.
If you open the same file again at a later point in time, the annotations you made previously will be reloaded and you can continue right where you left off.
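Saving and reloading annotations could look roughly like the following sketch. The JSON layout (a "global" and a "local" section) and the function names `save_annotations` and `load_annotations` are assumptions made for illustration; the file format AnnoMathTeX actually writes may differ.

```python
import json
from pathlib import Path


def save_annotations(annotations, path):
    """Write the annotations dictionary to a JSON file."""
    text = json.dumps(annotations, indent=2, ensure_ascii=False)
    Path(path).write_text(text, encoding="utf-8")


def load_annotations(path):
    """Load earlier annotations, or start empty if none were saved yet."""
    file_path = Path(path)
    if not file_path.exists():
        return {"global": {}, "local": {}}
    return json.loads(file_path.read_text(encoding="utf-8"))


# Hypothetical layout: JSON object keys must be strings, so the
# occurrence index of a local annotation is stored as a string.
example = {
    "global": {"E": "energy"},
    "local": {"f": {"2": "function"}},
}
```

Calling `load_annotations` when a document is opened and `save_annotations` when the "save" button is pressed would give the resume-where-you-left-off behaviour described above.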
For each file, an evaluation table of the following format is constructed.
Identifier | Name | arXiv | Wikipedia | Wikidata | WordWindow | Type |
---|---|---|---|---|---|---|
X | variable | - | 6 | - | 1 | global |
p | manual insertion | - | - | - | - | global |
f | function | 2 | - | - | - | local |
The identifier X was annotated globally with "variable", which was found in the recommendations from the Wikipedia list and from the word window (positions 6 and 1 in the respective columns). For the identifier p, no matches were found; it was annotated with a manual insertion. The identifier "f" was annotated locally with "function", which was found in the recommendations from the arXiv list at position 2.
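A row of such a table could be computed as in the following sketch. The function `evaluation_row` and the shape of the `recommendations` dictionary are hypothetical, the recommendation lists are made-up illustration data, and positions are assumed to be 1-based.

```python
SOURCES = ("arXiv", "Wikipedia", "Wikidata", "WordWindow")


def evaluation_row(identifier, chosen_name, recommendations, annotation_type):
    """Build one evaluation row for an annotated identifier.

    `recommendations` maps a source name to its ranked list of suggested
    names. For each source, the row holds the 1-based position of the
    chosen name, or "-" if the source did not suggest it.
    """
    row = {"Identifier": identifier, "Name": chosen_name}
    for source in SOURCES:
        names = recommendations.get(source, [])
        if chosen_name in names:
            row[source] = str(names.index(chosen_name) + 1)
        else:
            row[source] = "-"
    row["Type"] = annotation_type
    return row


# Made-up recommendation lists that reproduce the first row of the table.
recommendations_for_x = {
    "Wikipedia": ["set", "matrix", "space", "value", "point", "variable"],
    "WordWindow": ["variable", "distribution"],
}
row = evaluation_row("X", "variable", recommendations_for_x, "global")
print(" | ".join(row.values()))
```

A lower position means the source ranked the chosen name higher, so comparing the columns across many rows indicates which source gives the most useful recommendations.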
This project is licensed under the Apache License 2.0.
- Ian Mackerracher
- Philipp Scharpf
See also the list of contributors who participated in this project.
We thank Wikimedia for hosting our web-based system.