Produce biographical information from a corpus of biographical notes

The aim of this project is to extract biographical information from the biographical notes published on the MacTutor website and, at the same time, to experiment with different NLP approaches to achieving this goal.

These texts are published under a Creative Commons CC BY-SA 4.0 license (cf. the Copyright Information on the original website) and can therefore be used for the present project.

The first aim is to identify named entities and link them to LOD resources such as DBpedia and Wikidata.

The second is to retrieve the temporal relationships and biographical information expressed in the texts in the form of relations among entities, and to store them as Linked Open Data using the SDHSS ontology ecosystem.
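
As a minimal illustration of the first step, spaCy's named-entity recognition already yields the mentions to be linked (en_core_web_sm is assumed here; the notebooks may use another model):

```python
# Minimal sketch of the first step: named-entity recognition with spaCy.
# Assumes en_core_web_sm is installed (python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Emmy Noether studied at the University of Erlangen "
          "and later worked in Göttingen with David Hilbert.")

for ent in doc.ents:
    # Each entity is a candidate for linking to a DBpedia or Wikidata URI.
    print(ent.text, ent.label_)
```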

Data acquisition, transformation, exploration

maths_explore.ipynb

Explore the chronological list of mathematicians and prepare data acquisition
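
A hypothetical sketch of the acquisition step; the index URL and the link filter are assumptions and not necessarily what the notebook does:

```python
# Collect links to individual biography pages from a MacTutor index page.
import requests
from bs4 import BeautifulSoup

INDEX_URL = "https://mathshistory.st-andrews.ac.uk/Biographies/"  # assumed entry point

resp = requests.get(INDEX_URL, timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

# Keep only links that point to biography pages.
biography_links = sorted({a["href"] for a in soup.find_all("a", href=True)
                          if "Biographies" in a["href"]})
print(len(biography_links), "candidate biography links")
```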

maths_import.ipynb

Import the texts into a PostgreSQL database

Then produce valid XML in order to be able to operate on the different parts and tags.
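
A sketch of the import step with psycopg2; the table and column names are illustrative, the actual schema lives in the notebook:

```python
import psycopg2

conn = psycopg2.connect(dbname="mathshistory", user="postgres", host="localhost")
cur = conn.cursor()
cur.execute(
    """
    CREATE TABLE IF NOT EXISTS biographies (
        id   serial PRIMARY KEY,
        slug text UNIQUE,
        raw  text,
        xml  xml
    )
    """
)

slug, body = "Noether_Emmy", "<p>Emmy Noether was ...</p>"
# Wrapping the cleaned text in a single root element yields well-formed XML,
# so the individual parts and tags can later be addressed (e.g. with xpath()).
xml_doc = f"<biography>{body}</biography>"
cur.execute(
    "INSERT INTO biographies (slug, raw, xml) VALUES (%s, %s, %s) ON CONFLICT (slug) DO NOTHING",
    (slug, body, xml_doc),
)
conn.commit()
conn.close()
```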

explore_db_texts.ipynb

Explore the imported textual data: length, distribution, etc.

db_produce_summaries.ipynb

Extract summaries with a view to experimenting with topic modeling and clustering
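
A minimal sketch of what such an experiment can look like (toy summaries, arbitrary number of clusters):

```python
# Cluster the extracted summaries with TF-IDF and k-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

summaries = [
    "Worked on number theory and elliptic curves.",
    "Pioneer of abstract algebra and ring theory.",
    "Contributions to probability and mathematical statistics.",
]

X = TfidfVectorizer(stop_words="english").fit_transform(summaries)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```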

get_persons_uris_dbpedia.ipynb

Link the existing persons to DBpedia and retrieve their URIs
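
For illustration, a person's URI can be retrieved from the public DBpedia SPARQL endpoint (a sketch; the notebook may use a different lookup strategy):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    SELECT ?person WHERE {
        ?person rdfs:label "Emmy Noether"@en ;
                a dbo:Person .
    }
    LIMIT 1
""")

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["person"]["value"])  # e.g. http://dbpedia.org/resource/Emmy_Noether
```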


spaCy and the spaCy Universe plugins

Explore the functionality of the main library and its many extensions

spacy_explore.ipynb

NLP treatment with spaCy, with the results stored in dedicated tables of the database (to be improved by adding vectors)
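
A sketch of the kind of annotations produced and persisted (the actual table layout is defined in the notebook):

```python
# Run the spaCy pipeline and collect token-level annotations for storage.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Sofia Kovalevskaya studied under Karl Weierstrass in Berlin.")

rows = [
    {"text": t.text, "lemma": t.lemma_, "pos": t.pos_, "dep": t.dep_, "head": t.head.text}
    for t in doc
]
# Each dict becomes one row of a token table; token.vector / doc.vector would
# be the natural addition mentioned above.
print(rows[:3])
```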

coreference_resolver_neuralcoref.ipynb

Tested and not adopted

spacy_coreference_resolver_spacy.ipynb

This notebook explores spaCy's own coreference resolver.

coreference_resolver_coreferee_crossLingual.ipynb


Proof of Concept

db_produce_spacy_model.ipynb

Create a data model using spaCy and store the result in a PostgreSQL database

db_add_coreferee_resolved_texts.ipynb

Add the coreference-resolved texts produced with Coreferee to the database
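
A rough sketch of how Coreferee can produce such a resolved version (assuming en_core_web_lg and a Coreferee release compatible with the installed spaCy):

```python
import spacy
import coreferee  # noqa: F401  (registers the "coreferee" pipeline component)

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("coreferee")

doc = nlp("Hilbert was born in Königsberg. He studied at its university.")

# Replace each anaphoric token by its resolved mention(s) to build the
# coreference-resolved text stored in the database.
resolved = []
for token in doc:
    mentions = doc._.coref_chains.resolve(token)
    resolved.append(" ".join(t.text for t in mentions) if mentions else token.text)
print(" ".join(resolved))
```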

get_persons_uris_wikidata.ipynb

Link named entities to Wikidata using spaCy plugins
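
For illustration, candidate QIDs can also be retrieved directly from the Wikidata API; this sketch is an alternative to the spaCy plugins used in the notebook:

```python
import requests

def wikidata_candidates(label, lang="en", limit=3):
    """Return (QID, description) candidates for an entity label."""
    params = {
        "action": "wbsearchentities",
        "search": label,
        "language": lang,
        "format": "json",
        "limit": limit,
    }
    resp = requests.get("https://www.wikidata.org/w/api.php", params=params, timeout=30)
    resp.raise_for_status()
    return [(hit["id"], hit.get("description", "")) for hit in resp.json()["search"]]

print(wikidata_candidates("Emmy Noether"))
```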

explore_db_cooccurrences_analysis.ipynb

First exploration of frequent term co-occurrences (to be improved)
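
The basic idea, counting co-occurrences of terms within a sentence window, looks roughly like this (toy data):

```python
from collections import Counter
from itertools import combinations

sentences = [
    ["study", "university", "göttingen"],
    ["professor", "university", "berlin"],
    ["study", "university", "berlin"],
]

cooc = Counter()
for tokens in sentences:
    # Count each unordered pair of distinct terms once per sentence.
    for a, b in combinations(sorted(set(tokens)), 2):
        cooc[(a, b)] += 1

print(cooc.most_common(3))
```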

explore_db_entities_relationships.ipynb

Basic exploration of the NLP features in order to leverage them for the extraction of relationships between entities

explore_db_named_entities_and_verbs.ipynb

More specific analysis of named entity and verb frequencies, and of the semantic structure of specific relationships, with a focus on the pattern "study at University of ..."
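
One possible way to capture this pattern is spaCy's rule-based Matcher; this sketch illustrates the idea, not the notebook's actual implementation:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# "studied at (the) <ORG entity>"
pattern = [
    {"LEMMA": "study"},
    {"LOWER": "at"},
    {"LOWER": "the", "OP": "?"},
    {"ENT_TYPE": "ORG", "OP": "+"},
]
matcher.add("STUDY_AT", [pattern])

doc = nlp("He studied at the University of Göttingen before moving to Zurich.")
for _, start, end in matcher(doc):
    print(doc[start:end].text)
```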

get_uris.ipynb

Link the main persons to DBpedia URIs

explore_db_nlp_vectors.ipynb

Explore queries using vector similarities and distances (PostgreSQL extension pgvector)
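
A sketch of such a query through psycopg2, assuming a biographies table with an embedding column of type vector:

```python
import psycopg2

conn = psycopg2.connect(dbname="mathshistory", user="postgres", host="localhost")
cur = conn.cursor()

# In practice this would be the query document's full-length vector.
query_vec = "[0.12, -0.03, 0.51]"
cur.execute(
    """
    SELECT slug, embedding <=> %s::vector AS cosine_distance
    FROM biographies
    ORDER BY cosine_distance
    LIMIT 5
    """,
    (query_vec,),
)
for slug, dist in cur.fetchall():
    print(slug, round(dist, 3))
```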


Results


explore_db_relation_extraction_synctactic_dependencies.ipynb

Initial results are promising, but the diversity of linguistic expressions for the same semantic content requires the construction of overly complex algorithms. Other methods, e.g. using LLMs, should be tried out first.
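
The approach can be illustrated by a simple subject-verb-object walk over the dependency parse (a sketch, not the notebook's algorithm):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Noether taught at Bryn Mawr College. She influenced van der Waerden.")

triples = []
for token in doc:
    if token.pos_ == "VERB":
        subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
        objects = [c for c in token.children if c.dep_ in ("dobj", "obj")]
        # Objects introduced by a preposition, e.g. "taught at ... College".
        objects += [g for c in token.children if c.dep_ == "prep"
                    for g in c.children if g.dep_ == "pobj"]
        for s in subjects:
            for o in objects:
                triples.append((s.text, token.lemma_, o.text))

print(triples)
```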


spacy_openai_relation_extraction.ipynb

Two ways of using the OpenAI API for information extraction were tested:

  • produce sentences, then apply a spaCy model and extract relationships
  • use ChatGPT to extract triples (and thus relationships) directly (see the sketch below)

In both cases the results are not yet satisfactory and new approaches need to be sought, either by creating a paid OpenAI account or by using Hugging Face models.
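
A minimal sketch of the second approach; the model name and the JSON output format are assumptions, not necessarily what the notebook uses:

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

text = "Emmy Noether studied at the University of Erlangen under Paul Gordan."
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",
         "content": "Extract (subject, predicate, object) triples from the text. "
                    "Answer with a JSON list of 3-element lists, nothing else."},
        {"role": "user", "content": text},
    ],
    temperature=0,
)
print(response.choices[0].message.content)
```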
