cultural-ai/ContentiousTermsLOD

How Contentious Terms About People and Cultures are Used in Linked Open Data

The repository of the research paper

Online Appendix

Data

  • We reuse the previously developed knowledge graph of contentious terminology;

  • From the knowledge graph, we extract culturally sensitive terms to inspect them in LOD datasets. The process is in the notebook getting_query_terms.ipynb and the resulting file is query_terms.json. There are 75 EN and 82 NL canonical forms of terms, each linked to its inflected forms (for example, "aboriginal" and "aboriginals"). Counting both canonical and inflected forms, there are 154 EN and 242 NL terms;

  • We query terms in four LOD datasets:

    • Wikidata (EN and NL);
    • The Getty Art & Architecture Thesaurus (AAT) (EN and NL);
    • Princeton WordNet (version 3.1) (only EN);
    • Open Dutch WordNet (version 1.3) (only NL);
  • For details on querying each dataset, refer to the Jupyter notebooks in the corresponding directories.
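
The relation between canonical and inflected forms described above can be sketched as follows. This is a minimal illustration only: the dictionary layout (canonical form mapped to a list of inflected forms) and the function name all_query_forms are assumptions, and the actual structure of query_terms.json may differ.

```python
def all_query_forms(terms):
    """Return the sorted, de-duplicated list of canonical and inflected forms.

    `terms` is assumed to map each canonical form to a list of its
    inflected forms (a hypothetical layout, not the confirmed file format).
    """
    forms = set()
    for canonical, inflected in terms.items():
        forms.add(canonical)
        forms.update(inflected)
    return sorted(forms)


# Toy data; "aboriginal"/"aboriginals" is the example given in the README,
# the second entry is a neutral placeholder.
toy_terms = {
    "aboriginal": ["aboriginals"],
    "placeholder": ["placeholders", "placeholder's"],
}

forms = all_query_forms(toy_terms)
print(len(toy_terms), "canonical forms,", len(forms), "forms in total")
```

Applied to the real file, a count of this kind would correspond to the totals reported above (canonical forms versus all forms), provided the file follows the assumed layout.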

Sets constructed for analysis

Set 1: literals of resources from the Words Matter Knowledge Graph (or related matches)

Set 2: all retrieved literals

  • all retrieved literals per dataset are in the corresponding directories, in files with the suffix '_query_results_{lang}.json'; for Wikidata, there are multiple compressed JSON files: (1) initial search results 'gzip_search_results_{lang}.json', (2) claims (for example, P31 or P279) of the retrieved entities 'gzip_results_with_claims_{lang}.json', (3) filtered results used for analysis 'gzip_results_clean_{lang}.json'
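
The compressed Wikidata files can presumably be read with the Python standard library. The sketch below assumes the files are gzip-compressed, UTF-8 encoded JSON, which the 'gzip_' prefix suggests but the README does not state outright; the helper name load_gzip_json is hypothetical.

```python
import gzip
import json


def load_gzip_json(path):
    """Read a gzip-compressed JSON file and return the parsed object."""
    # "rt" opens the compressed stream in text mode so json.load can read it.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


# Example call (the path is an assumption; adjust it to the Wikidata directory):
# results = load_gzip_json("gzip_results_clean_en.json")
```

The structure of the returned object is not described here, so inspect it (for example, with type() and len()) before relying on particular keys.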

Set 3: disambiguated literals

  • the directory samples contains (1) samples for annotation by dataset and language, (2) background information for each term presented to annotators, (3) annotated samples with the prefix "ann_" and the IDs of annotators (1 and 3); the notebook samples.ipynb generates 6 csv files with samples and calculates inter-annotator agreement for each annotated sample; the mean of these agreement scores (0.8) is reported in Section 4.3 of the paper;
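
The README does not name the agreement metric used in samples.ipynb. As an illustration only, the sketch below computes Cohen's kappa for two annotators and then averages per-sample scores; the choice of kappa, the function names, and the toy labels are all assumptions, not the authors' confirmed method.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


def mean_agreement(scores):
    """Arithmetic mean of per-sample agreement scores."""
    return sum(scores) / len(scores)


# Toy labels for one sample (hypothetical data).
annotator_1 = [1, 1, 0, 0]
annotator_3 = [1, 1, 0, 1]
print(cohen_kappa(annotator_1, annotator_3))
```

With one score per annotated sample, mean_agreement would give a single summary value comparable in kind to the mean reported in the paper, assuming the same metric is used.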

Markers of contentiousness

The directory markers contains results for RQ2: whether contentious terms in literals have any markers of their contentiousness and, if so, what these markers are and how they are given. We define two groups of markers: (1) implicit markers, given in the text of literals next to contentious terms, and (2) explicit markers, which are specific properties with URIs.

Other directories and files

  • n_hits contains 36 csv files with the number of hits per term in the three sets, by property values; the code to generate these files is in the notebook n_hits.ipynb;

LODlit module

The LODlit Python module allows querying terms in Wikidata, AAT, PWN, and ODWN. LODlit can be used to both reproduce our research results and retrieve literals from the LOD datasets for other purposes. Read more in the LODlit repository.

Paper footnotes

Citation

Andrei Nesterov, Laura Hollink, and Jacco van Ossenbruggen. 2024. How Contentious Terms About People and Cultures are Used in Linked Open Data. In Proceedings of the ACM Web Conference 2024 (WWW ’24), May 13–17, 2024, Singapore, Singapore. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3589334.3648140

Download BibTeX