
kgembeddings-binarization

Repository of the thesis Binarization of Knowledge Graph Embeddings with Semantic Preservation, submitted by Vitor Faria de Souza to the Data and Web Science Group as the final requirement for the Master's degree in Data Science at the University of Mannheim. The thesis was supervised by Prof. Heiko Paulheim between August 2023 and January 2024.

About this thesis (Introduction)

Knowledge graphs (KGs) have emerged as powerful tools for representing real-world information from various domains in a format that is both human-readable and machine-readable. A knowledge graph is a directed graph consisting of nodes representing entities and edges representing relationships between these entities. This graph-based representation allows the integration of various data sources and facilitates the exploration of connections and patterns within the data. Knowledge graphs find applications in diverse domains such as semantic search, recommendation systems, question answering, and data integration. As symbolic representations of knowledge, their applicability can be expanded when they are converted to numerical representations, and this is where knowledge graph embeddings (KGEs) play a crucial role. These dense and low-dimensional vector representations enable computational models to calculate distances and similarities and to make predictions based on underlying patterns in the KG's triples, facilitating the application of machine learning algorithms for tasks such as entity classification, entity clustering, and link prediction.
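As a minimal illustration of what a numerical representation buys, the toy sketch below computes cosine similarities between made-up entity vectors; the entity names and 4-dimensional values are purely illustrative, not taken from the thesis:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

embeddings = {
    "dbr:Berlin":  np.array([0.8, 0.1, 0.3, 0.5]),
    "dbr:Hamburg": np.array([0.7, 0.2, 0.4, 0.4]),
    "dbr:Banana":  np.array([0.1, 0.9, 0.0, 0.2]),
}

# Related entities should end up closer than unrelated ones.
print(cosine_similarity(embeddings["dbr:Berlin"], embeddings["dbr:Hamburg"]))
print(cosine_similarity(embeddings["dbr:Berlin"], embeddings["dbr:Banana"]))
```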

However, a significant drawback of traditional KGEs is their ever-growing size: as knowledge graphs expand to incorporate more entities and relationships, the size of KGEs increases proportionally, posing challenges for storage, computation, and scalability. As knowledge graphs continue to expand, more compact and scalable representations of KGEs become imperative. Binarization techniques offer a promising approach to these challenges by compressing embeddings into binary vectors while preserving their semantic information, as previous work has already achieved for word embeddings. By reducing the storage footprint and computational complexity of KGEs, binarization enables efficient processing and analysis of large-scale knowledge graphs without losing their applicability.

Problem Statement

Traditional embeddings are often represented as multi-dimensional floating-point vectors, which can result in large file sizes. The need for more compact representations that maintain semantic information is evident, especially in applications with limited storage capacity or real-time processing requirements. The challenge is to explore binarization techniques that significantly reduce the storage footprint while preserving the semantic richness of the original embeddings in downstream tasks.

However, compact representations are subject to the same trade-off observed in dimensionality reduction techniques: the more compact they become, the more information is likely to be lost. Therefore, semantic preservation should be analyzed on both sides: binarized embeddings should drastically reduce file sizes and the computational complexity of operations while still maintaining competitive scores in performance metrics, such as classification accuracy, across various tasks and datasets. In the best case, the loss in these scores when using binarized embeddings instead of their original versions should not be statistically significant. Thus, the problem statement revolves around finding binarization techniques that strike a balance between storage efficiency and semantic preservation of KGEs, addressing the challenges posed by the ever-growing size of knowledge graphs and the diverse requirements of knowledge graph-based applications.
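As an illustration of that best case, a paired significance test over per-dataset scores is one way to check whether an observed drop is meaningful. The sketch below uses scipy.stats.ttest_rel with invented placeholder scores; the statistical methodology actually used in the thesis may differ.

```python
# Hedged sketch: checking whether the accuracy drop after binarization is
# statistically significant via a paired t-test over datasets.
# The score lists are invented placeholders, not results from the thesis.
from scipy.stats import ttest_rel

continuous_acc = [0.81, 0.76, 0.88, 0.79, 0.84]  # original embeddings
binary_acc     = [0.79, 0.75, 0.86, 0.78, 0.83]  # binarized embeddings

t_stat, p_value = ttest_rel(continuous_acc, binary_acc)
print(f"p = {p_value:.3f}")  # p > 0.05 would mean the loss is not significant
```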

Pursued Approach

To address the challenges outlined above, this research pursues a comprehensive approach to investigating binarization techniques for knowledge graph embeddings. Various binarization methods were evaluated with the established evaluation frameworks GEval and DLCC on a variety of tasks and datasets, with the aim of systematically comparing and analyzing the semantic preservation of KGEs in different scenarios. The approach involves experimenting with a naive binarization method as a baseline, which simply binarizes vectors based on averages, and a more advanced technique based on the autoencoding architecture proposed by Tissier et al. for binarizing word embeddings.
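A minimal sketch of what such an average-based baseline could look like, assuming each dimension is thresholded at its mean over all entities (the exact rule used in the thesis may differ):

```python
import numpy as np

def naive_binarize(vectors: np.ndarray) -> np.ndarray:
    """Map float vectors (n_entities x n_dims) to {0, 1} per dimension."""
    thresholds = vectors.mean(axis=0)        # one average per dimension
    return (vectors > thresholds).astype(np.uint8)

# Toy usage with random 200-dimensional "embeddings"
rng = np.random.default_rng(0)
codes = naive_binarize(rng.normal(size=(1000, 200)))
print(codes.shape, codes.dtype)  # (1000, 200) uint8
```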

Furthermore, this research emphasizes the importance of comparing different types of KGEs, including models such as RDF2Vec, TransE, ComplEx, DistMult, RESCAL, RotatE, and TransR. By evaluating the performance of binarization techniques across a diverse set of embeddings, this research aims to identify the most effective approaches to preserving semantic information while reducing the storage footprint, and it also examines how dataset size and task complexity influence the effectiveness of binarization. Thus, the pursued approach seeks not only to provide insight into optimizing storage and computation without compromising the semantic richness of KGEs, but also to interpret what information is lost when applying such processes.

The findings of this research have practical implications for applications that rely on knowledge graph embeddings. By identifying effective binarization techniques and understanding their impact on diverse tasks and datasets, this research contributes to the development of more efficient and scalable solutions for handling large-scale knowledge graphs. The exploration of autoencoding methods for near-lossless binarization opens avenues for optimizing storage and computation without compromising semantic information.

Key Findings (Conclusion)

Knowledge graph embeddings were converted to binary vectors that are considerably more compact, with file sizes 10 to 50 times smaller, and semantic preservation was achieved in the majority of tasks with binary vectors autoencoded into more dimensions than the originals. However, results varied across tasks, embedding models, and datasets.
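The arithmetic behind such ratios can be sketched as follows, assuming 200-dimensional float32 originals (an illustrative assumption, not a figure quoted from the thesis):

```python
# Bits per entity for a continuous embedding vs. autoencoded binary codes.
dims, float_bits = 200, 32
original_bits = dims * float_bits            # 6400 bits per entity
for code_bits in (128, 256, 512):
    print(f"{code_bits}-bit code: {original_bits / code_bits:.1f}x smaller")
# 128 bits -> 50.0x, 256 bits -> 25.0x, 512 bits -> 12.5x
```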

For GEval, in the classification task, the embeddings autoencoded in 512 bits generally performed well, with only a modest loss in accuracy (3-6%). In particular, certain binary variants even outperformed their original counterparts, showcasing the potential of binary embeddings to preserve semantic information. However, the performance varied between different models and datasets. Similar results were observed in clustering tasks, but with greater variability, probably due to the nature of unsupervised learning tasks, where convergence may not always occur. In contrast, regression tasks revealed that embeddings autoencoded in 128 bits often achieved better performance than their respective continuous versions, indicating that dimensionality reduction can be beneficial in these scenarios.

The document similarity and entity relatedness tasks showed that autoencoded binary embeddings generally achieved higher correlations on their respective datasets, even when compared to the original vectors. This suggests that the autoencoding technique preserves the semantic information required for these tasks well, although further investigation with more datasets should be conducted. However, semantic analogy tasks indicated that while some binary variants could preserve semantic relationships in very particular cases, there was a noticeable loss of precision, suggesting that much of the translational structure of the continuous vectors is lost during binarization.
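The translational property probed by analogy tasks is typically tested with vector offsets ("a is to b as c is to ?"). A sketch with illustrative entity names, not the evaluation code used in the thesis:

```python
import numpy as np

def analogy(a: str, b: str, c: str, embeddings: dict) -> str:
    """Return the entity whose vector is closest (cosine) to b - a + c."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    target = target / np.linalg.norm(target)
    best, best_sim = None, -1.0
    for name, vec in embeddings.items():
        if name in (a, b, c):
            continue  # exclude the query entities themselves
        sim = float(np.dot(target, vec) / np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = name, sim
    return best

# analogy("dbr:Germany", "dbr:Berlin", "dbr:France", embeddings)
# should ideally return "dbr:Paris" for well-trained embeddings.
```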

In the DLCC framework, which focuses on classification tasks, the binary vectors autoencoded in 512 bits consistently provided expressive semantic preservation, with insignificant accuracy losses on half of the datasets, surpassing the other binarization methods. Furthermore, specific class-separation tasks, such as relations to particular individuals, qualified cardinality restrictions, and particular relations to particular individuals, showed reasonable semantic preservation, with accuracy losses often between 0 and 5%. Other class-separation tasks, such as qualified restrictions and cardinality restrictions, suggested that the loss of information in binary embeddings is higher for inverse relations than for direct ones. For qualified cardinality restrictions on outgoing edges in particular, binary embeddings often scored even better than their original versions.

In general, models that typically rank among the top performers in these evaluation frameworks, such as RDF2vecSG and TransE with L2 regularization, also had their performance well preserved, especially when binarized with 512-bit autoencoding. However, increasing the number of dimensions is only beneficial if the downstream task is, or can be, optimized for binary operations; otherwise, these operations might take even longer with binary vectors than with the original, lower-dimensional continuous vectors.
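To illustrate what "optimized for binary operations" means: with bit-packed codes, a distance computation reduces to XOR plus popcount. A sketch, not code from the thesis pipeline:

```python
import numpy as np

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Hamming distance between two bit-packed binary codes (uint8 arrays)."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

rng = np.random.default_rng(0)
x = np.packbits(rng.integers(0, 2, 512))  # a 512-bit code stored in 64 bytes
y = np.packbits(rng.integers(0, 2, 512))
print(hamming(x, y))
```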

Code structure

The main objectives of this repository are the reproducibility of the experiments and the transparency of the results, in order to facilitate future work in similar directions. Its structure consists of the thesis files, plus two folders that reflect these objectives: resources and notebooks.

Towards Reproducibility

To reproduce the experiments conducted in this research, four Jupyter notebooks from the notebooks folder are relevant:

  1. list_goldenset_entities collects all DBpedia entities present in the GEval and DLCC gold standard files, for further entity filtering. The sets of entities from both evaluation frameworks are saved as pickle files, which are available in the resources folder and are used in the subsequent steps;
  2. get_vectors_pipeline downloads the original KGE vector files from KGvec2go and produces the variants used in the experiments. This step is mostly based on the C binarizer from near-lossless-binarization, and the notebook contains instructions on how to run it;
  3. run_dlcc runs the DLCC evaluation framework with the previously obtained embedding variants. This step is based on the original DL-TC-Generator and dl-evaluation-framework repositories, and further details and instructions are contained in the notebook;
  4. run_geval runs the GEval evaluation framework with the previously obtained embedding variants. This step is based on the original Evaluation-Framework repository, and further details and instructions are contained in the notebook.

With the known limitation in mind that binary vectors produced with the autoencoding architecture from near-lossless-binarization are nondeterministic, the compact VEC files were also stored in the resources folder. All notebooks were run with Python 3.8.3, and the versions of the relevant packages are listed in the requirements file.
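For inspecting the stored VEC files, a minimal loader might look like the sketch below, assuming the common word2vec-style text layout of one "<entity> <value> <value> ..." line per vector; the files produced by near-lossless-binarization may use a different encoding, so treat this purely as a starting point:

```python
import numpy as np

def load_vec(path: str) -> dict:
    """Read a word2vec-style text VEC file into {entity: vector}."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) <= 2:  # skip a "<count> <dims>" header line, if any
                continue
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors
```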

Towards Transparency

Besides the thesis files (PDF, LaTeX, and references), the results from both evaluation frameworks are available in the resources folder. The Jupyter notebooks used to analyze them and to produce the graphics and tables in the thesis, analyze_dlcc and analyze_geval, are also in the notebooks folder.
