Entity Embed

Entity Embed allows you to transform entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors (ANN).

Using Entity Embed, you can train a deep learning model to transform records into vectors in an N-dimensional embedding space. Thanks to a contrastive loss, those vectors are organized to keep similar records close and dissimilar records far apart in this embedding space. Embedding records enables scalable ANN search, which means finding thousands of candidate duplicate pairs of records per second per CPU.
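To make the idea concrete, here is a minimal sketch of how candidate pairs fall out of a nearest-neighbor search over embedded records. It rests on several assumptions: the vectors are hand-written stand-ins rather than output of a trained Entity Embed model, the search is exact brute force in pure Python rather than an ANN index, and the function names are hypothetical and not part of the library's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def candidate_pairs(vectors, k, sim_threshold):
    """Brute-force stand-in for an ANN query (hypothetical helper).

    For every record, take its k most similar neighbors and keep the
    pairs whose similarity reaches sim_threshold. Pairs are stored as
    sorted tuples so (a, b) and (b, a) count once.
    """
    pairs = set()
    for id_a, vec_a in vectors.items():
        scored = [
            (cosine_similarity(vec_a, vec_b), id_b)
            for id_b, vec_b in vectors.items()
            if id_b != id_a
        ]
        scored.sort(reverse=True)  # most similar first
        for sim, id_b in scored[:k]:
            if sim >= sim_threshold:
                pairs.add(tuple(sorted((id_a, id_b))))
    return pairs


# Made-up 3-dimensional embeddings keyed by record id.
embeddings = {
    1: [0.90, 0.10, 0.00],  # a record
    2: [0.88, 0.12, 0.01],  # a near-duplicate of record 1
    3: [0.00, 0.20, 0.95],  # an unrelated record
    4: [0.05, 0.25, 0.90],  # a near-duplicate of record 3
}

pairs = candidate_pairs(embeddings, k=2, sim_threshold=0.95)
print(sorted(pairs))  # [(1, 2), (3, 4)]
```

A real ANN index replaces the inner loop over all records with an approximate lookup, which is what makes the search scale to large datasets.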

Entity Embed achieves a recall of ~0.99 with a pair-entity ratio below 100 on a variety of datasets. It aims for high recall at the expense of precision, so this library is suited to the Blocking/Indexing stage of an Entity Resolution pipeline. A scalable and noise-tolerant blocking procedure is often the main bottleneck for performance and quality in Entity Resolution pipelines, and this library aims to solve that. Note that the ANN search on embedded records returns several candidate pairs that must be filtered to find the best matching pairs, possibly with a pairwise classifier (an example of that is available).
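The metrics mentioned above can be sketched as follows. This is an illustration under assumptions, not the library's implementation: the pair-entity ratio is taken here to mean the number of candidate pairs divided by the number of records, and the function names and sample pairs are made up for the example.

```python
def precision_and_recall(found_pairs, true_pairs):
    """Precision and recall of a set of candidate pairs against known matches."""
    true_positives = len(found_pairs & true_pairs)
    precision = true_positives / len(found_pairs) if found_pairs else 0.0
    recall = true_positives / len(true_pairs) if true_pairs else 0.0
    return precision, recall


def pair_entity_ratio(found_pair_count, entity_count):
    """Candidate pairs per record (assumed definition): lower means less
    work for the filtering stage that follows blocking."""
    return found_pair_count / entity_count


# Made-up example: 8 records, 2 true duplicate pairs, 4 candidate pairs.
true_pairs = {(1, 2), (3, 4)}
found_pairs = {(1, 2), (3, 4), (1, 5), (2, 6)}

precision, recall = precision_and_recall(found_pairs, true_pairs)
ratio = pair_entity_ratio(len(found_pairs), entity_count=8)

print(precision, recall, ratio)  # 0.5 1.0 0.5
```

In this toy case every true pair is found (recall 1.0) while half of the candidates are false positives (precision 0.5), which is the trade-off a blocking stage accepts: the downstream pairwise classifier is expected to discard the extra candidates.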

Entity Embed is based on and is a special case of the AutoBlock model described by Amazon.

⚠️ Warning: this project is under heavy development.

Documentation

https://entity-embed.readthedocs.io

Requirements

System

MacOS or Linux (tested on latest MacOS and Ubuntu via GitHub Actions).
Entity Embed can train and run on a powerful laptop. Tested on a system with 32 GB of RAM, an RTX 2070 Mobile (8 GB VRAM), and an i7-10750H (12 threads). With batch sizes smaller than 32 and few field types, it's possible to train and run even with 2 GB of VRAM.

Libraries

Python: >= 3.6
Numpy: >= 1.19.0
PyTorch: >= 1.7.1, < 1.9
PyTorch Lightning: >= 1.1.6, < 1.3
N2: >= 0.1.7, < 1.2

And others, see requirements.txt.

Installation

pip install entity-embed

For Conda users

If you're using Conda, you must install PyTorch beforehand to have proper CUDA support. Inside the Conda environment, please run the following command before installing Entity Embed using pip:

conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c conda-forge

Examples

Run:

pip install -r requirements-examples.txt

Then check the example Jupyter Notebooks:

Deduplication, when you have a single dirty dataset with duplicates: notebooks/Deduplication-Example.ipynb
Record Linkage, when you have multiple clean datasets you need to link: notebooks/Record-Linkage-Example.ipynb
After running notebooks/Record-Linkage-Example.ipynb, check notebooks/End-to-End-Matching-Example.ipynb to learn how to integrate Entity Embed with a pairwise classifier.

Colab

Please check notebooks/google-colab/.

Releases

See CHANGELOG.md.

Credits

This project is maintained by open-source contributors and Vinta Software.

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

Commercial Support

Vinta Software is always looking for exciting work, so if you need any commercial support, feel free to get in touch: [email protected]

References

Zhang, W., Wei, H., Sisman, B., Dong, X. L., Faloutsos, C., & Page, D. (2020, January). AutoBlock: A hands-off blocking framework for entity matching. In Proceedings of the 13th International Conference on Web Search and Data Mining (pp. 744-752).
Dai, X., Yan, X., Zhou, K., Wang, Y., Yang, H., & Cheng, J. (2020, July). Convolutional Embedding for Edit Distance. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 599-608).

Citations

If you use Entity Embed in your research, please consider citing it.

BibTeX entry:

@software{entity-embed,
  title = {{Entity Embed}: Scalable Entity Resolution using Approximate Nearest Neighbors.},
  author = {Juvenal, Flávio and Vieira, Renato},
  url = {https://github.com/vintasoftware/entity-embed},
  version = {0.0.6},
  date = {2021-07-16},
  year = {2021}
}
