InvoiceDoppelganger is a tool for comparing invoices and finding similarities between them. It combines text processing and image analysis to identify duplicate or closely related invoices.
It is a program that takes an input invoice as a PDF and compares it against a database of existing invoices, based on content and similarity.
- Structure or Style of Tables in Invoices
- PDF Metadata
- Invoice Number
- PDF Name
- Image Similarity
- Cosine Similarity of all Metrics
- Highly Accurate
- Creation of Models to improve Performance
Extract Text from PDF using PyPDF2
Features extracted: text, metadata, table styles from HTML, invoice number, date, PDF name
Analyze layout and structure using table styles
Add the training data to the database (a list of feature vectors)
- PyPDF2
- pdfminer
- sklearn
- re
- io
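As a minimal sketch of the invoice number and date extraction step, the snippet below applies `re` to text that is assumed to have already been extracted from the PDF (for example with PyPDF2). The patterns, the `extract_fields` helper, and the returned dictionary keys are hypothetical illustrations, not the project's actual code; real invoices vary widely and will likely need different patterns.

```python
import re

# Hypothetical patterns (assumptions): matches e.g. "Invoice No: INV-2023-001".
INVOICE_NO_RE = re.compile(
    r"invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9][A-Z0-9\-/]*)",
    re.IGNORECASE,
)
# Matches dates such as 12/03/2023, 12-03-23 or 2023-03-12.
DATE_RE = re.compile(r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|\d{4}-\d{2}-\d{2})\b")


def extract_fields(text: str) -> dict:
    """Pull an invoice number and a date out of already-extracted PDF text."""
    number = INVOICE_NO_RE.search(text)
    date = DATE_RE.search(text)
    return {
        "invoice_number": number.group(1) if number else None,
        "date": date.group(1) if date else None,
    }


sample = "ACME Ltd\nInvoice No: INV-2023-001\nDate: 12/03/2023\nTotal: 100"
fields = extract_fields(sample)
print(fields)
```

Fields that cannot be found are returned as `None`, so a missing invoice number does not stop the remaining features from being extracted.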
Compute the cosine similarity between the two extracted feature vectors
Compute the image similarity by converting each PDF into an image and comparing the images
Combine both similarities and return the result
- sklearn
- imagehash
- numpy
- pdf2image
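The three similarity steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's implementation: the feature vectors are assumed to be numeric lists, the perceptual hashes are assumed to be available as 64-bit integers (imagehash produces hash objects that can be compared by Hamming distance in the same spirit), and the equal weighting in `combined_similarity` is a hypothetical choice, since the source does not say how the two scores are combined. In the project itself, sklearn's `cosine_similarity` would take the place of the hand-written function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_a * norm_b)


def hash_similarity(hash_a: int, hash_b: int, bits: int = 64) -> float:
    """Similarity in [0, 1] from the Hamming distance of two integer hashes."""
    distance = bin(hash_a ^ hash_b).count("1")
    return 1.0 - distance / bits


def combined_similarity(feature_sim, image_sim, feature_weight=0.5):
    """Weighted average of both scores; the weight is an assumption."""
    return feature_weight * feature_sim + (1.0 - feature_weight) * image_sim


feature_sim = cosine_similarity([1.0, 2.0, 0.0], [1.0, 2.0, 0.0])
image_sim = hash_similarity(0b0000, 0b1111, bits=8)
print(combined_similarity(feature_sim, image_sim))
```

A weighted average keeps the combined score in the same 0 to 1 range as its inputs, which matches the range the tool reports.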
- Compare the invoice with each training data and get the most similar invoice and return the similarity
- Create A Frontend Using Streamlit
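The comparison step, in which the input invoice is scored against every training invoice and the best match is returned, might look like the sketch below. The `most_similar` helper, the dictionary-based database, and the file names are hypothetical; only the feature-vector cosine score is used here to keep the example short.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, database):
    """Return (name, score) of the stored vector closest to the query.

    `database` maps an invoice name to its feature vector (an assumed layout).
    Returns (None, 0.0) when the database is empty.
    """
    best_name, best_score = None, 0.0
    for name, vector in database.items():
        score = cosine_similarity(query, vector)
        if best_name is None or score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


# Hypothetical training data: three invoices with three-dimensional features.
training = {
    "invoice_a.pdf": [1.0, 0.0, 0.0],
    "invoice_b.pdf": [0.6, 0.8, 0.0],
    "invoice_c.pdf": [0.0, 1.0, 0.0],
}
name, score = most_similar([1.0, 0.0, 0.0], training)
print(name, round(score, 3))
```

A linear scan like this is simple and adequate for a small training set; a larger database would probably call for a vectorised or indexed search instead.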
First, clone the repository to your local machine using git
git clone
Install the required dependencies using
pip3 install -r requirements.txt
You are now ready to run the program
Before running, make sure to add the test data (test PDF invoices) and the training data to their respective folders
There are two ways of running this program:
Run the '' through streamlit
streamlit run
Select the invoice you want to predict on.
Click the 'Find Most Similar Invoice' button
You will get the most similar invoice along with a similarity score from 0 to 1
Run the '' using
It will print the input invoice, the most similar invoice, and the similarity score to the shell