InvoiceDoppelganger is a tool designed to compare and find similarities between invoices. It leverages advanced text processing and image analysis techniques to identify duplicate or closely related invoices, ensuring accuracy and efficiency in document management.
It takes an input invoice as a PDF and compares it against a database of existing invoices, scoring each candidate on content and visual similarity.
- Structure or Style of Tables in Invoices
- PDF Metadata
- Invoice Number
- PDF Name
- Image Similarity
- Cosine Similarity of all Metrics
- Highly Accurate
- Creation of Models to improve Performance
- Extract text from the PDF using PyPDF2
- Features extracted: text, metadata, table styles from HTML, invoice number, date, PDF name
- Analyze layout and structure using table styles
- Add the training data to the database (a list of feature vectors)
-
- PyPDF2
- pdfminer
- sklearn
- re
- io
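The extraction steps above could look roughly like the following. This is a minimal sketch, not the project's actual code: the function names `parse_invoice_fields` and `extract_features` and both regular expressions are hypothetical, and it assumes a PyPDF2 version that provides `PdfReader`. Table-style extraction is left out.

```python
import os
import re
from typing import Dict, Optional

# Hypothetical pattern: "Invoice No", "Invoice Number" or "Invoice #" followed by an ID.
INVOICE_NO_RE = re.compile(
    r"invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Za-z0-9][A-Za-z0-9\-/]*)",
    re.IGNORECASE,
)
# Hypothetical pattern: numeric dates such as 12/03/2024 or 2024-03-12.
DATE_RE = re.compile(r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|\d{4}-\d{2}-\d{2})\b")


def parse_invoice_fields(text: str) -> Dict[str, Optional[str]]:
    """Pull the invoice number and the first date out of raw invoice text."""
    number = INVOICE_NO_RE.search(text)
    date = DATE_RE.search(text)
    return {
        "invoice_number": number.group(1) if number else None,
        "date": date.group(1) if date else None,
    }


def extract_features(pdf_path: str) -> Dict[str, object]:
    """Collect text, metadata, file name, invoice number and date from one PDF."""
    # Imported lazily so the parsing helper above works without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() can return an empty result for image-only pages.
    text = "\n".join((page.extract_text() or "") for page in reader.pages)
    features: Dict[str, object] = {
        "pdf_name": os.path.basename(pdf_path),
        "text": text,
        "metadata": dict(reader.metadata or {}),
    }
    features.update(parse_invoice_fields(text))
    return features
```

Real invoices vary widely in how they label the invoice number and date, so the patterns would need tuning against the actual training data.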
- Compute the cosine similarity between the feature vectors extracted from the two invoices
- Compute the image similarity by converting each PDF to an image and comparing the images
- Combine both similarities and return the result
-
- sklearn
- imagehash
- numpy
- pdf2image
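As an illustration of the similarity steps above, here is a dependency-light sketch. The function names, the use of a perceptual hash on the first page only, and the equal weighting of the two scores are all assumptions; the README does not say how the project combines them. The cosine function is written in plain Python for clarity; `sklearn.metrics.pairwise.cosine_similarity` computes the same quantity for matrices.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction; treat as dissimilar
    return dot / (norm_a * norm_b)


def hash_similarity(hamming_distance: int, hash_bits: int = 64) -> float:
    """Map a Hamming distance between two image hashes to a 0..1 score."""
    return 1.0 - hamming_distance / hash_bits


def image_similarity(pdf_a: str, pdf_b: str) -> float:
    """Compare the first pages of two PDFs with a perceptual hash."""
    # Imported lazily; pdf2image also needs the poppler utilities installed.
    import imagehash
    from pdf2image import convert_from_path

    page_a = convert_from_path(pdf_a, first_page=1, last_page=1)[0]
    page_b = convert_from_path(pdf_b, first_page=1, last_page=1)[0]
    hash_a = imagehash.phash(page_a)
    hash_b = imagehash.phash(page_b)
    # Subtracting two ImageHash objects yields their Hamming distance.
    return hash_similarity(hash_a - hash_b, hash_a.hash.size)


def combined_similarity(
    feature_sim: float, image_sim: float, feature_weight: float = 0.5
) -> float:
    """Weighted average of the two scores (equal weights are an assumption)."""
    return feature_weight * feature_sim + (1.0 - feature_weight) * image_sim
```

A weighted average keeps the result in the 0 to 1 range whenever both inputs are in that range, which matches the score range the program reports.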
- Compare the invoice with each training invoice, find the most similar one, and return its similarity score
- Create a frontend using Streamlit
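The comparison step amounts to a linear scan over the stored invoices. The sketch below shows one way to write it; `find_most_similar` and the shape of `database` (a mapping from PDF name to feature vector) are hypothetical, and the scoring function is passed in so any similarity that returns a number can be used.

```python
from typing import Callable, Dict, Sequence, Tuple

Vector = Sequence[float]


def find_most_similar(
    query: Vector,
    database: Dict[str, Vector],
    similarity: Callable[[Vector, Vector], float],
) -> Tuple[str, float]:
    """Return the name and score of the stored invoice closest to the query."""
    if not database:
        raise ValueError("the training database is empty")
    best_name = ""
    best_score = float("-inf")
    for name, vector in database.items():
        score = similarity(query, vector)
        # Keep the first invoice seen with the highest score.
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score
```

A linear scan is adequate for a small training folder; a larger database would likely need an index or batched matrix operations.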
- First, clone the repository to your local machine using git:
git clone https://github.com/definitelynotchirag/Invoice-Doppelganger
- Install the required dependencies:
pip3 install -r requirements.txt
- You are now ready to run the program
- Before running, make sure to add the test data (the PDF invoices to test) and the training data to their respective folders
There are two ways of running this program:
- Run 'frontend.py' through Streamlit:
streamlit run frontend.py
- Select the invoice you want to predict on
- Click the 'Find Most Similar Invoice' button
- You will get the most similar invoice along with a similarity score from 0 to 1
- Run 'main.py' using:
python3 main.py
- It will print the input invoice, the most similar invoice, and the similarity score to the shell