About text extraction from scanned invoice pdf #1450
-
Tesseract can struggle with text that is low contrast compared to its background.
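To make "low contrast" concrete, here is a minimal sketch of a linear contrast stretch on 8-bit grayscale values. It is an illustration only: plain Python lists stand in for a page image, and `stretch_contrast` with its `low_pct`/`high_pct` parameters are hypothetical names, not part of OCRmyPDF or Tesseract. Whether a step like this would recover the missing heading on this particular invoice is untested.

```python
def stretch_contrast(pixels, low_pct=2.0, high_pct=98.0):
    """Linearly rescale 8-bit grayscale values so that the chosen
    percentiles map to 0 and 255. `pixels` is a list of rows of ints."""
    flat = sorted(v for row in pixels for v in row)
    if not flat:
        return []

    def percentile(p):
        # Nearest-index percentile lookup on the sorted values.
        idx = int(round((p / 100.0) * (len(flat) - 1)))
        return flat[idx]

    lo, hi = percentile(low_pct), percentile(high_pct)
    if hi <= lo:
        # Uniform image: there is nothing to stretch.
        return [list(row) for row in pixels]

    scale = 255.0 / (hi - lo)
    return [
        [max(0, min(255, int(round((v - lo) * scale)))) for v in row]
        for row in pixels
    ]


# Faint gray "text" (150) on a light background (200): only 50 levels apart.
page = [[200, 200, 150, 200], [200, 150, 150, 200]]
stretched = stretch_contrast(page)
print(stretched)
```

After stretching, the faint pixels sit at 0 and the background at 255, which is the kind of separation an OCR engine's binarization step has an easier time with. On a real scan, the equivalent operation would be applied to the rasterized page with an image library before running OCR.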
-
Hello,
I am having a problem extracting text with OCRmyPDF. It extracts most of the details on the scanned invoice, but not all of the text; some of it is left behind. I have tried many ways to get that text extracted, but none have worked.
This is the code:
import os
import subprocess
import ocrmypdf

# Step 1: Upload PDF
pdf_filename = input("Enter the full path of the PDF file: ")

def convert_to_grayscale(input_pdf, output_pdf):
    """Convert a PDF to grayscale using Ghostscript."""
    try:
        subprocess.run(
            [
                "gswin64c",  # Windows-compatible Ghostscript executable
                "-sDEVICE=pdfwrite", "-dCompatibilityLevel=1.4", "-dNOPAUSE",
                "-dBATCH", "-sOutputFile=" + output_pdf, "-sColorConversionStrategy=Gray",
                "-dProcessColorModel=/DeviceGray", "-dDownsampleGrayImages=false",
                "-dJPEGQ=100", "-dAutoFilterGrayImages=false", "-dGrayImageFilter=/FlateEncode",
                "-dDownsampleColorImages=false", "-dDownsampleGrayImages=false", "-dJPEGQ=95",
                "-dAutoFilterGrayImages=true", "-dUseCIEColor=false", "-dMaxBitmap=50000000",
                "-dCompressFonts=true", "-dEmbedAllFonts=true", "-dSubsetFonts=true",
                "-dUseArtBox=true", input_pdf
            ],
            check=True
        )
        print(f"Grayscale conversion completed: {output_pdf}")
    except subprocess.CalledProcessError as e:
        print(f"Error during Ghostscript grayscale conversion: {e}")

# Step 2: Convert PDF to grayscale
grayscale_pdf_filename = "grayscale_output.pdf"  # Output grayscale PDF file name
print("Converting PDF to grayscale using Ghostscript...")
convert_to_grayscale(pdf_filename, grayscale_pdf_filename)

# Step 3: Add OCR layer using OCRmyPDF
output_pdf_filename = "ocr_output.pdf"  # Output OCR PDF file name
print("Running OCRmyPDF to add OCR layer to the grayscale PDF...")
try:
    ocrmypdf.ocr(
        grayscale_pdf_filename,
        output_pdf_filename,
        # tesseract_config expects paths to Tesseract config files, not
        # command-line flags, so the page segmentation mode and engine mode
        # are passed through their dedicated parameters instead.
        tesseract_pagesegmode=3,
        tesseract_oem=3,
        language=None,  # Default language
        rotate_pages=True,  # Auto-rotate pages
        deskew=True,  # Deskew pages
        image_dpi=300,  # Only used when the input is an image rather than a PDF
        jpg_quality=95,  # JPEG quality
        optimize=3,  # Optimize PDF
        compress_text=False,  # Avoid compressing text
        force_ocr=True,  # Force OCR on all pages
        remove_background=False,  # Preserve background
        clean=True  # Clean pages before OCR; the cleaned image is not kept in the output
    )
    print("OCR processing completed.")
except Exception as e:
    print(f"Error during OCR processing: {e}")

# Step 4: Extract text using pdftotext
output_text_filename = "extracted_text.txt"  # Path for extracted text file
print("Extracting text using pdftotext...")
try:
    # Run pdftotext to extract text
    subprocess.run(
        ["pdftotext", "-layout", "-enc", "UTF-8", output_pdf_filename, output_text_filename],
        check=True
    )
    print(f"Text extraction completed: {output_text_filename}")
except subprocess.CalledProcessError as e:
    print(f"Error during text extraction: {e}")

# Step 5: Display extracted text
if os.path.exists(output_text_filename):
    with open(output_text_filename, 'r', encoding='utf-8') as file:
        extracted_text = file.read()
    print("Extracted Text:")
    print(extracted_text)
else:
    print(f"Error: {output_text_filename} not found.")
It is not extracting the "Faktura VAT" part.
This is the extracted text:
Sprzedawca
PPHU NATHALIE-MEBLE.PL
Natalia Pietrus nr 39/2024/WDTTR
Trebaczéw 69
63-642 Perzéw
NIP: PL 6192020532 Data wystawienia:
Lo.
20.11.2024
Data dostawy / wykonania ustugi: 16.11.2024
Strona: W/
Bank: Santander Bank Polska SA Nr rachunku: PL39 1090 1144 0000 0001 3016 3134
Kod SWIFT: WBKPPLPP
Nabywea: Odbiorea:
(DFDS LOGISTICS B.V.) (DFDS LOGISTICS B.V.)
BURGEMEESTER VAN LIERPLEIN 57 BURGEMEESTER VAN LIERPLEIN 57
3134 ZB VLAARDINGEN , THE NETHERLANDS 3134 ZB VLAARDINGEN , THE NETHERLANDS
NIP: NL 801283929B02 NIP: NL 801283929B02
Opis: — zalgezniki:
[tp. Nazwa towaru/ustugi Kod CN/ PKWiU Tlosé
| Transport-zlecenie nr/ booking number: 1 sat _* 950,00 950,00
delforw1 15296434
przelew 05.01.2025 950,00 EUR 4 103,81
4 103,81 = 4 103,81
(B) Comarch ERP Optima, v. 2024.5,1.1941, nr klucza 5000034062
Please help me with a fix or a solution.
This is the PDF file:
file.pdf
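
When comparing settings, it can help to check the extracted text for expected phrases instead of reading the whole dump each time. The sketch below is one way to do that, assuming the text has already been read into a string. `normalize` and `missing_phrases` are hypothetical helper names, and the sample string is a shortened excerpt of the output above.

```python
import unicodedata


def normalize(text):
    """Lowercase, drop combining accent marks, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_marks.lower().split())


def missing_phrases(extracted_text, expected):
    """Return the expected phrases that do not occur in the extracted text."""
    haystack = normalize(extracted_text)
    return [phrase for phrase in expected if normalize(phrase) not in haystack]


sample = (
    "Sprzedawca\n"
    "PPHU NATHALIE-MEBLE.PL\n"
    "NIP: PL 6192020532      Data wystawienia:\n"
)
missing = missing_phrases(sample, ["Faktura VAT", "Sprzedawca", "Data wystawienia"])
print(missing)
```

Running the same check after each change to the preprocessing or OCR options shows at a glance whether the "Faktura VAT" heading has started to come through.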