OCR failure on certain PDF pages #1812

vikasr111 · 2024-12-03T10:33:45Z

vikasr111
Dec 3, 2024

Bug description

I am trying to run OCR on a PDF document with DocTR. OCR for the whole document fails because of a single page in it. When I ran that page on its own as a one-page PDF, I got the full error log. Here's the PDF:
po-r.pdf

Here's the error log:

Traceback (most recent call last):
  File "/app/main.py", line 25, in <module>
    main(args.filename)
  File "/app/main.py", line 12, in main
    doctr_onnx(filename)
  File "/app/doctr_onnx.py", line 394, in doctr_onnx
    result = predictor(docs)
  File "/usr/local/lib/python3.10/site-packages/onnxtr/models/predictor/predictor.py", line 132, in __call__
    word_preds = self.reco_predictor([crop for page_crops in crops for crop in page_crops], **kwargs)
  File "/usr/local/lib/python3.10/site-packages/onnxtr/models/recognition/predictor/base.py", line 67, in __call__
    processed_batches = self.pre_processor(crops)  # type: ignore[arg-type]
  File "/usr/local/lib/python3.10/site-packages/onnxtr/models/preprocessor/base.py", line 105, in __call__
    samples = list(multithread_exec(self.sample_transforms, x))
  File "/usr/local/lib/python3.10/site-packages/onnxtr/utils/multithreading.py", line 47, in multithread_exec
    results = map(lambda x: x, tp.map(func, seq))  # noqa: C417
  File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 774, in get
    raise self._value
  File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/usr/local/lib/python3.10/site-packages/onnxtr/models/preprocessor/base.py", line 70, in sample_transforms
    x = self.resize(x)
  File "/usr/local/lib/python3.10/site-packages/onnxtr/transforms/base.py", line 58, in __call__
    img_resized = Image.fromarray(img).resize(tmp_size, resample=self.interpolation)
  File "/usr/local/lib/python3.10/site-packages/PIL/Image.py", line 2200, in resize
    return self._new(self.im.resize(size, resample, box))
ValueError: height and width must be > 0

I printed the docs of DocTR to debug further and here's the result:

Number of pages in po-r.pdf: 1
Pages: [array([[[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]],

       [[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]],

       [[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]],

       ...,

       [[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]],

       [[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]],

       [[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]]], dtype=uint8)]
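
The truncated repr above only shows the corner pixels, so it cannot tell whether the page is really blank. A minimal sketch for summarising each page array instead, assuming `docs` is a list of H×W×3 uint8 arrays as printed above; `summarize_page` is a hypothetical helper, not part of docTR or OnnxTR:

```python
import numpy as np


def summarize_page(page: np.ndarray) -> dict:
    """Hypothetical helper: report the size of a page and how much of it is non-white."""
    # True for every pixel where at least one channel differs from 255
    non_white = np.any(page != 255, axis=-1)
    return {
        "shape": page.shape,
        "dtype": str(page.dtype),
        "min": int(page.min()),
        "max": int(page.max()),
        "non_white_fraction": float(non_white.mean()),
    }


# Stand-in for one entry of `docs`: a white page with a small dark patch.
page = np.full((100, 200, 3), 255, dtype=np.uint8)
page[10:20, 30:50] = 0

summary = summarize_page(page)
print(summary)
```

Running `summarize_page` over every entry of `docs` inside and outside the container would show whether the rendered pages differ between the two environments.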

Code snippet to reproduce the bug

from doctr.models import ocr_predictor
from doctr.io import DocumentFile

# path to the input PDF
pdf_path = "./data/invoice/po-r.pdf"

# load the PDF pages as a list of numpy arrays
docs = DocumentFile.from_pdf(pdf_path)

# load model
predictor = ocr_predictor(
    det_arch="db_mobilenet_v3_large",
    reco_arch="parseq",
    pretrained=True,
    preserve_aspect_ratio=False,
    symmetric_pad=False,
    )

predictor.det_predictor.model.postprocessor.bin_thresh = 0.35
predictor.det_predictor.model.postprocessor.box_thresh = 0.3


result = predictor(docs)

# display ocr boxes
result.show(docs)

Error traceback

Here's the error log:

Traceback (most recent call last):
  File "/app/main.py", line 25, in <module>
    main(args.filename)
  File "/app/main.py", line 12, in main
    doctr_onnx(filename)
  File "/app/doctr_onnx.py", line 394, in doctr_onnx
    result = predictor(docs)
  File "/usr/local/lib/python3.10/site-packages/onnxtr/models/predictor/predictor.py", line 132, in __call__
    word_preds = self.reco_predictor([crop for page_crops in crops for crop in page_crops], **kwargs)
  File "/usr/local/lib/python3.10/site-packages/onnxtr/models/recognition/predictor/base.py", line 67, in __call__
    processed_batches = self.pre_processor(crops)  # type: ignore[arg-type]
  File "/usr/local/lib/python3.10/site-packages/onnxtr/models/preprocessor/base.py", line 105, in __call__
    samples = list(multithread_exec(self.sample_transforms, x))
  File "/usr/local/lib/python3.10/site-packages/onnxtr/utils/multithreading.py", line 47, in multithread_exec
    results = map(lambda x: x, tp.map(func, seq))  # noqa: C417
  File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 774, in get
    raise self._value
  File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/usr/local/lib/python3.10/site-packages/onnxtr/models/preprocessor/base.py", line 70, in sample_transforms
    x = self.resize(x)
  File "/usr/local/lib/python3.10/site-packages/onnxtr/transforms/base.py", line 58, in __call__
    img_resized = Image.fromarray(img).resize(tmp_size, resample=self.interpolation)
  File "/usr/local/lib/python3.10/site-packages/PIL/Image.py", line 2200, in resize
    return self._new(self.im.resize(size, resample, box))
ValueError: height and width must be > 0

Environment

Python - 3.11.7

Deep Learning backend

Torch

felixdittrich92 · 2024-12-03T10:38:27Z

felixdittrich92
Dec 3, 2024
Maintainer

Thanks for reporting 👍

Would it be possible to share a PDF with which we can reproduce the issue?
Also, I see it's OnnxTR-related ^^

/usr/local/lib/python3.10/site-packages/onnxtr/transforms/base.py


vikasr111 · 2024-12-03T12:08:30Z

vikasr111
Dec 3, 2024
Author

Here's the pdf link: https://github.com/user-attachments/files/17991370/po-r.pdf

I ran it through both DocTR and OnnxTR and ran into same problem.


felixdittrich92 · 2024-12-03T13:24:07Z

felixdittrich92
Dec 3, 2024
Maintainer

Hey @vikasr111 👋,

I tested it with both docTR and OnnxTR but could not reproduce the bug. Could you try providing the absolute path to the PDF?

I used the same args as in your snippet and only changed the PDF path to an absolute one.


felixdittrich92 · 2024-12-03T13:25:58Z

felixdittrich92
Dec 3, 2024
Maintainer

By the way, result.show(docs) is outdated; it has been just result.show() since v0.9, if I remember correctly.

docTR:

OnnxTR:


vikasr111 · 2024-12-04T05:34:02Z

vikasr111
Dec 4, 2024
Author

@felixdittrich92 That's odd. When I run this directly on my system it works fine, but when I run the same code in a Docker container I get this error.

I will investigate further, as the error comes from Pillow, which appears to be getting a blank image for the page.
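
The Pillow error at the bottom of the traceback can be reproduced in isolation: `Image.resize` raises a `ValueError` when the requested size has a zero dimension. The sketch below assumes, without having confirmed it in the OnnxTR source, that a very thin crop ends up with such a size after aspect-ratio scaling; `aspect_preserving_size` and `clamp_size` are hypothetical helpers for illustration only.

```python
from PIL import Image


def aspect_preserving_size(crop_w, crop_h, target_w, target_h):
    """Hypothetical helper: scale a crop to fit the target box, truncating to int."""
    scale = min(target_w / crop_w, target_h / crop_h)
    return int(crop_w * scale), int(crop_h * scale)


def clamp_size(size):
    """Hypothetical guard: never let a dimension drop below one pixel."""
    return max(size[0], 1), max(size[1], 1)


# A very thin crop: 512 px wide, 1 px tall.
crop = Image.new("RGB", (512, 1))

# scale = min(128 / 512, 32 / 1) = 0.25, so the height truncates to 0.
tmp_size = aspect_preserving_size(512, 1, 128, 32)

try:
    crop.resize(tmp_size)
    error = None
except ValueError as exc:
    error = str(exc)

# With the guard in place the resize goes through.
resized = crop.resize(clamp_size(tmp_size))
print(tmp_size, error, resized.size)
```

If this is what happens here, the underlying question becomes why the detector produces such degenerate boxes in the container but not on the host.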


felixdittrich92 · 2024-12-04T06:11:04Z

felixdittrich92
Dec 4, 2024
Maintainer

Maybe an issue with pypdfium2?
What do you use as the base image in your Dockerfile?


vikasr111 · 2024-12-04T09:28:09Z

vikasr111
Dec 4, 2024
Author

@felixdittrich92 I use python:3.11-slim-bullseye

Here's my full Dockerfile:

# Base Stage: Install common dependencies
FROM python:3.11-slim-bullseye

# Set the working directory in the container to /app
WORKDIR /app

# Set PYTHONPATH to include the /app directory
ENV PYTHONPATH="${PYTHONPATH}:/app"

# Add the rest of the current directory contents into the container at /app
COPY . .

# Install dependencies for OCR
RUN apt-get update \
    && apt-get install --no-install-recommends -y libgl1-mesa-glx libglib2.0-0  build-essential \
    # Install your Python packages that require compilation here
    && pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu \
    && pip install "python-doctr[torch]" \
    && pip install --no-cache-dir "onnxtr[cpu]" \
    && pip install numpy scikit-learn matplotlib pymupdf \
    # Cleanup: Remove build-essential and cleanup apt cache in the same command
    && apt-get purge -y build-essential \
    && apt-get autoremove -y \
    && rm -rf /var/lib/apt/lists/* /root/.cache/pip

I found a lead: when I set symmetric_pad=False it works fine, but when I set it to True it starts failing, and only for this particular PDF.
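
To pin down which option triggers the failure, the predictor flags can be swept systematically, recording the outcome of each combination. A minimal sketch: `try_settings` is a hypothetical harness and `fake_run_ocr` is a stand-in that fails for one arbitrary combination purely to exercise it, so it says nothing about how docTR actually behaves.

```python
from itertools import product


def try_settings(run_ocr, grid):
    """Call run_ocr(**kwargs) for every combination in grid and record the outcome."""
    keys = list(grid)
    outcomes = []
    for values in product(*(grid[key] for key in keys)):
        kwargs = dict(zip(keys, values))
        try:
            run_ocr(**kwargs)
            outcomes.append((kwargs, "ok"))
        except Exception as exc:  # broad on purpose: the failure is only logged
            outcomes.append((kwargs, f"{type(exc).__name__}: {exc}"))
    return outcomes


def fake_run_ocr(preserve_aspect_ratio, symmetric_pad):
    """Stand-in for the real OCR call; the failing combination is arbitrary."""
    if preserve_aspect_ratio and symmetric_pad:
        raise ValueError("height and width must be > 0")


grid = {
    "preserve_aspect_ratio": [False, True],
    "symmetric_pad": [False, True],
}
outcomes = try_settings(fake_run_ocr, grid)
for kwargs, status in outcomes:
    print(kwargs, "->", status)
```

In the real setup, `run_ocr` would build `ocr_predictor(...)` with the given `preserve_aspect_ratio` and `symmetric_pad` values, as in the snippet from the original post, and call it on `docs`. Running the same sweep inside and outside the container makes the two environments directly comparable.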

2 replies

felixdittrich92 Dec 4, 2024
Maintainer

@vikasr111 quickly tested: 😅
Does it also not work on your local machine?

felixdittrich92 Dec 4, 2024
Maintainer

I also created a fresh environment so that all sub-packages are up to date.
