Hi everyone,

I am using Pillow as a transitive dependency of Paperless NGX. When OCRing some specific PDF files, I get the following stack trace:

Traceback (most recent call last):
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 368, in parse
ocrmypdf.ocr(**args)
File "/usr/local/lib/python3.11/site-packages/ocrmypdf/api.py", line 375, in ocr
return run_pipeline(options=options, plugin_manager=plugin_manager)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/ocrmypdf/_pipelines/ocr.py", line 225, in run_pipeline
return _run_pipeline(options, plugin_manager)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/ocrmypdf/_pipelines/ocr.py", line 176, in _run_pipeline
pdfinfo = get_pdfinfo(
^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/ocrmypdf/_pipeline.py", line 173, in get_pdfinfo
return PdfInfo(
^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/ocrmypdf/pdfinfo/info.py", line 1103, in __init__
self._pages = _pdf_pageinfo_concurrent(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/ocrmypdf/pdfinfo/info.py", line 778, in _pdf_pageinfo_concurrent
executor(
File "/usr/local/lib/python3.11/site-packages/ocrmypdf/_concurrent.py", line 74, in __call__
self._execute(
File "/usr/local/lib/python3.11/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 141, in _execute
result = future.result()
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/usr/local/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/ocrmypdf/pdfinfo/info.py", line 727, in _pdf_pageinfo_sync
return PageInfo(pdf, pageno, infile, check_pages, detailed_analysis)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/ocrmypdf/pdfinfo/info.py", line 842, in __init__
self._gather_pageinfo(pdf, pageno, infile, check_pages, detailed_analysis)
File "/usr/local/lib/python3.11/site-packages/ocrmypdf/pdfinfo/info.py", line 887, in _gather_pageinfo
for info in _process_content_streams(
File "/usr/local/lib/python3.11/site-packages/ocrmypdf/pdfinfo/info.py", line 638, in _process_content_streams
yield from _find_regular_images(container, contentsinfo)
File "/usr/local/lib/python3.11/site-packages/ocrmypdf/pdfinfo/info.py", line 554, in _find_regular_images
yield ImageInfo(name=draw.name, pdfimage=pdfimage, shorthand=draw.shorthand)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/ocrmypdf/pdfinfo/info.py", line 364, in __init__
pim = PdfImage(pdfimage)
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/pikepdf/models/image.py", line 831, in __init__
self._jpxpil = self.as_pil_image()
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/pikepdf/models/image.py", line 740, in as_pil_image
return Image.open(bio)
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/PIL/Image.py", line 3323, in open
im = _open_core(
^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/PIL/Image.py", line 3304, in _open_core
im = factory(fp, filename)
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/PIL/ImageFile.py", line 137, in __init__
self._open()
File "/usr/local/lib/python3.11/site-packages/PIL/Jpeg2KImagePlugin.py", line 224, in _open
header = _parse_jp2_header(self.fp)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/PIL/Jpeg2KImagePlugin.py", line 185, in _parse_jp2_header
palette.getcolor(header.read_fields(">" + ("B" * npc)))
File "/usr/local/lib/python3.11/site-packages/PIL/ImagePalette.py", line 144, in getcolor
raise ValueError(msg)
ValueError: cannot add non-opaque RGBA color to RGB palette

I tested the same PDF file with multiple versions of Pillow, both through OCRmyPDF directly and through Paperless NGX itself. With Pillow 9.4.0 I can successfully OCR the file, while 10.3.0 (the origin of the above stack trace) and even 10.4.0 raise this error.

Unfortunately, I am not sure whether the issue is even caused by Pillow. The maintainers of Paperless NGX at least see this as a downstream issue (see here).

Is anyone else familiar with this issue, or does anyone have an idea where to look for it? Thanks in advance.
Replies: 1 comment 8 replies
Are you able to modify your copy of pikepdf so that it prints
Thanks. I've created #8256 to fix this.
Without it, opening those two images raises the error you mentioned.
With it, the errors do not appear.