-
Hi @zkfazal 👋, thanks for reporting. I quickly tested it on my Linux machine without issues.
I will try on Monday to have a deeper look 👍 But at first glance this looks to me like an issue with
-
Thanks @felixdittrich92 for testing it on your own machine as well. What's weird is that I've tried several different temporary directories, so I don't know why it would hit a permission error in every one of them.
-
Have you checked whether the issue still exists without using a temp dir?
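For example, a rough sketch (assuming the `result` and `doc` objects from the docTR sample notebook) that writes everything into a plain folder that nothing deletes afterwards:

    import os
    from PIL import Image

    # Sketch: write the intermediate files to a persistent folder instead of a
    # TemporaryDirectory, so no cleanup can race with still-open file handles on Windows.
    # Assumes `result` and `doc` come from the docTR sample notebook.
    out_dir = os.path.join(os.getcwd(), "debug_out")
    os.makedirs(out_dir, exist_ok=True)

    xml_outputs = result.export_as_xml()  # list of (xml bytes, ElementTree) tuples
    for i, (xml, img) in enumerate(zip(xml_outputs, doc)):
        Image.fromarray(img).save(os.path.join(out_dir, f"{i}.jpg"))
        with open(os.path.join(out_dir, f"{i}.xml"), "wb") as f:
            f.write(xml[0])  # raw hOCR bytes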
-
Any update on this?
-
Never mind, I figured it out myself; I just had to run the XML through unidecode to remove the accents:

    import os
    from PIL import Image
    from pypdf import PdfMerger  # PyPDF2 in older setups
    from unidecode import unidecode
    from ocrmypdf.hocrtransform import HocrTransform

    # Now merge into one PDF/A file
    # export_as_xml() returns a list of tuples where the first element is the (bytes)
    # xml string and the second is the ElementTree
    xml_outputs = result.export_as_xml()
    # you can also merge multiple pdfs into one
    merger = PdfMerger()
    tmpdir = os.path.join(os.getcwd(), r'tmpdir')
    if not os.path.exists(tmpdir):
        os.makedirs(tmpdir)
    for i, (xml, img) in enumerate(zip(xml_outputs, doc)):
        decoded_text = xml_outputs[i][0].decode()
        decoded_text_without_accents = unidecode(decoded_text)
        # write the image temp
        Image.fromarray(img).save(os.path.join(tmpdir, f"{i}.jpg"))
        # write the xml content temp
        with open(os.path.join(tmpdir, f"{i}.xml"), "w") as f:
            # f.write(xml_outputs[i][0].decode())
            f.write(decoded_text_without_accents)
        # Init hOCR transformer
        the_hocr_filename = os.path.join(tmpdir, f"{i}.xml")
        hocr = HocrTransform(hocr_filename=the_hocr_filename, dpi=600)
        # Save as PDF/A
        hocr.to_pdf(out_filename=os.path.join(tmpdir, f"{i}.pdf"), image_filename=os.path.join(tmpdir, f"{i}.jpg"))
        # Append to merger
        merger.append(os.path.join(tmpdir, f"{i}.pdf"))
    # Save as combined pdf
    merger.write('output-PDFA.pdf')
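For context, the important part is transliterating the accented characters before HocrTransform re-parses the written XML. A tiny standalone illustration of what unidecode does here (nothing docTR-specific):

    from unidecode import unidecode

    # unidecode maps accented characters to plain ASCII, so the XML written to disk
    # only contains tokens the downstream hOCR parser accepts
    snippet = '<span class="ocrx_word">café naïve</span>'
    print(unidecode(snippet))
    # -> <span class="ocrx_word">cafe naive</span>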
-
Bug description
I double-checked existing issues and could not find any related to the bug I'm experiencing while trying to copy the sample code for generating PDF/A files from docTR output in a Jupyter notebook; the link can be found here.
My main issue is with the "All Merged Into One PDF/A file" section, where the code is like so:
I slightly modified the code to use
with TemporaryDirectory(os.getcwd())
and extracted the hocr_filename into its own variable, but these changes do not affect the problem I'm having, which is the following error:
"PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\Users\Me\PycharmProjects\doctr-jupyter\tmpdbajkfk8\0.pdf'"
I am currently running this in a Jupyter notebook in PyCharm. I am able to run the docTR setup from the initial demo (the code contained in your other sample notebook that sets up the basic model) just fine, and it can parse the text of the PDF I'm using, but once it gets to the HocrTransform init line when i=3 (so for the fourth page), it throws this error.
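For reference, WinError 32 means something still holds an open handle on that file. One hedged guess, not a confirmed cause here: PdfMerger keeps every appended file open until it is closed, so releasing it explicitly before any temp-dir cleanup runs looks roughly like this:

    from pypdf import PdfMerger  # PyPDF2 in older setups

    # Sketch under the assumption that lingering merger handles are the culprit:
    # close the merger explicitly so all appended per-page PDFs are released
    # before the temporary directory gets cleaned up.
    merger = PdfMerger()
    for path in per_page_pdfs:  # hypothetical list of the per-page PDF paths written earlier
        merger.append(path)
    merger.write("output-PDFA.pdf")
    merger.close()  # release all appended file handles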
The PDF I'm using is attached as Syllabus.pdf, a syllabus I got from a friend. This one at least gets halfway; if I use a different PDF (attached as somatosensory.pdf, a sample PDF I obtained online), it gives me a different error:
ParseError: not well-formed (invalid token): line 1, column 38224
somatosensory.pdf
Syllabus.pdf
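For the second error, a small debugging sketch (assuming the xml_outputs list from result.export_as_xml()) that reports which page's hOCR is not well-formed before it ever reaches HocrTransform:

    import xml.etree.ElementTree as ET

    # Try to parse each exported hOCR page and report the first offending token,
    # which narrows the ParseError down to a specific page of the input PDF.
    for i, (xml_bytes, _tree) in enumerate(xml_outputs):
        try:
            ET.fromstring(xml_bytes)
        except ET.ParseError as err:
            print(f"page {i}: {err}")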
Code snippet to reproduce the bug
Error traceback
And for the PermissionError:
Environment
Collecting environment information...
DocTR version: 0.10.1a0
TensorFlow version: N/A
PyTorch version: 2.5.1+cpu (torchvision 0.20.1+cpu)
OpenCV version: 4.10.0
OS: Microsoft Windows 10 Pro
Python version: 3.12.7
Is CUDA available (TensorFlow): N/A
Is CUDA available (PyTorch): No
CUDA runtime version: Could not collect
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
Deep Learning backend
is_tf_available: False
is_torch_available: True