-
Hi @zkfazal 👋, thanks for reporting. I quickly tested it on my Linux machine without issues.
I will try on Monday to have a deeper look 👍 But at first glance this looks to me like an issue with
-
Thanks @felixdittrich92 for testing it on your own machine as well. What's weird is that I've tried several different temporary directories, so I don't know why it would hit a permission error in every one of them.
-
Have you checked whether the issue still exists without using a temp dir?
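For example, a rough sketch (assuming the `result` and `doc` objects from the docTR sample notebook) that writes everything into a plain folder that nothing deletes afterwards:

    import os
    from PIL import Image

    # Sketch: write the intermediate files to a persistent folder instead of a
    # TemporaryDirectory, so no cleanup can race with still-open file handles on Windows.
    # Assumes `result` and `doc` come from the docTR sample notebook.
    out_dir = os.path.join(os.getcwd(), "debug_out")
    os.makedirs(out_dir, exist_ok=True)

    xml_outputs = result.export_as_xml()  # list of (xml bytes, ElementTree) tuples
    for i, (xml, img) in enumerate(zip(xml_outputs, doc)):
        Image.fromarray(img).save(os.path.join(out_dir, f"{i}.jpg"))
        with open(os.path.join(out_dir, f"{i}.xml"), "wb") as f:
            f.write(xml[0])  # raw hOCR bytes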
-
Any update on this?
-
Never mind, I figured it out myself; I just had to run the XML through unidecode to remove the accents:

    import os
    from PIL import Image
    from pypdf import PdfMerger  # PyPDF2 in older setups
    from unidecode import unidecode
    from ocrmypdf.hocrtransform import HocrTransform

    # Now merge into one PDF/A file
    # export_as_xml() returns a list of tuples where the first element is the (bytes)
    # xml string and the second is the ElementTree
    xml_outputs = result.export_as_xml()
    # you can also merge multiple pdfs into one
    merger = PdfMerger()
    tmpdir = os.path.join(os.getcwd(), r'tmpdir')
    if not os.path.exists(tmpdir):
        os.makedirs(tmpdir)
    for i, (xml, img) in enumerate(zip(xml_outputs, doc)):
        decoded_text = xml_outputs[i][0].decode()
        decoded_text_without_accents = unidecode(decoded_text)
        # write the image temp
        Image.fromarray(img).save(os.path.join(tmpdir, f"{i}.jpg"))
        # write the xml content temp
        with open(os.path.join(tmpdir, f"{i}.xml"), "w") as f:
            # f.write(xml_outputs[i][0].decode())
            f.write(decoded_text_without_accents)
        # Init hOCR transformer
        the_hocr_filename = os.path.join(tmpdir, f"{i}.xml")
        hocr = HocrTransform(hocr_filename=the_hocr_filename, dpi=600)
        # Save as PDF/A
        hocr.to_pdf(out_filename=os.path.join(tmpdir, f"{i}.pdf"), image_filename=os.path.join(tmpdir, f"{i}.jpg"))
        # Append to merger
        merger.append(os.path.join(tmpdir, f"{i}.pdf"))
    # Save as combined pdf
    merger.write('output-PDFA.pdf')
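For context, the important part is transliterating the accented characters before HocrTransform re-parses the written XML. A tiny standalone illustration of what unidecode does here (nothing docTR-specific):

    from unidecode import unidecode

    # unidecode maps accented characters to plain ASCII, so the XML written to disk
    # only contains tokens the downstream hOCR parser accepts
    snippet = '<span class="ocrx_word">café naïve</span>'
    print(unidecode(snippet))
    # -> <span class="ocrx_word">cafe naive</span>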
-
Bug description
I double-checked existing issues and could not find any related to the bug I'm experiencing while trying to copy the sample code for generating PDF/A files from docTR output in a Jupyter notebook; the link can be found here.
My main issue is with the "All Merged Into One PDF/A file" section, where the code is like so:
I slightly modified the code to use
with TemporaryDirectory(os.getcwd())
and extracted the hocr_filename into its own variable, but these changes do not affect the problem I'm having, which is the following error:
"PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\Users\Me\PycharmProjects\doctr-jupyter\tmpdbajkfk8\0.pdf'"
I am currently running this in a Jupyter notebook in PyCharm. I am able to run the docTR setup from the initial demo (the code contained in your other sample notebook that sets up the basic model) just fine, and it can parse the text of the PDF I'm using, but once it gets to the HocrTransform init line when i=3 (so for the fourth page), it throws this error.
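For reference, WinError 32 means something still holds an open handle on that file. One hedged guess, not a confirmed cause here: PdfMerger keeps every appended file open until it is closed, so releasing it explicitly before any temp-dir cleanup runs looks roughly like this:

    from pypdf import PdfMerger  # PyPDF2 in older setups

    # Sketch under the assumption that lingering merger handles are the culprit:
    # close the merger explicitly so all appended per-page PDFs are released
    # before the temporary directory gets cleaned up.
    merger = PdfMerger()
    for path in per_page_pdfs:  # hypothetical list of the per-page PDF paths written earlier
        merger.append(path)
    merger.write("output-PDFA.pdf")
    merger.close()  # release all appended file handles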
The PDF I'm using is attached as Syllabus.pdf, a syllabus I got from a friend. This one at least gets halfway; if I use a different PDF (attached as somatosensory.pdf, a sample PDF I obtained online), it gives me a different error:
ParseError: not well-formed (invalid token): line 1, column 38224
somatosensory.pdf
Syllabus.pdf
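For the second error, a small debugging sketch (assuming the xml_outputs list from result.export_as_xml()) that reports which page's hOCR is not well-formed before it ever reaches HocrTransform:

    import xml.etree.ElementTree as ET

    # Try to parse each exported hOCR page and report the first offending token,
    # which narrows the ParseError down to a specific page of the input PDF.
    for i, (xml_bytes, _tree) in enumerate(xml_outputs):
        try:
            ET.fromstring(xml_bytes)
        except ET.ParseError as err:
            print(f"page {i}: {err}")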
Code snippet to reproduce the bug
Error traceback
And for the PermissionError:
Environment
Collecting environment information...
DocTR version: 0.10.1a0
TensorFlow version: N/A
PyTorch version: 2.5.1+cpu (torchvision 0.20.1+cpu)
OpenCV version: 4.10.0
OS: Microsoft Windows 10 Pro
Python version: 3.12.7
Is CUDA available (TensorFlow): N/A
Is CUDA available (PyTorch): No
CUDA runtime version: Could not collect
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
Deep Learning backend
is_tf_available: False
is_torch_available: True