Improper chunking for pdf #3803
anshulgoyal43 asked this question in Q&A
I am trying to chunk a PDF for my RAG pipeline.
Here is the code I ran:
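```python
from unstructured.partition.pdf import partition_pdf

file = "/Users/anshulgoyal/work/pdf_files/a1836-10.pdf"
print("Processing file:", file)

# Partition the PDF and chunk the resulting elements in a single call
chunks = partition_pdf(filename=file, strategy="fast", chunking_strategy="basic")

# Print each chunk followed by a separator line
for chunk in chunks:
    print(chunk)
    print("-" * 100)
```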
The PDF is available at https://www.indiacode.nic.in/bitstream/123456789/18935/1/a1836-10.pdf

Sentences are broken in the middle and split across chunks. What am I missing?
Is this an issue with the PDF or with unstructured itself?
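
To make the expected behaviour concrete, here is a minimal sketch of sentence-aware chunking in plain Python. It is not unstructured's implementation and says nothing about how the library chunks internally; `split_sentences`, `pack_sentences`, and `max_chars` are hypothetical names introduced only for illustration. The splitter is deliberately naive and will break on abbreviations, which are common in legal text.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def pack_sentences(text, max_chars=500):
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept whole as its own
    oversized chunk, so no sentence is ever cut in the middle.
    """
    chunks = []
    current = ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "First sentence here. Second one follows! Third?"
    for chunk in pack_sentences(sample, max_chars=25):
        print(chunk)
        print("-" * 40)
```

In principle the same function could be applied to the concatenated text of unchunked elements (calling `partition_pdf` without `chunking_strategy`), but that would only help if the sentences are intact in the extracted text in the first place.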