I looked through the code and the current PDF loader is PyMuPDF. Among the free libraries, PDFMiner works better than PyMuPDF and PyPDF, so it would be good to have it. Additionally, handwritten or scanned documents require an OCR engine, which none of the above loaders support. LangChain has integrated Textract and Azure Document Intelligence loaders for the OCR use case, and it would be nice to have them in Khoj as well.
Happy to integrate this if it's part of the plan.
Hi @ishan00, can you clarify for which use cases you find PDFMiner works better than PyMuPDF?
PyMuPDF does support OCR when used with the RapidOCR library. So Khoj can handle PDFs with scans or handwritten content.
Of course, a local OCR library may not be as good as an online service like Azure Document Intelligence. If so, we could add support for a better online OCR service when one is configured (e.g. when AZURE_API_KEY is set) and fall back to a local OCR library by default.
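A rough sketch of that fallback, just to make the idea concrete. The function names `local_ocr`, `azure_ocr` and `select_ocr_backend` are hypothetical and not part of Khoj today, and the two backends are left as stubs rather than real calls into RapidOCR or Azure:

```python
import os
from typing import Callable, Mapping, Optional

# Hypothetical backend type: takes raw PDF bytes, returns extracted text.
OcrBackend = Callable[[bytes], str]


def local_ocr(pdf_bytes: bytes) -> str:
    """Stub for the default local OCR path (assumed: PyMuPDF + RapidOCR)."""
    raise NotImplementedError("local OCR is not wired up in this sketch")


def azure_ocr(pdf_bytes: bytes) -> str:
    """Stub for an online OCR path (assumed: Azure Document Intelligence)."""
    raise NotImplementedError("online OCR is not wired up in this sketch")


def select_ocr_backend(env: Optional[Mapping[str, str]] = None) -> OcrBackend:
    """Pick the online backend only when AZURE_API_KEY is set and non-empty.

    Passing `env` explicitly makes the choice easy to test; by default the
    process environment is used.
    """
    env = os.environ if env is None else env
    if env.get("AZURE_API_KEY", "").strip():
        return azure_ocr
    return local_ocr
```

With this shape, users who configure nothing keep the local behaviour, and the online service is strictly opt-in.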