Integrate better PDF Loaders - PDFMiner, Textract, Azure Document Intelligence #810

ishan00 · 2024-06-10T17:40:11Z

I looked through the code, and the PDF loader currently used is PyMuPDF. Among the free libraries, PDFMiner works better than PyMuPDF and PyPDF, so it would be good to have it. Additionally, handwritten or scanned documents require an OCR engine, which none of the above loaders support. LangChain has integrated Textract and Azure Document Intelligence loaders for the OCR use case, and it would be nice to have them for Khoj as well.

Happy to integrate this if it's part of the plan.

MythicalCow · 2024-06-10T17:46:22Z

Hi Ishan, this is something we are thinking about. If you would like to work on this, I can take a look at your PR.

sandesh0202 · 2024-06-14T16:51:05Z

I would like to work on integrating this feature.

debanjum · 2024-07-01T08:40:10Z

Hi @ishan00, can you clarify for which use cases you find PDFMiner works better than PyMuPDF?

PyMuPDF does support OCR when used with the RapidOCR library. So Khoj can handle PDFs with scans or handwritten content.

Of course, a local OCR library may not be as good as an online service like Azure Document Intelligence. If so, we could add support for using a better online OCR service when configured (e.g. when AZURE_API_KEY is set), while falling back to a local OCR library by default.
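
A minimal sketch of that selection logic, to make the proposal concrete. This is not Khoj's actual code: the function names and the `Extractor` type are hypothetical, and the two extractor bodies are deliberately left as placeholders. Only the `AZURE_API_KEY` variable name comes from the comment above.

```python
import os
from typing import Callable, Mapping, Optional

# Hypothetical extractor signature: raw PDF bytes in, extracted text out.
Extractor = Callable[[bytes], str]


def extract_with_local_ocr(pdf: bytes) -> str:
    """Placeholder for the default local path (e.g. PyMuPDF with RapidOCR)."""
    raise NotImplementedError("local OCR extraction is not part of this sketch")


def extract_with_azure(pdf: bytes) -> str:
    """Placeholder for an online path (e.g. Azure Document Intelligence)."""
    raise NotImplementedError("online OCR extraction is not part of this sketch")


def select_extractor(env: Optional[Mapping[str, str]] = None) -> Extractor:
    """Pick the online extractor only when AZURE_API_KEY is set and non-blank.

    `env` defaults to os.environ; passing a plain dict makes this easy to test.
    """
    env = os.environ if env is None else env
    if env.get("AZURE_API_KEY", "").strip():
        return extract_with_azure
    # Default: no key configured, so stay with the local OCR library.
    return extract_with_local_ocr
```

Keeping the choice in one small function means the indexing code would only ever call `select_extractor()(pdf_bytes)`, so other services such as Textract could be added later without touching the callers.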

ishan00 added the "upgrade" (New feature or request) label on Jun 10, 2024
