Instead of using a local file for prepdocs.py, get the PDF from a storage account instead #2215

JasonHough75 · 2024-12-05T16:39:42Z

Please provide us with the following information:

This issue is for a: (mark with an `x`)

- [ ] bug report -> please search issues before submitting
- [X] feature request
- [X] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

I add a PDF document to the data folder and then run prepdocs.py, and everything works. How can I get prepdocs.py to read from a storage account instead? This task will be automated, so I cannot add a file to a folder every time. Is there a place in the code that allows this, so that instead of being given a folder path, prepdocs.py automatically gets the file from a storage account?

pamelafox · 2024-12-11T20:38:22Z

You can see our current data ingestion options here:
https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/docs/data_ingestion.md#rag-chat-data-ingestion

By default, the project uses the manual script, which picks up files from the local system.

If you set up integrated vectorization, then you can use a schedule to automatically ingest from a blob storage account. However, integrated vectorization does not yet support extracting page numbers, which is an open issue here:
#1380

So unfortunately we do not have the perfect solution for your situation yet. I suggest following the page number issue for updates.
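
As a stopgap, one option is to copy the PDFs from a blob container into the local data folder and then run the manual script as usual. The following is only a minimal sketch, not code from this repository. The names `sync_pdfs_to_folder` and `main`, the container name `documents`, and the `AZURE_STORAGE_CONNECTION_STRING` environment variable are assumptions for illustration. The only SDK calls it relies on are `ContainerClient.from_connection_string()`, `list_blobs()`, and `download_blob()` from the `azure-storage-blob` package.

```python
import os
from pathlib import Path


def sync_pdfs_to_folder(container, dest_dir):
    """Download every .pdf blob in `container` into `dest_dir`.

    `container` is expected to behave like azure.storage.blob.ContainerClient:
    list_blobs() yields items with a .name attribute, and
    download_blob(name).readall() returns the blob's bytes.
    Returns the list of local paths that were written.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    for blob in container.list_blobs():
        if not blob.name.lower().endswith(".pdf"):
            continue  # skip anything that is not a PDF
        # Drop any virtual folder prefix so files land directly in dest_dir.
        # Note: blobs with the same file name in different prefixes would collide.
        target = dest / Path(blob.name).name
        target.write_bytes(container.download_blob(blob.name).readall())
        written.append(target)
    return written


def main():
    # Imported lazily so the helper above can be tested without the Azure SDK.
    from azure.storage.blob import ContainerClient

    # Assumptions: connection string in an environment variable,
    # container named "documents", local folder named "data".
    client = ContainerClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"],
        container_name="documents",
    )
    for path in sync_pdfs_to_folder(client, "data"):
        print(f"downloaded {path}")
```

Calling `main()` from a scheduled job just before the manual ingestion script runs would keep the local folder in step with the container. It does not remove files that were deleted from storage.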

pamelafox · 2024-12-11T20:40:34Z

Also: Some developers deploy the prepdocslib as Azure Functions, and use a Blob storage trigger to run it.
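
A rough sketch of what such a blob-triggered function might look like, using the Azure Functions Python v2 programming model. This is not code from the repository. The container name `documents`, the connection setting name `AzureWebJobsStorage`, and the `ingest_pdf` placeholder are assumptions; how the real ingestion code in prepdocslib is called is deliberately left out because it depends on your deployment.

```python
import logging

try:
    import azure.functions as func  # provided by the azure-functions package
except ImportError:
    # Lets the handler logic below be exercised without the Functions runtime.
    func = None


def ingest_pdf(name: str, data: bytes) -> None:
    """Hypothetical placeholder: call into your own ingestion code here."""
    logging.info("would ingest %s (%d bytes)", name, len(data))


def handle_blob(name: str, data: bytes, ingest=ingest_pdf) -> bool:
    """Ingest the blob if it is a PDF; return True when ingestion ran."""
    if not name.lower().endswith(".pdf"):
        return False
    ingest(name, data)
    return True


if func is not None:
    app = func.FunctionApp()

    # Assumptions: container "documents", app setting "AzureWebJobsStorage".
    @app.blob_trigger(
        arg_name="blob",
        path="documents/{name}",
        connection="AzureWebJobsStorage",
    )
    def on_blob_uploaded(blob: func.InputStream) -> None:
        # Runs when a blob is created or updated in the container.
        handle_blob(blob.name, blob.read())
```

Keeping the PDF check and the ingestion call in a plain function (`handle_blob`) means the logic can be unit-tested locally, with the trigger acting only as thin glue.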

pamelafox added the enhancement (New feature or request) and ingestion labels on Dec 11, 2024
