Instead of using a local file for predocs.py - get the pdf from a storage account instead #2215

Open
JasonHough75 opened this issue Dec 5, 2024 · 2 comments
Labels: enhancement, ingestion

Comments

@JasonHough75

Please provide us with the following information:

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [X] feature request
- [X] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

I add a PDF document to the data folder and then run prepdocs.py, and everything works. How can I get prepdocs.py to read from a storage account instead? This task will be automated, so I cannot add a file to a folder every time. Is there a place in the code that allows this, so that instead of being given a folder path, prepdocs.py automatically gets the file from a storage account?


@pamelafox
Collaborator

You can see our current data ingestion options here:
https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/docs/data_ingestion.md#rag-chat-data-ingestion

By default, the project uses the manual script, which picks up files from the local system.

If you set up integrated vectorization, then you can use a schedule to automatically ingest from a blob storage account. However, integrated vectorization does not yet support extracting page numbers, which is an open issue here:
#1380

So unfortunately we do not have the perfect solution for your situation yet. I suggest following the page number issue for updates.
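Until then, one possible stopgap is to copy the PDFs from the storage account into the local data folder and then run the manual script as usual. The following is only a rough sketch under assumptions, not something the project ships: `sync_pdfs`, `main`, and the two environment variable names are hypothetical, and the helper only relies on the `list_blobs` and `download_blob` methods of `azure.storage.blob.ContainerClient`.

```python
from pathlib import Path, PurePosixPath


def sync_pdfs(container_client, dest_dir, prefix=None):
    """Download every PDF blob in the container into dest_dir.

    container_client is expected to behave like
    azure.storage.blob.ContainerClient (list_blobs / download_blob).
    Returns the list of local paths that were written.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    for blob in container_client.list_blobs(name_starts_with=prefix):
        if not blob.name.lower().endswith(".pdf"):
            continue  # skip anything that is not a PDF
        # Flatten virtual folders: keep only the last path segment.
        # Blobs with the same file name in different folders overwrite each other.
        target = dest / PurePosixPath(blob.name).name
        target.write_bytes(container_client.download_blob(blob.name).readall())
        written.append(target)
    return written


def main():
    # Imported here so sync_pdfs can be exercised without the Azure SDK installed.
    import os
    from azure.storage.blob import ContainerClient

    client = ContainerClient.from_connection_string(
        os.environ["STORAGE_CONNECTION_STRING"],  # hypothetical variable name
        container_name=os.environ["STORAGE_CONTAINER"],  # hypothetical variable name
    )
    for path in sync_pdfs(client, "data"):
        print(f"downloaded {path}")
```

Calling `main()` from a scheduled job before the ingestion script runs would keep the data folder in step with the container. It does not remove files that were deleted from the container.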

@pamelafox
Collaborator

Also: some developers deploy prepdocslib as an Azure Function and use a Blob storage trigger to run it whenever a file is uploaded.
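To give a rough idea of what that could look like, here is a minimal sketch using the Azure Functions Python v2 programming model. It is an illustration under assumptions, not the setup those developers actually use: `handle_blob`, `ingest_pdf`, `build_function_app`, and the `documents` container name are hypothetical, and `ingest_pdf` is a placeholder for whatever prepdocslib call does the real ingestion.

```python
from pathlib import PurePosixPath


def handle_blob(blob_path, data, ingest):
    """Pass one uploaded PDF to `ingest`; ignore other file types.

    Returns True if the blob was handed to `ingest`, False otherwise.
    """
    filename = PurePosixPath(blob_path).name  # strip the container/folder prefix
    if not filename.lower().endswith(".pdf"):
        return False
    ingest(filename, data)
    return True


def ingest_pdf(filename, data):
    # Hypothetical placeholder: call into prepdocslib (or your own code) here.
    raise NotImplementedError("wire this up to your ingestion code")


def build_function_app(container="documents"):
    # Imported lazily so handle_blob can be tested without the Functions SDK.
    import azure.functions as func

    app = func.FunctionApp()

    # Fires once for each blob created or updated in the container.
    @app.blob_trigger(
        arg_name="blob",
        path=f"{container}/{{name}}",
        connection="AzureWebJobsStorage",  # app setting holding the storage connection
    )
    def on_pdf_uploaded(blob: func.InputStream):
        handle_blob(blob.name, blob.read(), ingest_pdf)

    return app
```

In a real `function_app.py` the result would be assigned at module level, for example `app = build_function_app()`, so the Functions host can discover the trigger.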

@pamelafox added the enhancement and ingestion labels Dec 11, 2024