Support ingesting Excel and CSV files #659

Open
3 tasks
adamdougal opened this issue Apr 12, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@adamdougal
Collaborator

adamdougal commented Apr 12, 2024

Motivation

To be able to query data held in Excel and CSV files.

How would you feel if this feature request was implemented?

excel

Requirements

  • Support ingesting and embedding of Excel files
  • Support ingesting and embedding of CSV files
  • Ensure relevant data is output in a tabular format on the front end

Links

Tasks

To be filled in by the engineer picking up the issue

  • Task 1
  • Task 2
  • ...
@ferrari-leo

ferrari-leo commented May 15, 2024

@adamdougal Given that Azure Document Intelligence is used for some of the data loading, and that it already supports XLSX and PPT, is there a reason why Excel ingestion isn't already supported here?

Link to documentation: https://learn.microsoft.com/en-gb/azure/ai-services/document-intelligence/concept-layout?view=doc-intel-4.0.0

Maybe it's not using an up-to-date preview?

@adamdougal
Collaborator Author

@ferrari-leo Heya, I'm not 100% familiar with the history, so I'm not sure whether that only started to be supported recently. However, given it is now supported, the only things stopping this are priority and the time to implement it. We always welcome contributions, so if you need this feature and have time we'd love a PR :)

@ferrari-leo

@adamdougal I'll have a look at where there'll be free time! But my main point is this: if the Document Intelligence API already supports analysing the layout of an Excel file, why can't I drag an Excel file into the ingest data tab and have it processed? I can't pinpoint where in the code it distinguishes between an Excel file and a PDF to throw the error "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet files are not allowed." In theory there shouldn't need to be much more time spent on developing this feature, because it's inherent in the Document Intelligence API.
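
For illustration only, an upload check that produces an error of that shape could look like the sketch below. I haven't found the real check, so the allow-list contents and the names `ALLOWED_CONTENT_TYPES` and `check_upload` are hypothetical and not taken from this repository.

```python
# Hypothetical allow-list: the project's real list and its location are unknown.
ALLOWED_CONTENT_TYPES = {
    "application/pdf",
    "text/plain",
}

# MIME type reported for .xlsx uploads (taken from the error message above).
XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"


def check_upload(content_type: str, allowed: set = ALLOWED_CONTENT_TYPES) -> None:
    """Raise ValueError if the content type is not in the allow-list."""
    if content_type not in allowed:
        raise ValueError(f"{content_type} files are not allowed.")
```

If the real check has this shape, accepting Excel uploads at this stage would just mean adding the XLSX MIME type to the allowed set. Parsing the file afterwards is a separate step.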

@ross-p-smith
Collaborator

From memory, I think we were using the old Form Recognizer API.

@edgBR

edgBR commented Dec 13, 2024

Hi,

Honestly, I'm surprised this is not supported. According to the docs at https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/docs/data_ingestion.md#supported-document-formats (which seems to be a less sophisticated version of this repository),

Excel and CSVs are supported.

Looking at the code for CSVs:

# These file formats can always be parsed:
file_processors = {
    ".json": FileProcessor(JsonParser(), SimpleTextSplitter()),
    ".md": FileProcessor(TextParser(), sentence_text_splitter),
    ".txt": FileProcessor(TextParser(), sentence_text_splitter),
    ".csv": FileProcessor(CsvParser(), sentence_text_splitter),
}

It references the following file.

https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/prepdocslib/csvparser.py
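
To give a rough idea of what a parser in that slot does, here is a minimal sketch using only the standard library. It is not the contents of the linked csvparser.py. The function name `parse_csv_rows` and the "column: value" output format are my own assumptions, chosen so each row keeps its column context when it is later split and embedded.

```python
import csv
import io
from typing import Iterator


def parse_csv_rows(content: str) -> Iterator[str]:
    """Yield one text line per data row, pairing each value with its header."""
    reader = csv.reader(io.StringIO(content))
    header = next(reader, None)  # first row is assumed to be the header
    if header is None:
        return  # empty input: nothing to yield
    for row in reader:
        yield ", ".join(f"{name}: {value}" for name, value in zip(header, row))
```

For example, `list(parse_csv_rows("name,qty\napple,3\n"))` gives `["name: apple, qty: 3"]`.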
