Add a table splitter and table cleaner components #8625

Open
sjrl opened this issue Dec 11, 2024 · 0 comments
Labels
P2 Medium priority, add to the next sprint if no P1 available

Comments

Contributor

sjrl commented Dec 11, 2024

As a follow-up to PR #8522, which adds the XLSXToDocument converter to Haystack, I believe the following two components would also be very useful.

1. A TableSplitter component

Specifically, I want a component that can detect when a single sheet actually contains multiple tables. For example, this table

,A,B,C,D,E,F
1,,,,,,
2,,,,,,
3,,col_a,col_b,,,
4,,1.5,test,,col_c,col_d
5,,,,,3,True

is really composed of two separable tables. Ideally, this component could use heuristics to split such a table into smaller ones while optionally preserving the original row and column headers.

This type of component is highly relevant for business users who have extremely large Excel sheets that contain many different sub-tables.
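
To make the idea concrete, here is a minimal sketch of one possible heuristic, not a proposed implementation: treat non-empty cells as a grid, group them into connected components (diagonal neighbours count as connected), and cut out the bounding box of each component as its own table. The names `parse_csv` and `split_table` are hypothetical, the sketch uses only the standard library rather than Haystack or pandas, and it assumes a rectangular sheet whose first row and first column hold the Excel-style headers.

```python
import csv
import io
from collections import deque

Table = list[list[str]]


def parse_csv(text: str) -> Table:
    """Parse CSV text into a list of rows (hypothetical helper)."""
    return list(csv.reader(io.StringIO(text.strip())))


def split_table(table: Table) -> list[Table]:
    """Split a sheet into sub-tables, one per block of touching non-empty cells.

    Assumes row 0 holds the column headers and column 0 holds the row headers.
    """
    header, body = table[0], table[1:]
    filled = {
        (i, j)
        for i, row in enumerate(body)
        for j, cell in enumerate(row)
        if j > 0 and cell.strip()
    }
    seen: set[tuple[int, int]] = set()
    sub_tables: list[Table] = []
    for start in sorted(filled):
        if start in seen:
            continue
        # Breadth-first search over the 8 neighbours of each non-empty cell.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            i, j = queue.popleft()
            component.append((i, j))
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    nxt = (i + di, j + dj)
                    if nxt in filled and nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
        # Cut out the bounding box, keeping the original row/column headers.
        top = min(i for i, _ in component)
        bottom = max(i for i, _ in component)
        left = min(j for _, j in component)
        right = max(j for _, j in component)
        sub = [[""] + header[left:right + 1]]
        for i in range(top, bottom + 1):
            sub.append([body[i][0]] + body[i][left:right + 1])
        sub_tables.append(sub)
    return sub_tables


SHEET = """
,A,B,C,D,E,F
1,,,,,,
2,,,,,,
3,,col_a,col_b,,,
4,,1.5,test,,col_c,col_d
5,,,,,3,True
"""

tables = split_table(parse_csv(SHEET))
for t in tables:
    print(t)
```

On the example sheet this yields one table covering columns B-C, rows 3-4, and another covering columns E-F, rows 4-5. The obvious limitation is that a table with a fully empty row or column inside it would be split as well, so a real component would likely need a configurable gap tolerance.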

2. A TableCleaner component

This component should remove empty rows and columns from a table while optionally retaining the original column and row headers. For example, for the table

,A,B,C
1,,,
2,,,
3,,,
4,,col_a,col_b
5,,1.5,test

I'd like to remove the first three rows (1-3) and column A to end up with

,B,C
4,col_a,col_b
5,1.5,test
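
The cleaning step described above could be sketched roughly as follows. This is only an illustration under assumptions: `parse_csv` and `clean_table` are hypothetical names, the `keep_headers` flag stands in for the "optionally retaining" behaviour, the sheet is assumed to be rectangular with headers in the first row and first column, and only the standard library is used.

```python
import csv
import io

Table = list[list[str]]


def parse_csv(text: str) -> Table:
    """Parse CSV text into a list of rows (hypothetical helper)."""
    return list(csv.reader(io.StringIO(text.strip())))


def clean_table(table: Table, keep_headers: bool = True) -> Table:
    """Drop rows and columns whose data cells are all empty.

    Assumes row 0 holds the column headers and column 0 holds the row headers.
    """
    header, body = table[0], table[1:]
    # Keep rows that have at least one non-empty data cell.
    rows = [row for row in body if any(cell.strip() for cell in row[1:])]
    # Keep columns that have at least one non-empty cell in the kept rows.
    cols = [
        j for j in range(1, len(header))
        if any(row[j].strip() for row in rows)
    ]
    if keep_headers:
        cleaned = [[""] + [header[j] for j in cols]]
        cleaned += [[row[0]] + [row[j] for j in cols] for row in rows]
    else:
        cleaned = [[row[j] for j in cols] for row in rows]
    return cleaned


SHEET = """
,A,B,C
1,,,
2,,,
3,,,
4,,col_a,col_b
5,,1.5,test
"""

cleaned = clean_table(parse_csv(SHEET))
for row in cleaned:
    print(",".join(row))
```

With `keep_headers=True` the surviving cells keep their original labels (B, C and 4, 5), which matches the result shown above. With `keep_headers=False` only the data cells are returned.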
julian-risch added the "P2 Medium priority, add to the next sprint if no P1 available" label Dec 12, 2024

2 participants