feat: repeat row headings in each table chunk #3778

Open
hardchor opened this issue Nov 11, 2024 · 2 comments
Labels
chunking Related to element chunking. enhancement New feature or request

Comments

@hardchor

Describe the bug
When chunking text that contains tables (using the by_title strategy), tables are split into chunks row by row if max_characters is set sufficiently low. That's great, and it aligns with the best practice of giving each row its own chunk. However, each chunk then loses all context for the data in its table row.
Since that context can typically be found in the table header (i.e. usually the first row), I currently go through all rows manually and prepend the table header. I can provide code if needed, but it's not the prettiest solution, since I essentially have to parse the text_as_html output and then stitch it back together.

P.S.: I also couldn't get it to produce TableChunk elements, but maybe that's not intended behaviour in this case?

To Reproduce
Run ingestion of any document with a table in it and chunk it using the by_title strategy with a sufficiently small max_characters value.

Expected behavior

  1. If the table header would end up in a chunk of its own, no chunk is produced for it.
  2. Each subsequent table row chunk gets prefixed with the table header.
<table>
  <thead>
    <tr>
      <th>property1</th>
      <th>property2</th>
      <th>property3</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>value1</td>
      <td>value2</td>
      <td>value3</td>
    </tr>
  </tbody>
</table>
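A minimal sketch of the manual workaround described above, for anyone who needs it in the meantime. It is not part of the library: `repeat_header` is a hypothetical helper, and it assumes that each chunk's `text_as_html` is a well-formed `<table>` fragment that `xml.etree.ElementTree` can parse, and that the first row of the first chunk is the header (it is taken on trust, not detected).

```python
import xml.etree.ElementTree as ET


def _cell_texts(row: ET.Element) -> list[str]:
    """Return the stripped text of every cell (<th> or <td>) in a row."""
    return ["".join(cell.itertext()).strip() for cell in row]


def repeat_header(chunk_htmls: list[str]) -> list[str]:
    """Prefix every table chunk with the header row (hypothetical helper).

    Assumption: the first <tr> of the first chunk is the header row.
    A chunk that contains only the header produces no output chunk.
    """
    if not chunk_htmls:
        return []

    first_row = ET.fromstring(chunk_htmls[0]).find(".//tr")
    if first_row is None:
        return list(chunk_htmls)  # nothing to treat as a header
    header_cells = _cell_texts(first_row)

    result = []
    for index, html in enumerate(chunk_htmls):
        rows = ET.fromstring(html).findall(".//tr")
        if index == 0:
            rows = rows[1:]  # the header row itself is not a data row
        if not rows:
            continue  # header-only chunk: drop it

        # Rebuild the chunk as <thead> (repeated header) plus <tbody> (its rows).
        table = ET.Element("table")
        head_tr = ET.SubElement(ET.SubElement(table, "thead"), "tr")
        for text in header_cells:
            ET.SubElement(head_tr, "th").text = text
        ET.SubElement(table, "tbody").extend(rows)
        result.append(ET.tostring(table, encoding="unicode", method="html"))
    return result
```

Given the `text_as_html` strings of consecutive table chunks, the helper returns one HTML string per data chunk, each shaped like the example above. Real-world HTML that is not well-formed XML (for instance, unclosed `<br>` tags or named entities such as `&nbsp;`) would need a more forgiving parser.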
@hardchor hardchor added the bug Something isn't working label Nov 11, 2024
@scanny
Collaborator

scanny commented Nov 12, 2024

@hardchor yes, we've thought of doing that. Unfortunately, detecting whether headers are present and how long they are is really something that would need a model of its own to do reliably.

I'm changing this to an enhancement since the current behavior is the expected behavior. We'll track this and see how it fits into the roadmap.

@scanny scanny added enhancement New feature or request chunking Related to element chunking. and removed bug Something isn't working labels Nov 12, 2024
@scanny scanny changed the title bug/Chunking tables feat: repeat row headings in each table chunk Nov 12, 2024
@scanny
Collaborator

scanny commented Nov 12, 2024

@hardchor re: the TableChunk bit, I think you'll find that whenever a Table (Python) element is large enough to need splitting, it ends up as two or more TableChunk objects. However, when serialized to JSON, both Table and TableChunk elements get "type": "Table", so the two Python element types look the same in JSON form.
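To illustrate the point, here is a small self-contained sketch. The `Table` and `TableChunk` classes below are stand-ins written for this example, not the library's own classes; they only mimic the behavior described above, where both serialize with "type": "Table". The `summarize` helper is likewise hypothetical.

```python
from collections import Counter


class Table:
    """Stand-in for the real element class (assumption, for illustration only)."""

    def to_dict(self) -> dict:
        return {"type": "Table"}


class TableChunk:
    """Stand-in for the real chunk class; serializes with the same type name."""

    def to_dict(self) -> dict:
        return {"type": "Table"}


def summarize(elements) -> tuple[Counter, Counter]:
    """Count elements by Python class name and by serialized "type" value."""
    by_class = Counter(type(element).__name__ for element in elements)
    by_json_type = Counter(element.to_dict()["type"] for element in elements)
    return by_class, by_json_type


by_class, by_json_type = summarize([Table(), TableChunk(), TableChunk()])
```

Here `by_class` keeps the two kinds apart while `by_json_type` merges them, so checking for TableChunk has to happen on the Python objects before they are serialized.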
