The idea is to chunk a document intelligently, figure out for each chunk which subset of fields from the given schema can be extracted from it, and finally extract the metadata.
larch currently has `ChunkBasedMetadataExtractor`, but it's not efficient (and hence not being used anywhere) because:
- it extracts every field in the schema for every chunk, and
- the problem of the same field being extracted from multiple chunks isn't resolved.
For chunking, we can employ:
- smart chunking: embed each chunk and merge adjacent chunks whose embeddings are similar, building a super chunk (a more self-contained piece of the document).
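The merging step above can be sketched roughly as follows. This is a minimal, hypothetical illustration: it uses a toy bag-of-words embedding in place of a real sentence encoder, and `merge_chunks` and its threshold are made-up names, not larch API.

```python
import numpy as np

def embed(text: str, vocab: list[str]) -> np.ndarray:
    """Toy bag-of-words embedding; a real implementation would use a
    sentence encoder instead."""
    v = np.array([text.lower().split().count(w) for w in vocab], float)
    n = np.linalg.norm(v)
    return v / n if n else v

def merge_chunks(chunks: list[str], threshold: float = 0.3) -> list[str]:
    """Greedily merge adjacent chunks whose cosine similarity exceeds
    the threshold, producing 'super chunks'."""
    vocab = sorted({w for c in chunks for w in c.lower().split()})
    merged = [chunks[0]]
    for chunk in chunks[1:]:
        # Compare the candidate chunk against the current super chunk.
        sim = float(embed(merged[-1], vocab) @ embed(chunk, vocab))
        if sim >= threshold:
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)
    return merged
```

With real embeddings, the same greedy pass would group topically related neighbors while keeping unrelated sections apart.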
Then, for each chunk, we need to identify which fields are most relevant to it:
- we can embed both the chunk and the fields and compute which fields are relevant to which chunk;
- for fields that appear in multiple chunks, we can extract a response from each and then use another method/algorithm to choose the most relevant one.
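The two steps above could look something like this. Again a sketch under stated assumptions: the bag-of-words `embed`, the `route_fields`/`resolve` names, and the similarity-based tie-breaking (picking the answer from the most relevant chunk) are all hypothetical, not existing larch code.

```python
import numpy as np

def embed(text: str, vocab: list[str]) -> np.ndarray:
    """Toy bag-of-words embedding; stand-in for a real sentence encoder."""
    v = np.array([text.lower().split().count(w) for w in vocab], float)
    n = np.linalg.norm(v)
    return v / n if n else v

def route_fields(chunks: list[str], fields: dict[str, str],
                 threshold: float = 0.3) -> dict[str, list[tuple[int, float]]]:
    """Map each schema field to the (chunk index, score) pairs relevant
    enough to attempt extraction from, based on field-description/chunk
    embedding similarity."""
    vocab = sorted({w for t in chunks + list(fields.values())
                    for w in t.lower().split()})
    routing = {}
    for field, description in fields.items():
        fv = embed(description, vocab)
        scores = [(i, float(fv @ embed(c, vocab))) for i, c in enumerate(chunks)]
        routing[field] = [(i, s) for i, s in scores if s >= threshold]
    return routing

def resolve(candidates: list[tuple[str, float]]) -> str:
    """When the same field was extracted from several chunks, keep the
    answer that came from the highest-scoring chunk. (One possible
    resolution strategy; voting or an LLM judge are alternatives.)"""
    return max(candidates, key=lambda pair: pair[1])[0]
```

This routing means each chunk only triggers extraction for the fields it plausibly contains, instead of running the full schema per chunk.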