The idea is to chunk a document intelligently, figure out for each chunk which subset of fields from the given schema can be extracted from it, and finally extract the metadata.
larch currently has `ChunkBasedMetadataExtractor`, but it's not efficient (and hence not being used anywhere) because:
- it extracts every field in the schema for every chunk, and
- the problem of the same field being extracted from multiple chunks isn't resolved.
For chunking, we can employ:
- smart chunking: embed each chunk and merge adjacent chunks whose embeddings are similar, building a super chunk (a more self-contained piece of the document).
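The merging step above can be sketched roughly as follows. This is a minimal, hypothetical illustration: it uses a toy bag-of-words embedding in place of a real sentence encoder, and `merge_chunks` and its threshold are made-up names, not larch API.

```python
import numpy as np

def embed(text: str, vocab: list[str]) -> np.ndarray:
    """Toy bag-of-words embedding; a real implementation would use a
    sentence encoder instead."""
    v = np.array([text.lower().split().count(w) for w in vocab], float)
    n = np.linalg.norm(v)
    return v / n if n else v

def merge_chunks(chunks: list[str], threshold: float = 0.3) -> list[str]:
    """Greedily merge adjacent chunks whose cosine similarity exceeds
    the threshold, producing 'super chunks'."""
    vocab = sorted({w for c in chunks for w in c.lower().split()})
    merged = [chunks[0]]
    for chunk in chunks[1:]:
        # Compare the candidate chunk against the current super chunk.
        sim = float(embed(merged[-1], vocab) @ embed(chunk, vocab))
        if sim >= threshold:
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)
    return merged
```

With real embeddings, the same greedy pass would group topically related neighbors while keeping unrelated sections apart.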
Then, for each chunk, we need to identify which fields are most relevant to it:
- we can embed both the chunk and the fields and compute which fields are relevant to which chunk;
- for fields that appear in multiple chunks, we can extract a response from each and then use another method/algorithm to choose the most relevant one.
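The two steps above could look something like this. Again a sketch under stated assumptions: the bag-of-words `embed`, the `route_fields`/`resolve` names, and the similarity-based tie-breaking (picking the answer from the most relevant chunk) are all hypothetical, not existing larch code.

```python
import numpy as np

def embed(text: str, vocab: list[str]) -> np.ndarray:
    """Toy bag-of-words embedding; stand-in for a real sentence encoder."""
    v = np.array([text.lower().split().count(w) for w in vocab], float)
    n = np.linalg.norm(v)
    return v / n if n else v

def route_fields(chunks: list[str], fields: dict[str, str],
                 threshold: float = 0.3) -> dict[str, list[tuple[int, float]]]:
    """Map each schema field to the (chunk index, score) pairs relevant
    enough to attempt extraction from, based on field-description/chunk
    embedding similarity."""
    vocab = sorted({w for t in chunks + list(fields.values())
                    for w in t.lower().split()})
    routing = {}
    for field, description in fields.items():
        fv = embed(description, vocab)
        scores = [(i, float(fv @ embed(c, vocab))) for i, c in enumerate(chunks)]
        routing[field] = [(i, s) for i, s in scores if s >= threshold]
    return routing

def resolve(candidates: list[tuple[str, float]]) -> str:
    """When the same field was extracted from several chunks, keep the
    answer that came from the highest-scoring chunk. (One possible
    resolution strategy; voting or an LLM judge are alternatives.)"""
    return max(candidates, key=lambda pair: pair[1])[0]
```

This routing means each chunk only triggers extraction for the fields it plausibly contains, instead of running the full schema per chunk.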