Chunking: include_orig_elements not showing element ids of orig_elements #2887
-
Hi! I would love to get some information on this as well!
-
Hey @sachink2010, I just had my own WTF moment with the library. When a chunk is dumped to JSON, orig_elements is not written out as a readable list of elements; the objects themselves are serialized into a single encoded string. For this reason you won't be able to view the list of elements in the JSON directly. You can read the JSON with:
Afterwards you can access orig_elements like this:
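As a sketch of what is going on with the serialized value: assuming the `orig_elements` string in the `to_dict()` output is base64-encoded, zlib-compressed JSON (the `eJ...` prefix is typical of that format, but this is an assumption, not something confirmed in this thread), it can be decoded with the standard library alone. The helper names `encode_orig_elements` and `decode_orig_elements` are hypothetical and made up for illustration; the encoder is only there so the example is self-contained.

```python
import base64
import json
import zlib


def encode_orig_elements(element_dicts):
    """Hypothetical helper: JSON -> zlib -> base64 (the assumed format)."""
    payload = json.dumps(element_dicts, sort_keys=True).encode("utf-8")
    return base64.b64encode(zlib.compress(payload)).decode("utf-8")


def decode_orig_elements(blob):
    """Hypothetical helper: reverse the assumed base64 + zlib + JSON encoding."""
    payload = zlib.decompress(base64.b64decode(blob))
    return json.loads(payload.decode("utf-8"))


# Stand-in data; a real chunk would carry the dicts of its original elements.
sample = [
    {"type": "Title", "element_id": "abc123", "text": "Some Dummy Title"},
    {"type": "NarrativeText", "element_id": "def456", "text": "Some Dummy Text"},
]

blob = encode_orig_elements(sample)
decoded = decode_orig_elements(blob)

# Each decoded entry should expose its own element_id.
element_ids = [d["element_id"] for d in decoded]
print(element_ids)
```

With a real chunk, the string to decode would be `chunks[3].to_dict()["metadata"]["orig_elements"]`. If I remember correctly, `unstructured.staging.base` also has an `elements_from_base64_gzipped_json` helper that returns element objects instead of dicts, but please check that against your installed version.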
-
I am trying to use chunk_by_title with the option include_orig_elements=True, but the results I am getting are not clear to me.
`from unstructured.chunking.title import chunk_by_title
chunks = chunk_by_title(elements, combine_text_under_n_chars=500, multipage_sections=True, include_orig_elements=True)
chunks[3].to_dict()
# Output: the orig_elements value in the chunk's metadata looks like random data:
{'type': 'CompositeElement','element_id': '5f143fae7becf4061d5d98b4917c4abb',
'text': 'Some Dummy Text',
'metadata': {'filename': 'Att01.pdf',
'filetype': 'application/pdf',
'languages': ['eng'],
'last_modified': '2024-04-14T09:24:50',
'page_number': 3,
'orig_elements': 'eJzdVN9r2zAQ/leEn1tPtiX/6FvJ9lAGbdmyvYQSZPmciMiSJ8ttw9j/vpOSlTBCYYPC2FN8332n+073KavvCWgYwPi16pIrktRd0Ra0zHhOqYCKNRXLWV23rGnzrijr5IIkA3jRCS+Q/z2R1rpOGeFhirEWezv79RbUZusRyYuiwZoj/KQ6v0U0KzlDdLTK+FC3WmW8TosLUtAmbR4uyEvMmpSGmNVFSs/EkY9AMu0nD0OY4V49g/48CgnJD0x04EF6Zc1aajFN69HZFmk0rUrGayT0SoMRA4Taa+9plo5dnxwTfj/GhBhHraQI57w7prUwm1ls4uCrBMwmeYjo5NeD7VSvIF5pTnN2Sdllxpa0ucrZFaehesTKtZmHFhyyiqDUw3O4siQjN7fLT3fvvyyWN3e3gfxLxVJ5Haf6fW9QVbRo61aWvJRVIUXW8ExmZc/LnFGevdne8qxOc9xDiT9hL8eYFcXLHlmanQEOFa9u7p9YTETcH7yQ000ut2oi0wgSmx40EmkfwU3Eb4EMyqhhHkgHk9pgxikPTgli+5j+6HYaO2tFPmjYead2xIFXA7kmX8G1wgiysEOrDHRksZcayL19AkfucX5PtMWGmBH+5KCl26k9nDrqVjiHwh5hGSSfcVbd8IaXhaC8yopW5B1vckGrtmuaPM+oeGtnMVam2YmzOD/G0Uj0DHCo+Mv/hKap+H9iPSBjdMK0FVqTFtB6ZvJulsEW6EQ/O7ODPWnFhC4VpjsylZF67oCEb/g2qzG4gTxZtzuQwD0qCRPyCG4aDectGtPEr9A0mvDQup+13mPfYcRbg1jezybevNCkt45MogcSvClafSAgewAnFRLsCO7wauZ4fFCEQ3QqYAfJDsLz8lGNkBIFCSMhSApaFtZ4J6R/xfEPPwGXqSic'}}
`
Could someone explain how to use chunk_by_title with include_orig_elements correctly, so that I can see the element ids of the original elements?