Markdown `AdapterStep` for chunking by heading #26

olirice · 2023-07-11T14:50:12Z

Context

Markdown is a common format for documents ingested into vector systems and has more exploitable structure than simple text.

This task is to create a `vecs.adapters.base.AdapterStep` that handles chunking markdown by heading.

Ideally it would also accept parameters for:

The maximum number of words in each chunk

e.g.

from vecs.adapters import MarkdownChunker

MarkdownChunker(
  max_tokens=512
)
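For illustration only, here is a minimal standalone sketch of the chunking logic the step could wrap. It does not use the vecs API: `chunk_markdown_by_heading` and `HEADING_RE` are hypothetical names, "tokens" are approximated as whitespace-separated words, and a heading is assumed to be one to six `#` characters followed by a space.

```python
import re
from typing import List

# Assumption: a heading line is 1-6 '#' characters, then at least one space,
# then some non-space text.
HEADING_RE = re.compile(r"^#{1,6} +\S")


def chunk_markdown_by_heading(text: str, max_tokens: int = 512) -> List[str]:
    """Split markdown at headings, capping each chunk at max_tokens words.

    Hypothetical helper, not part of vecs. Words stand in for tokens.
    """
    # Group lines into sections; each heading line opens a new section.
    sections: List[List[str]] = [[]]
    for line in text.splitlines():
        if HEADING_RE.match(line) and sections[-1]:
            sections.append([])
        sections[-1].append(line)

    chunks: List[str] = []
    for lines in sections:
        section = "\n".join(lines).strip()
        if not section:
            continue
        words = section.split()
        if len(words) <= max_tokens:
            # Small enough: keep the section text (and its line breaks) intact.
            chunks.append(section)
        else:
            # Too long: fall back to fixed-size windows of words.
            for start in range(0, len(words), max_tokens):
                chunks.append(" ".join(words[start:start + max_tokens]))
    return chunks
```

This sketch does not treat fenced code blocks specially, so a `# comment` line inside a code fence would be taken for a heading. A real implementation would need to track fences.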


jbritain · 2023-07-11T20:34:27Z

Will give this a go if wanted

jbritain · 2023-07-24T23:17:28Z

According to the Markdown Guide:

Markdown applications don’t agree on how to handle a missing space between the number signs (#) and the heading name. For compatibility, always put a space between the number signs and the heading name.

Should we count headings without this space as valid?

###like this

olirice · 2023-07-25T15:35:49Z

My vote would be to require the space.

You could reference https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/markdown_header_metadata for ideas.
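As a rough sketch of what "require the space" could look like in practice, the pattern below only accepts a line as a heading when the number signs are followed by whitespace. `parse_heading` and `HEADING_WITH_SPACE` are hypothetical names for illustration, not part of vecs or LangChain, and empty headings (a bare `#`) are rejected here as a simplifying assumption.

```python
import re
from typing import Optional, Tuple

# 1-6 number signs, then mandatory whitespace, then the heading text.
HEADING_WITH_SPACE = re.compile(r"^(#{1,6})[ \t]+(.*\S)[ \t]*$")


def parse_heading(line: str) -> Optional[Tuple[int, str]]:
    """Return (level, title) for a heading line, or None if it is not one."""
    match = HEADING_WITH_SPACE.match(line)
    if match is None:
        return None
    return len(match.group(1)), match.group(2)
```

With this rule `### like this` parses as a level-3 heading, while `###like this` is treated as ordinary text.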

olirice self-assigned this Jul 11, 2023

jbritain mentioned this issue Jul 29, 2023

add markdown chunker #32

Merged
