Not all links are found when custom `target_content` is set. #1

spaceemotion · 2023-11-08T09:25:30Z

I wanted to try out this package for an internal RAG presentation based on some of our company website data. Our website has the typical structure: header, main content and footer. Most of the links are in the header and footer, but if I set `target_content` to only the main container, no links are found.

Is there a way to have all URLs collected, including those outside of the container I ultimately want to parse into markdown? When I don't specify the target, I get quite a mess of navigation and footer link data that repeats on every page (and would have to be cleaned up for every single page).

Thanks in advance!

Edit: I think it was actually the combination of the domain and base path flags with the target content that returned zero results. Once I removed all the extra flags, the target content worked as advertised.
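
For illustration, here is a minimal sketch of the behaviour asked about above: links are collected from the whole page, while text is kept only from the target element. It uses only Python's standard library and is not md_crawler's implementation. The class name `LinkAndTargetParser`, the sample HTML and the `example.com` URLs are hypothetical, and the sketch assumes well-formed HTML with a single target tag.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class LinkAndTargetParser(HTMLParser):
    """Hypothetical helper: gather all links, keep text from one tag only."""

    def __init__(self, base_url, target_tag):
        super().__init__()
        self.base_url = base_url
        self.target_tag = target_tag
        self.depth = 0        # > 0 while we are inside the target element
        self.links = []       # absolute URLs found anywhere on the page
        self.text_parts = []  # raw text found inside the target element

    def handle_starttag(self, tag, attrs):
        # Links are recorded regardless of where they appear.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))
        if tag in VOID_TAGS:
            return
        # Start counting depth at the target tag, then for every nested tag.
        if self.depth > 0 or tag == self.target_tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text is recorded only while inside the target element.
        if self.depth > 0:
            self.text_parts.append(data)


sample_html = """
<header><a href="/about">About</a></header>
<main>
<h1>Docs</h1>
<p>See the <a href="/guide">guide</a>.</p>
</main>
<footer><a href="https://example.com/contact">Contact</a></footer>
"""

parser = LinkAndTargetParser("https://example.com/", "main")
parser.feed(sample_html)

links = parser.links
# Collapse runs of whitespace so the extracted text reads as one line.
text = " ".join("".join(parser.text_parts).split())
```

Here `links` contains the header, main and footer URLs, while `text` contains only what sits inside `<main>`, so the navigation and footer text does not repeat in each page's output.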


paulpierre · 2023-11-10T20:06:03Z

hey @spaceemotion 👋 ... stoked you're giving md_crawler a try!

what would help tremendously is if you provided a sample url to html (or DM/email me a zip) where this is failing and args you tested. it'll make debugging easier. thx!

github-staff deleted a comment May 12, 2024
