Not all links are found when custom target_content is set. #1

Open
spaceemotion opened this issue Nov 8, 2023 · 1 comment

spaceemotion commented Nov 8, 2023

I wanted to try out this package for an internal RAG presentation based on some of our company website data. Our website has the typical structure: header, main content, and footer. Most of the links are in the header and footer, but if I set target_content to only the main container, no links are found.

Is there a way to collect all URLs, including those outside the container that I ultimately want to convert to markdown? When I don't specify the target, I get a mess of navigation and footer link data that repeats on every page and would have to be cleaned up for each one.

Thanks in advance!


Edit: I think it was actually the combination of the domain and base path flags with the target content that returned zero results. Once I removed the extra flags, the target content worked as advertised.
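
For readers with the same question, the behaviour asked about above (discover links across the whole page, but keep content only from one container) can be sketched with Python's standard library. This is a standalone illustration, not md_crawler's implementation: the class `LinkAndTargetParser`, the helper `extract`, and the idea of passing a bare tag name such as `main` as the target are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTargetParser(HTMLParser):
    """Collect every <a href> on the page, but keep text only from one container."""

    def __init__(self, base_url, target_tag):
        super().__init__()
        self.base_url = base_url
        self.target_tag = target_tag  # hypothetical: a bare tag name, e.g. "main"
        self.links = []
        self.text_parts = []
        self._depth = 0  # greater than 0 while inside the target container

    def handle_starttag(self, tag, attrs):
        # Links are recorded regardless of where they appear in the document.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))
        if tag == self.target_tag:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == self.target_tag and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        # Text is kept only while inside the target container.
        if self._depth > 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(html, base_url, target_tag):
    """Return (all links on the page, text found inside the target container)."""
    parser = LinkAndTargetParser(base_url, target_tag)
    parser.feed(html)
    return parser.links, " ".join(parser.text_parts)


sample = (
    '<header><a href="/about">About</a></header>'
    '<main><p>Hello <a href="/docs">docs</a></p></main>'
    '<footer><a href="https://example.com/contact">Contact</a></footer>'
)
links, text = extract(sample, "https://example.com/", "main")
print(links)
print(text)
```

Separating link discovery from content extraction means the header and footer links can still feed the crawl queue while the repeated navigation text stays out of the converted output.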

@paulpierre
Owner

hey @spaceemotion 👋 ... stoked you're giving md_crawler a try!

what would help tremendously is a sample url to the html (or DM/email me a zip) where this is failing, plus the args you tested. it'll make debugging easier. thx!

@github-staff github-staff deleted a comment May 12, 2024