What about crawlers? #240

Open
rgaudin opened this issue Aug 27, 2024 · 3 comments
Labels
question Further information is requested

Comments

@rgaudin
Member

rgaudin commented Aug 27, 2024

Last night the Kiwix Wiki monitor was constantly throwing errors (Connection Timeout, 502).
The service did not restart but apparently had difficulty handling a high number of requests.
Peeking at the live logs for a few seconds, I saw continuous requests from crawlers: Bytedance, Amazon, Claude, OpenAI and Bing were all mentioned in that short window.

I added a deny-all robots.txt for both Wikis as there was none, but I suppose crawlers don't check for it frequently (if they do at all). Nevertheless, about 30 minutes after that, a successful monitor check was seen.
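
For reference, a minimal sketch of what such a deny-all robots.txt looks like (the exact file deployed on the wikis is not shown in this issue, so treat the content below as an assumption of its shape):

```
# Deny-all robots.txt: asks every compliant crawler to stay away.
# Individual bots could instead be targeted by name
# (e.g. GPTBot, ClaudeBot, Bytespider, Amazonbot, Bingbot).
User-agent: *
Disallow: /
```

Note that this only affects crawlers that honour robots.txt, and they re-fetch it on their own schedule, which would explain the delay before the errors stopped.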

Now that these crawlers are more frequent, more widespread, and impacting our infrastructure, we might want to discuss what to do. Generalizing robots.txt seems in order. Should we do more?

rgaudin added the question label on Aug 27, 2024
@benoit74
Collaborator

benoit74 commented Sep 2, 2024

I don't think that removing our wikis from search engines is the right move. These wikis hold very important information for newcomers, they are already hard to find, and it would only get worse without search-engine indexing.

While "AI crawlers" are probably less relevant, I think it could still be useful to let them proceed as well, given that tools like LLMs might soon replace search engines in many users' workflows when looking for information.

I think the problem is rather that our wiki is not able to cope with the load "imposed" by these crawlers, and we need to find a solution for that.

Did you manage to confirm that adding a robots.txt reduced the load / occurrences of connection timeouts?

@rgaudin
Member Author

rgaudin commented Sep 2, 2024

I did not look at the load evolution, but after 24 hours with the change there has not been any UptimeRobot alert, so there is a correlation.

I share your opinion that this is original content that has value and should be indexed. Fixing our MediaWiki setup is probably the proper thing to do… but not the cheapest.

@benoit74
Collaborator

benoit74 commented Sep 2, 2024

I never said I was happy about having to find time to fix the MediaWiki setup, or that it was going to be an easy feat ^^
