Feat: process_start_urls in parallel #159

Open
wants to merge 2 commits into
base: master
from

Conversation


@aircloud aircloud commented Jun 7, 2023

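process_start_urls currently blocks the workers below it, which means the user must have every URL available up front.
Sometimes URLs are fetched continuously from an API or a message queue and are not all available at the start.

So my change is to run it in parallel with the workers. When the author has time, please check whether this approach causes any problems.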
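
To illustrate the idea of feeding URLs while workers are already running, here is a minimal sketch in plain asyncio. It is not ruia's implementation and not the code in this PR; `url_source`, `producer`, `worker`, and `crawl` are hypothetical names, and the `asyncio.sleep` calls stand in for a real message queue and real requests.

```python
import asyncio


async def url_source():
    # Hypothetical stand-in for an API or message queue that yields URLs over time.
    for i in range(5):
        await asyncio.sleep(0.01)
        yield f"https://example.com/page/{i}"


async def producer(queue, n_workers):
    # Runs alongside the workers instead of finishing before they start.
    async for url in url_source():
        await queue.put(url)
    for _ in range(n_workers):
        await queue.put(None)  # one stop marker per worker


async def worker(queue, results):
    while True:
        url = await queue.get()
        if url is None:
            break
        await asyncio.sleep(0.02)  # placeholder for the real request
        results.append(url)


async def crawl(n_workers=2):
    queue = asyncio.Queue()
    results = []
    # The producer and the workers are scheduled together, so a worker can
    # start on the first URL before the last URL has even been produced.
    await asyncio.gather(
        producer(queue, n_workers),
        *(worker(queue, results) for _ in range(n_workers)),
    )
    return results


if __name__ == "__main__":
    print(asyncio.run(crawl()))
```

The design choice here is that the producer is just another task next to the workers, so nothing has to wait for the full URL list.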

@howie6879
Owner

howie6879 commented Jun 8, 2023

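This problem does exist: ruia fetches all the URLs at once before it starts crawling.

Sometimes URLs are fetched continuously from an API or a message queue and are not all available at the start.

Why not handle this in the outer layer, handing each URL to the Spider class implemented by ruia as soon as you get it? For example: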

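from ruia import Spider


class DemoSpider(Spider):
    # parse() and the rest of the spider logic are omitted here
    pass


async def main(mq_urls):
    # mq_urls: any async iterable of URLs, e.g. a message queue consumer
    async for url in mq_urls:
        DemoSpider.start_urls = [url]
        await DemoSpider.async_start()  # awaitable counterpart of start()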

Or did I misunderstand what you meant?

@aircloud
Author

aircloud commented Jun 8, 2023

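With the approach above, as I understand it, the flow becomes: fetch one URL, crawl it, then fetch the next one.

Doesn't that make it impossible to use ruia's concurrency?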
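
The concern can be made concrete with a small asyncio sketch. It does not use ruia; `fake_crawl`, `sequential`, and `concurrent` are hypothetical names, and `fake_crawl` is an assumed stand-in for running one spider on one URL. It records the peak number of crawls in flight for each approach.

```python
import asyncio


async def fake_crawl(url, stats):
    # Hypothetical stand-in for running one spider on one URL.
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(0.01)
    stats["active"] -= 1


async def sequential(urls):
    stats = {"active": 0, "peak": 0}
    for url in urls:
        # Each crawl must finish before the next URL is taken.
        await fake_crawl(url, stats)
    return stats["peak"]


async def concurrent(urls, limit=3):
    stats = {"active": 0, "peak": 0}
    sem = asyncio.Semaphore(limit)  # caps the number of crawls in flight

    async def bounded(url):
        async with sem:
            await fake_crawl(url, stats)

    await asyncio.gather(*(bounded(u) for u in urls))
    return stats["peak"]


if __name__ == "__main__":
    urls = [f"https://example.com/{i}" for i in range(6)]
    print("sequential peak:", asyncio.run(sequential(urls)))
    print("concurrent peak:", asyncio.run(concurrent(urls)))
```

Awaiting each crawl inside the loop never has more than one crawl in flight, while scheduling them together lets up to `limit` run at once.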

3 participants