
Block images while using PlaywrightCrawler #848

Open
Ehsan-U opened this issue Dec 29, 2024 · 3 comments
Labels
t-tooling Issues with this label are in the ownership of the tooling team.

Comments

Ehsan-U commented Dec 29, 2024

Playwright has support for aborting image resource requests; it would be great to have an option for this during initialization of PlaywrightCrawler, e.g.:

block_images: bool = True

https://playwright.dev/python/docs/api/class-route#route-abort
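
For reference, a minimal sketch of the underlying Playwright API from the linked docs (the glob pattern and target URL here are only illustrative; block_images itself is the proposed option, not an existing parameter):

import asyncio

from playwright.async_api import async_playwright


async def main() -> None:
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        # Abort requests for common image extensions; everything else loads normally.
        await page.route('**/*.{png,jpg,jpeg,gif,webp,svg}', lambda route: route.abort())
        await page.goto('https://news.ycombinator.com/')
        await browser.close()


if __name__ == '__main__':
    asyncio.run(main())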

github-actions bot added the t-tooling label on Dec 29, 2024
Pijukatel (Contributor) commented Dec 29, 2024

I believe it is already possible to do so if you define a custom PlaywrightBrowserController.

I modified one of the existing example scripts to use a custom browser controller. Please see ImageBlockerPlaywrightBrowserController in the example below; you should be able to run it out of the box. I set headless to False so that you can easily check that the images are not loaded.

Does it work for your use case?
Maybe we should add some advanced examples to the documentation about custom browser controllers and custom browser plugins?

from __future__ import annotations

import asyncio
from typing import Any, Mapping

from playwright.async_api import Page, Request, Route
from typing_extensions import override

from crawlee._utils.context import ensure_context
from crawlee.browsers import BrowserPool, PlaywrightBrowserController, PlaywrightBrowserPlugin
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyInfo


class ImageBlockerPlaywrightBrowserController(PlaywrightBrowserController):
    """Custom browser controller that installs an image-blocking route on every new page."""

    @override
    async def new_page(
        self,
        browser_new_context_options: Mapping[str, Any] | None = None,
        proxy_info: ProxyInfo | None = None,
    ) -> Page:
        page = await super().new_page(browser_new_context_options, proxy_info)
        # Route all requests through the image-blocking handler.
        await page.route('**/*', self.block_images)
        return page

    async def block_images(self, route: Route, request: Request) -> None:
        # Abort requests for image resources; let everything else continue.
        if request.resource_type == 'image':
            await route.abort()
            return
        await route.continue_()


class CustomPlaywrightBrowserPlugin(PlaywrightBrowserPlugin):
    """Some custom browser plugin to allow custom browser controller."""

    @override
    @ensure_context
    async def new_browser(self) -> ImageBlockerPlaywrightBrowserController:
        if not self._playwright:
            raise RuntimeError('Playwright browser plugin is not initialized.')

        if self._browser_type == 'chromium':
            browser = await self._playwright.chromium.launch(**self._browser_launch_options)
        elif self._browser_type == 'firefox':
            browser = await self._playwright.firefox.launch(**self._browser_launch_options)
        elif self._browser_type == 'webkit':
            browser = await self._playwright.webkit.launch(**self._browser_launch_options)
        else:
            raise ValueError(f'Invalid browser type: {self._browser_type}')

        return ImageBlockerPlaywrightBrowserController(
            browser,
            max_open_pages_per_browser=self._max_open_pages_per_browser,
        )


async def main() -> None:
    crawler = PlaywrightCrawler(
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
        # Custom browser pool. This gives users full control over browsers used by the crawler.
        browser_pool=BrowserPool(
            plugins=[CustomPlaywrightBrowserPlugin(browser_launch_options={'headless': False})]
        ),
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract some data from the page using Playwright's API.
        posts = await context.page.query_selector_all('.athing')
        for post in posts:
            # Get the HTML element for the title within each post.
            title_element = await post.query_selector('.title a')

            # Extract the data we want from the element.
            title = await title_element.inner_text() if title_element else None

            # Push the extracted data to the default dataset.
            await context.push_data({'title': title})

        # Find a link to the next page and enqueue it if it exists.
        await context.enqueue_links(selector='.morelink')

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://news.ycombinator.com/'])


if __name__ == '__main__':
    asyncio.run(main())

Pijukatel (Contributor) commented Dec 30, 2024

This can also be handled by defining a custom pre-navigation hook on the crawler, which is simpler than the example above and probably more suitable for this specific case. Just add this hook to an existing Playwright crawler:

from crawlee.crawlers import PlaywrightPreNavCrawlingContext


@crawler.pre_navigation_hook
async def block_images_hook(context: PlaywrightPreNavCrawlingContext) -> None:
    async def block_images(route, request):
        # Abort requests for image resources; let everything else continue.
        if request.resource_type == 'image':
            await route.abort()
            return
        await route.continue_()

    await context.page.route('**/*', block_images)

https://crawlee.dev/python/docs/examples/playwright-crawler

@manaschakrabortty

try:
    await route.abort()
except Exception as e:
    print(f'Failed to abort route: {e}')
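
If the concern is that route.abort() can fail (for example when the request has already been handled), the guard could be folded into the pre-navigation hook from the previous comment. A rough sketch along those lines:

@crawler.pre_navigation_hook
async def block_images_hook(context: PlaywrightPreNavCrawlingContext) -> None:
    async def block_images(route, request):
        if request.resource_type == 'image':
            try:
                await route.abort()
            except Exception as e:
                # Aborting can fail, e.g. if the request was already handled.
                context.log.warning(f'Failed to abort route: {e}')
            return
        await route.continue_()

    await context.page.route('**/*', block_images)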
