
Block images while using PlaywrightCrawler #848

Open
Ehsan-U opened this issue Dec 29, 2024 · 3 comments
Labels
t-tooling Issues with this label are in the ownership of the tooling team.

Comments

Ehsan-U commented Dec 29, 2024

Playwright has support for aborting image resource requests; it would be great to have an option for this during initialization of PlaywrightCrawler, e.g.:

block_images: bool = True

https://playwright.dev/python/docs/api/class-route#route-abort
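
For reference, a minimal sketch of the underlying Playwright API from the linked docs (the glob pattern and target URL here are only illustrative; block_images itself is the proposed option, not an existing parameter):

import asyncio

from playwright.async_api import async_playwright


async def main() -> None:
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        # Abort requests for common image extensions; everything else loads normally.
        await page.route('**/*.{png,jpg,jpeg,gif,webp,svg}', lambda route: route.abort())
        await page.goto('https://news.ycombinator.com/')
        await browser.close()


if __name__ == '__main__':
    asyncio.run(main())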

github-actions bot added the t-tooling label on Dec 29, 2024
Pijukatel (Contributor) commented Dec 29, 2024

I believe it is already possible to do so if you define a custom PlaywrightBrowserController.

I modified one of the existing example scripts to use a custom browser controller. Please see ImageBlockerPlaywrightBrowserController in the example below; you should be able to run it out of the box. I set headless to False so that you can easily check that the images are not loaded.

Does it work for your use case?
Maybe we should add some advanced examples to the documentation about custom browser controllers and custom browser plugins?

from __future__ import annotations

import asyncio
from typing import Any, Mapping

from playwright.async_api import Page, Request, Route
from typing_extensions import override

from crawlee._utils.context import ensure_context
from crawlee.browsers import BrowserPool, PlaywrightBrowserController, PlaywrightBrowserPlugin
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyInfo


class ImageBlockerPlaywrightBrowserController(PlaywrightBrowserController):
    """Custom browser controller that installs an image-blocking route on every new page."""

    @override
    async def new_page(
        self,
        browser_new_context_options: Mapping[str, Any] | None = None,
        proxy_info: ProxyInfo | None = None,
    ) -> Page:
        page = await super().new_page(browser_new_context_options, proxy_info)
        # Route all requests through the image-blocking handler.
        await page.route('**/*', self.block_images)
        return page

    async def block_images(self, route: Route, request: Request) -> None:
        # Abort requests for image resources; let everything else continue.
        if request.resource_type == 'image':
            await route.abort()
            return
        await route.continue_()


class CustomPlaywrightBrowserPlugin(PlaywrightBrowserPlugin):
    """Some custom browser plugin to allow custom browser controller."""

    @override
    @ensure_context
    async def new_browser(self) -> ImageBlockerPlaywrightBrowserController:
        if not self._playwright:
            raise RuntimeError('Playwright browser plugin is not initialized.')

        if self._browser_type == 'chromium':
            browser = await self._playwright.chromium.launch(**self._browser_launch_options)
        elif self._browser_type == 'firefox':
            browser = await self._playwright.firefox.launch(**self._browser_launch_options)
        elif self._browser_type == 'webkit':
            browser = await self._playwright.webkit.launch(**self._browser_launch_options)
        else:
            raise ValueError(f'Invalid browser type: {self._browser_type}')

        return ImageBlockerPlaywrightBrowserController(
            browser,
            max_open_pages_per_browser=self._max_open_pages_per_browser,
        )


async def main() -> None:
    crawler = PlaywrightCrawler(
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
        # Custom browser pool. This gives users full control over browsers used by the crawler.
        browser_pool=BrowserPool(
            plugins=[CustomPlaywrightBrowserPlugin(browser_launch_options={'headless': False})]
        ),
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract some data from the page using Playwright's API.
        posts = await context.page.query_selector_all('.athing')
        for post in posts:
            # Get the HTML element for the title within each post.
            title_element = await post.query_selector('.title a')

            # Extract the data we want from the element.
            title = await title_element.inner_text() if title_element else None

            # Push the extracted data to the default dataset.
            await context.push_data({'title': title})

        # Find a link to the next page and enqueue it if it exists.
        await context.enqueue_links(selector='.morelink')

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://news.ycombinator.com/'])


if __name__ == '__main__':
    asyncio.run(main())

Pijukatel (Contributor) commented Dec 30, 2024

This can also be handled by defining a custom pre-navigation hook on the crawler, which is simpler than the example above and probably more suitable for this specific case. Just add this hook to an existing Playwright crawler:

from crawlee.crawlers import PlaywrightPreNavCrawlingContext


@crawler.pre_navigation_hook
async def block_images_hook(context: PlaywrightPreNavCrawlingContext) -> None:
    async def block_images(route, request):
        # Abort requests for image resources; let everything else continue.
        if request.resource_type == 'image':
            await route.abort()
            return
        await route.continue_()

    await context.page.route('**/*', block_images)

https://crawlee.dev/python/docs/examples/playwright-crawler

@manaschakrabortty

try:
    await route.abort()
except Exception as e:
    print(f'Failed to abort route: {e}')
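
If the concern is that route.abort() can fail (for example when the request has already been handled), the guard could be folded into the pre-navigation hook from the previous comment. A rough sketch along those lines:

@crawler.pre_navigation_hook
async def block_images_hook(context: PlaywrightPreNavCrawlingContext) -> None:
    async def block_images(route, request):
        if request.resource_type == 'image':
            try:
                await route.abort()
            except Exception as e:
                # Aborting can fail, e.g. if the request was already handled.
                context.log.warning(f'Failed to abort route: {e}')
            return
        await route.continue_()

    await context.page.route('**/*', block_images)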
