Block images while using PlaywrightCrawler #848
Comments
I believe this is already possible if you define a custom `PlaywrightBrowserController`. I modified one of the existing examples to use a custom browser controller; see `ImageBlockerPlaywrightBrowserController` below. You should be able to run the example out of the box. I set `headless` to `False` so that you can easily check that the images are not loaded. Does it work for your use case?

    import asyncio
    from typing import Any, Mapping

    from playwright.async_api import Page, Request, Route
    from typing_extensions import override

    from crawlee._utils.context import ensure_context
    from crawlee.browsers import BrowserPool, PlaywrightBrowserController, PlaywrightBrowserPlugin
    from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
    from crawlee.proxy_configuration import ProxyInfo


    class ImageBlockerPlaywrightBrowserController(PlaywrightBrowserController):
        """Custom browser controller that installs a route on every new page."""

        @override
        async def new_page(
            self,
            browser_new_context_options: Mapping[str, Any] | None = None,
            proxy_info: ProxyInfo | None = None,
        ) -> Page:
            page = await super().new_page(browser_new_context_options, proxy_info)
            # Intercept every request made by the page.
            await page.route('**/*', self.block_images)
            return page

        async def block_images(self, route: Route, request: Request) -> None:
            # Abort image requests and let everything else through.
            if request.resource_type == 'image':
                await route.abort()
                return
            await route.continue_()


    class CustomPlaywrightBrowserPlugin(PlaywrightBrowserPlugin):
        """Custom browser plugin that returns the custom browser controller."""

        @override
        @ensure_context
        async def new_browser(self) -> ImageBlockerPlaywrightBrowserController:
            if not self._playwright:
                raise RuntimeError('Playwright browser plugin is not initialized.')

            if self._browser_type == 'chromium':
                browser = await self._playwright.chromium.launch(**self._browser_launch_options)
            elif self._browser_type == 'firefox':
                browser = await self._playwright.firefox.launch(**self._browser_launch_options)
            elif self._browser_type == 'webkit':
                browser = await self._playwright.webkit.launch(**self._browser_launch_options)
            else:
                raise ValueError(f'Invalid browser type: {self._browser_type}')

            return ImageBlockerPlaywrightBrowserController(
                browser,
                max_open_pages_per_browser=self._max_open_pages_per_browser,
            )


    async def main() -> None:
        crawler = PlaywrightCrawler(
            # Limit the crawl to max requests. Remove or increase it for crawling all links.
            max_requests_per_crawl=10,
            # Custom browser pool. This gives users full control over browsers used by the crawler.
            browser_pool=BrowserPool(
                plugins=[CustomPlaywrightBrowserPlugin(browser_launch_options={'headless': False})]
            ),
        )

        # Define the default request handler, which will be called for every request.
        @crawler.router.default_handler
        async def request_handler(context: PlaywrightCrawlingContext) -> None:
            context.log.info(f'Processing {context.request.url} ...')

            # Extract some data from the page using Playwright's API.
            posts = await context.page.query_selector_all('.athing')
            for post in posts:
                # Get the HTML element for the title within each post.
                title_element = await post.query_selector('.title a')

                # Extract the data we want from the element.
                title = await title_element.inner_text() if title_element else None

                # Push the extracted data to the default dataset.
                await context.push_data({'title': title})

            # Find a link to the next page and enqueue it if it exists.
            await context.enqueue_links(selector='.morelink')

        # Run the crawler with the initial list of URLs.
        await crawler.run(['https://news.ycombinator.com/'])


    if __name__ == '__main__':
        asyncio.run(main())
This can also be handled by defining a custom pre-navigation hook on the crawler, which is simpler than the example above and probably more suitable for this specific case. Just add this hook to an existing Playwright crawler:

    # Requires: from crawlee.crawlers import PlaywrightPreNavCrawlingContext
    @crawler.pre_navigation_hook
    async def block_images_hook(context: PlaywrightPreNavCrawlingContext) -> None:
        async def block_images(route, request):
            # Abort image requests and let everything else through.
            if request.resource_type == 'image':
                await route.abort()
                return
            await route.continue_()

        await context.page.route('**/*', block_images)
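
The handler used in the hook above can be generalized to block several resource types at once. The following is a minimal sketch, not part of crawlee or Playwright: `make_resource_blocker` and its `blocked_types` parameter are hypothetical names chosen for illustration. It relies only on `request.resource_type`, `route.abort()` and `route.continue_()`, which the hook above already uses.

```python
from collections.abc import Awaitable, Callable, Iterable
from typing import Any


def make_resource_blocker(
    blocked_types: Iterable[str] = ('image',),
) -> Callable[[Any, Any], Awaitable[None]]:
    """Build a route handler that aborts requests of the given resource types.

    Hypothetical helper for illustration; not part of crawlee or Playwright.
    """
    blocked = frozenset(blocked_types)

    async def handler(route: Any, request: Any) -> None:
        # Abort requests whose type is blocked, let everything else through.
        if request.resource_type in blocked:
            await route.abort()
        else:
            await route.continue_()

    return handler
```

Inside the hook, it could then be registered with something like `await context.page.route('**/*', make_resource_blocker({'image', 'media', 'font'}))`.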
Playwright supports aborting image resource requests. It would be great to have an option for this when initializing PlaywrightCrawler, for example:

    block_images: bool = True

https://playwright.dev/python/docs/api/class-route#route-abort
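
Until such an option exists, the requested flag could be approximated in user code. The sketch below is hypothetical: `make_block_images_hook` and its `block_images` parameter are illustrative names, not crawlee API. It assumes the pre-navigation hook receives a context that exposes `page`, as in the hook shown in the comments.

```python
from collections.abc import Awaitable, Callable
from typing import Any


def make_block_images_hook(block_images: bool = True) -> Callable[[Any], Awaitable[None]]:
    """Return a pre-navigation hook that honors a `block_images` flag.

    Hypothetical helper for illustration; not part of crawlee.
    """

    async def abort_images(route: Any, request: Any) -> None:
        # Abort image requests and let everything else through.
        if request.resource_type == 'image':
            await route.abort()
        else:
            await route.continue_()

    async def hook(context: Any) -> None:
        # Register the route only when blocking is requested, so that
        # requests are not intercepted at all when the flag is off.
        if block_images:
            await context.page.route('**/*', abort_images)

    return hook
```

With a crawler instance at hand, the hook could be registered as `crawler.pre_navigation_hook(make_block_images_hook(block_images=True))`.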