docs: Add a new guide on how to avoid getting blocked #576

Open
wants to merge 12 commits into
base: master
from

Conversation

MostlyKIGuess

Description

Testing

  • Checked locally by running the website as given in the CONTRIBUTING.md guide.

image

image

image

@MostlyKIGuess
Author

@janbuchar @vdusek Can you help me figure out how to add the operating system change? I am not sure whether the current code works, and there's no way to test it.

@vdusek vdusek self-assigned this Oct 8, 2024
@vdusek vdusek self-requested a review October 8, 2024 17:52
@vdusek vdusek removed their assignment Oct 8, 2024
Collaborator

@vdusek vdusek left a comment

Have you tried to execute the code samples?

docs/guides/avoid_blocking_playwright.py Outdated Show resolved Hide resolved
docs/guides/avoid_getting_blocked.mdx Outdated Show resolved Hide resolved
docs/guides/avoid_getting_blocked.mdx Outdated Show resolved Hide resolved
Comment on lines 5 to 10
browser_pool = BrowserPool.with_default_plugin(
    headless=True,
    kwargs={
        'use_fingerprints': False,
    },
)
Collaborator

Additional kwargs can be provided directly.

Suggested change
- browser_pool = BrowserPool.with_default_plugin(
-     headless=True,
-     kwargs={
-         'use_fingerprints': False,
-     },
- )
+ browser_pool = BrowserPool.with_default_plugin(
+     headless=True,
+     use_fingerprints=False,
+ )

Author

Passing it directly does not work here, so I added the extra code in the conversation as well. That led to a new issue, and I have attached screenshots of it.

Collaborator

Yes, this is not going to work, since use_fingerprints is a parameter of the plugin and not of the BrowserPool.
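
To illustrate the point, here is a minimal sketch using stand-in classes. `Plugin`, `Pool`, and `try_pool_level_option` are hypothetical names invented for this example and are not crawlee's classes or signatures. The only assumption is the one stated above: the option belongs to the plugin, so a pool constructor that does not declare it rejects it.

```python
from __future__ import annotations


class Plugin:
    """Stand-in for a browser plugin (hypothetical, not crawlee's class)."""

    def __init__(self, *, use_fingerprints: bool = False) -> None:
        self.use_fingerprints = use_fingerprints


class Pool:
    """Stand-in for a browser pool that only declares pool-level options."""

    def __init__(self, plugins: list[Plugin], *, headless: bool = True) -> None:
        self.plugins = plugins
        self.headless = headless


def try_pool_level_option() -> str:
    """Pass a plugin-level option to the pool and report what happens."""
    try:
        # The pool does not declare use_fingerprints, so Python raises TypeError.
        Pool([Plugin()], use_fingerprints=False)  # type: ignore[call-arg]
    except TypeError as exc:
        return f'rejected: {exc}'
    return 'accepted'


# Configure the plugin first, then hand the configured plugin to the pool.
pool = Pool([Plugin(use_fingerprints=False)], headless=True)
```

In this sketch the plugin-level option is set on the plugin itself, and the pool only receives already-configured plugins.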

Comment on lines 8 to 23
    kwargs={
        'use_fingerprints': True,
        'fingerprint_options': {
            'fingerprint_generator_options': {
                'browsers': [
                    {
                        'name': 'chromium',  # Or 'firefox', or 'webkit'
                        'min_version': 96,
                    },
                ],
                'devices': ['desktop'],  # Specify device types directly
                'operating_systems': ['windows'],  # Specify OS types directly
            },
        },
    },
)
Collaborator

As I wrote below, additional kwargs can be provided directly. But in this case, I'm not sure whether this is correct. Have you tried to execute it?

@fnesveda fnesveda added the t-tooling label Oct 9, 2024

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
@MostlyKIGuess
Author

@vdusek Hey, I was experimenting by changing the source code, and I found:

  • We can only use Firefox on Linux / show on Windows. Here are the attached images from changing it to linux, mac, and windows.
    image
    image
    image

  • Similarly for Chromium: only on Linux / show on Mac, whichever options are kept:
    image

from __future__ import annotations

from logging import getLogger
from typing import TYPE_CHECKING, Any

from playwright.async_api import Playwright, async_playwright
from typing_extensions import override

from crawlee.browsers._base_browser_plugin import BaseBrowserPlugin
from crawlee.browsers._playwright_browser_controller import PlaywrightBrowserController

if TYPE_CHECKING:
    from collections.abc import Mapping
    from types import TracebackType

    from crawlee.browsers._types import BrowserType

logger = getLogger(__name__)


class PlaywrightBrowserPlugin(BaseBrowserPlugin):
    """A plugin for managing Playwright automation library.

    It should work as a factory for creating new browser instances.
    """

    AUTOMATION_LIBRARY = 'playwright'

    def __init__(
        self,
        *,
        browser_type: BrowserType = 'chromium',
        browser_options: Mapping[str, Any] | None = None,
        page_options: Mapping[str, Any] | None = None,
        max_open_pages_per_browser: int = 20,
        fingerprint_generator_options: Mapping[str, Any] | None = None,
        use_fingerprints: bool = False,
    ) -> None:
        """Create a new instance.

        Args:
            browser_type: The type of the browser to launch.
            browser_options: Options to configure the browser instance.
            page_options: Options to configure a new page instance.
            max_open_pages_per_browser: The maximum number of pages that can be opened in a single browser instance.
                Once reached, a new browser instance will be launched to handle the excess.
            fingerprint_generator_options: Options for generating browser fingerprints.
            use_fingerprints: Whether to use browser fingerprints.
        """
        self._browser_type = browser_type
        self._browser_options = browser_options or {}
        self._page_options = page_options or {}
        self._max_open_pages_per_browser = max_open_pages_per_browser
        self._fingerprint_generator_options = fingerprint_generator_options or {}
        self._use_fingerprints = use_fingerprints

        self._playwright_context_manager = async_playwright()
        self._playwright: Playwright | None = None

    @property
    @override
    def browser_type(self) -> BrowserType:
        return self._browser_type

    @property
    @override
    def browser_options(self) -> Mapping[str, Any]:
        return self._browser_options

    @property
    @override
    def page_options(self) -> Mapping[str, Any]:
        return self._page_options

    @property
    @override
    def max_open_pages_per_browser(self) -> int:
        return self._max_open_pages_per_browser

    @property
    def fingerprint_generator_options(self) -> Mapping[str, Any]:
        return self._fingerprint_generator_options

    @property
    def use_fingerprints(self) -> bool:
        return self._use_fingerprints

    @override
    async def __aenter__(self) -> PlaywrightBrowserPlugin:
        logger.debug('Initializing Playwright browser plugin.')
        self._playwright = await self._playwright_context_manager.__aenter__()
        return self

    @override
    async def __aexit__(
        self,
        exc_type: type[BaseException] | None,
        exc_value: BaseException | None,
        exc_traceback: TracebackType | None,
    ) -> None:
        logger.debug('Closing Playwright browser plugin.')
        await self._playwright_context_manager.__aexit__(exc_type, exc_value, exc_traceback)

    @override
    async def new_browser(self) -> PlaywrightBrowserController:
        if not self._playwright:
            raise RuntimeError('Playwright browser plugin is not initialized.')

        if self._browser_type == 'chromium':
            browser = await self._playwright.chromium.launch(**self._browser_options)
        elif self._browser_type == 'firefox':
            browser = await self._playwright.firefox.launch(**self._browser_options)
        elif self._browser_type == 'webkit':
            browser = await self._playwright.webkit.launch(**self._browser_options)
        else:
            raise ValueError(f'Invalid browser type: {self._browser_type}')

        return PlaywrightBrowserController(
            browser,
            max_open_pages_per_browser=self._max_open_pages_per_browser,
        )

# Updated avoid_blocking_playwright.py
import asyncio

from crawlee.browsers import BrowserPool
from crawlee.browsers._playwright_browser_plugin import PlaywrightBrowserPlugin
from crawlee.playwright_crawler import PlaywrightCrawler

# Create the PlaywrightBrowserPlugin with customized options.
# Note: fingerprint_generator_options and use_fingerprints are only accepted
# by the locally modified plugin shown above.
plugin = PlaywrightBrowserPlugin(
    browser_type='chromium',  # Or 'firefox'
    browser_options={
        'args': [
            '--no-sandbox',
            '--disable-setuid-sandbox',
        ],
    },
    fingerprint_generator_options={
        'devices': ['desktop'],
        'operating_systems': ['windows'],  # Specify OS types directly
    },
    use_fingerprints=True,
)

# Create the browser pool with the customized plugin
browser_pool = BrowserPool(plugins=[plugin])

# Instantiate the PlaywrightCrawler with the customized browser pool
# (not used by main() below, which drives the pool directly).
crawler = PlaywrightCrawler(
    browser_pool=browser_pool,
)


async def main() -> None:
    # Open a page through the pool and report the user agent the site sees.
    async with browser_pool:
        crawlee_page = await browser_pool.new_page()
        page = crawlee_page.page
        await page.goto('https://www.whatismybrowser.com/')
        user_agent = await page.evaluate('navigator.userAgent')
        print(f'User-Agent: {user_agent}')
        await page.screenshot(path='screenshot.png')


if __name__ == '__main__':
    asyncio.run(main())
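
For comparing runs without reading screenshots, the printed User-Agent string could be classified with a small helper. This is a rough sketch, not part of the PR: `guess_os_from_user_agent` is a hypothetical name, and the substring rules are heuristic assumptions about common User-Agent formats rather than a complete parser.

```python
def guess_os_from_user_agent(user_agent: str) -> str:
    """Return a coarse OS label guessed from a User-Agent string."""
    ua = user_agent.lower()
    # Order matters: Android UAs also contain "linux", and iOS UAs
    # also contain "mac os x", so the more specific tokens come first.
    rules = [
        ('android', 'android'),
        ('iphone', 'ios'),
        ('ipad', 'ios'),
        ('windows', 'windows'),
        ('mac os x', 'macos'),
        ('macintosh', 'macos'),
        ('linux', 'linux'),
    ]
    for needle, label in rules:
        if needle in ua:
            return label
    return 'unknown'


if __name__ == '__main__':
    sample = 'Mozilla/5.0 (X11; Linux x86_64; rv:130.0) Gecko/20100101 Firefox/130.0'
    print(guess_os_from_user_agent(sample))
```

Calling it on the `user_agent` value from the script would give one label per run, which makes the linux, mac, and windows experiments easier to compare side by side.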

@MostlyKIGuess
Author

@vdusek So, if you agree, we can keep only those 3 options in the documentation. Let me know what else should be added. I have tweaked the source a little because it wasn't accepting the plugin options in the code above.

@MostlyKIGuess
Author

Hey @vdusek, can you please advise: do we just keep the 3 options I mentioned above, or wait until the features are implemented? I think documenting what we have now, along with additional tips, would be better than nothing.

Collaborator

@vdusek vdusek left a comment

Fingerprinting in Crawlee for Python is currently very limited. We have implemented only the basics so far, see #401 and #402. The next step is #549. This means you cannot just copy the content from the JS guide.

Next steps...

  • Write the guide using only the current (limited) feature set for avoiding blocking.
  • Or wait for the fingerprinting to be completely implemented.

@vdusek vdusek changed the title docs: Added documentation on how to Avoid getting blocked docs: Add a new guide on how to avoid getting blocked Oct 31, 2024
Labels
t-tooling Issues with this label are in the ownership of the tooling team.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create a new guide about how to not get blocked
4 participants