Botasaurus can't pass CF #147

Open
JimKarvo opened this issue Jul 1, 2024 · 2 comments

Comments

@JimKarvo

JimKarvo commented Jul 1, 2024

Cloudflare (CF) seems to be able to detect Botasaurus. My IP is not banned, and the problem is not OS-related: I see the same behavior on Windows 11 and on Ubuntu Server.

If I omit the "wait" parameter, I get a different error (such as the "id" not being found).

The script:

from botasaurus.browser import browser, Driver

@browser(add_arguments=['--no-sandbox'])
def scrape_heading_task(driver: Driver, data):
    # Visit the GitLab sign-in page with Google as the referrer
    driver.google_get("https://gitlab.com/users/sign_in", bypass_cloudflare=True, wait=10)
    
    # Retrieve the heading element's text
    heading = driver.get_text("h1")

    # Save the data as a JSON file in output/scrape_heading_task.json
    return {
        "heading": heading
    }
     
# Initiate the web scraping task
scrape_heading_task()

The log:

Traceback (most recent call last):
  File "/root/.venv/lib/python3.12/site-packages/botasaurus/browser_decorator.py", line 176, in run_task
    result = func(driver, data)
             ^^^^^^^^^^^^^^^^^^
  File "/root/pricecheckgrbots/delete.py", line 6, in scrape_heading_task
    driver.google_get("https://gitlab.com/users/sign_in", bypass_cloudflare=True, wait=10)
  File "/root/.venv/lib/python3.12/site-packages/botasaurus_driver/driver.py", line 536, in google_get
    self.get_via(link, "https://www.google.com/", bypass_cloudflare=bypass_cloudflare, wait=wait)
  File "/root/.venv/lib/python3.12/site-packages/botasaurus_driver/driver.py", line 522, in get_via
    self.detect_and_bypass_cloudflare()
  File "/root/.venv/lib/python3.12/site-packages/botasaurus_driver/driver.py", line 878, in detect_and_bypass_cloudflare
    bypass_if_detected(self)
  File "/root/.venv/lib/python3.12/site-packages/botasaurus_driver/solve_cloudflare_captcha.py", line 122, in bypass_if_detected
    wait_till_cloudflare_leaves(driver, previous_ray_id, raise_exception)
  File "/root/.venv/lib/python3.12/site-packages/botasaurus_driver/solve_cloudflare_captcha.py", line 64, in wait_till_cloudflare_leaves
    raise CloudflareDetectionException()
botasaurus_driver.exceptions.CloudflareDetectionException: Cloudflare has detected us.
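
A failure like the one in this traceback can be wrapped in a small retry loop so that an intermittent detection does not end the whole run. The sketch below is not part of Botasaurus: `retry_on` and its parameters are hypothetical names, and the exception class is supplied by the caller (for the script above that would presumably be `CloudflareDetectionException` from `botasaurus_driver.exceptions`, the path shown in the traceback). Retrying does not make the browser any less detectable; it only helps when the failure is intermittent.

```python
import time


def retry_on(exc_type, task, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call task() until it succeeds or `attempts` calls have failed.

    Hypothetical helper, not a Botasaurus API. `exc_type` is the exception
    class to retry on; any other exception propagates immediately.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except exc_type:
            if attempt == attempts:
                raise  # out of attempts: surface the last failure
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Assuming the exception import works as the traceback suggests, the scraping function would be passed as `task`, for example `retry_on(CloudflareDetectionException, scrape_heading_task)`.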

@kreethandsouza

I tried your code on my Ubuntu system and it works fine for me. If you still have no luck, try this:

from botasaurus.browser import browser, Driver
import time


@browser(add_arguments=['--no-sandbox'])
def scrape_heading_task(driver: Driver, data):
    driver.google_get("https://gitlab.com/users/sign_in")
    time.sleep(2)
    iframe = driver.select_iframe("#turnstile-wrapper iframe")
    checkbox = iframe.select('label', None)
    if checkbox:
        checkbox.click()
    driver.prompt()
    driver.save_screenshot()

    heading = driver.get_text("h1")
    return heading


# Initiate the web scraping task
scrape_heading_task()

If necessary, you might have to use proxies to access the site.

@JimKarvo

JimKarvo commented Jul 3, 2024

Still not working on Ubuntu Server (no GUI).

The server has the same IP as my Windows machine. On Windows the script works without any problems.
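
When the same script behaves differently on two machines like this, it can help to record what actually differs between them. The following is a minimal sketch using only the standard library; `describe_environment` is a hypothetical helper, and the `has_display` field is only meaningful on Linux, where X11 and Wayland sessions normally set `DISPLAY` or `WAYLAND_DISPLAY`. Whether a missing display is what Cloudflare reacts to here is an assumption, not something this thread confirms.

```python
import os
import platform


def describe_environment(env=None):
    """Collect details that distinguish a desktop from a headless server.

    Hypothetical helper for bug reports, not a Botasaurus API.
    """
    env = os.environ if env is None else env
    return {
        "system": platform.system(),
        "python": platform.python_version(),
        # Only meaningful on Linux: graphical sessions normally set one of these.
        "has_display": bool(env.get("DISPLAY") or env.get("WAYLAND_DISPLAY")),
    }
```

Printing the result of `describe_environment()` on both machines gives a compact comparison to attach to the report.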

On Linux I tried this:

from botasaurus.browser import browser, Driver
import time


@browser(add_arguments=['--no-sandbox'])
def scrape_heading_task(driver: Driver, data):
    driver.google_get("https://gitlab.com/users/sign_in")
    time.sleep(10)
    iframe = driver.select_iframe("#turnstile-wrapper iframe")
    driver.save_screenshot()
    checkbox = iframe.select('label', None)
    if checkbox:
        print("detected checkbox")
        checkbox.click()
    time.sleep(1)
    driver.save_screenshot()
    driver.prompt()
    driver.save_screenshot()

    heading = driver.get_text("h1")
    return heading


# Initiate the web scraping task
scrape_heading_task()

It seems that the checkbox isn't clicked (see the second screenshot).
If I increase the delay from 10 to 30 seconds, the Turnstile widget disappears!
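
Since the outcome changes with the length of the fixed sleep, polling for the checkbox may be more robust than guessing a delay. The sketch below is a generic polling helper in plain Python; `wait_until` and its parameters are hypothetical names and not part of Botasaurus.

```python
import time


def wait_until(predicate, timeout=30.0, interval=0.5,
               clock=time.monotonic, sleep=time.sleep):
    """Poll predicate() until it returns a truthy value or `timeout` elapses.

    Returns the truthy value, or None on timeout. Hypothetical helper,
    not a Botasaurus API.
    """
    deadline = clock() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if clock() >= deadline:
            return None  # gave up: the condition never became true
        sleep(interval)
```

In the script above, the predicate could be something like `lambda: iframe.select('label', None)`, assuming that call returns None when nothing matches, as the `if checkbox:` check suggests. The click would then happen as soon as the checkbox appears instead of after a fixed 10 or 30 seconds.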

@JimKarvo JimKarvo mentioned this issue Jul 9, 2024