docs: add gmaps scraping blog #2772
base: master
Conversation
Reviewed the article; it is very good and reads well, but it doesn't use the full potential of Crawlee in some places - let's improve that 🙂
This is huge - isn't there a more suitable format than GIF?
I don't know - any suggestions? A GIF kinda fits here.
Could this be webp as well?
yes
```python
@crawler.router.default_handler
async def default_handler(context):
    await scrape_google_maps(context)
```
the `default_handler` wrapper does not make much sense here...
Suggested change:
```python
crawler.router.default_handler(scrape_google_maps)
```
This should be enough if you want to keep the handler definition outside of the `main` function.
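For illustration, a minimal sketch of how that direct registration could look end to end; the import path, handler body, and search URL are assumptions for the sketch, not the article's exact code:

```python
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def scrape_google_maps(context: PlaywrightCrawlingContext) -> None:
    # The extraction logic from the article would live here.
    context.log.info(f'Processing {context.request.url}')


async def main() -> None:
    crawler = PlaywrightCrawler()
    # Register the existing coroutine directly; no pass-through
    # default_handler wrapper is needed.
    crawler.router.default_handler(scrape_google_maps)
    await crawler.run(['https://www.google.com/maps/search/restaurants'])


if __name__ == '__main__':
    asyncio.run(main())
```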
""" | ||
page = context.page | ||
await page.goto(context.request.url) | ||
print("Connected to:", context.request.url) |
print("Connected to:", context.request.url) | |
print("Processing: ", context.request.url) |
```python
    # Pretty-print the data
    print(json.dumps(data, indent=4))
    print("\n")
```
This is not how it's supposed to be done - it'd be better to use `context.push_data(data)`.
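As a sketch, assuming the handler extracts a `data` dict as in the article (the placeholder record below is illustrative):

```python
async def scrape_google_maps(context: PlaywrightCrawlingContext) -> None:
    # ... extraction logic from the article producing a `data` dict ...
    data = {'url': context.request.url}  # placeholder record

    # Store the record in the crawler's default dataset instead of
    # pretty-printing it to stdout.
    await context.push_data(data)
```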
```python
with open('google_maps_data.json', 'w', encoding='utf-8') as f:
    json.dump(all_data, f, ensure_ascii=False, indent=2)
```
If you use the default dataset for this, you can simply do `crawler.export_data_json('path', ensure_ascii=False, indent=2)`.
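A sketch of how the end of `main` could look then; the output file name is a placeholder:

```python
async def main() -> None:
    crawler = PlaywrightCrawler()
    crawler.router.default_handler(scrape_google_maps)
    await crawler.run(['https://www.google.com/maps/search/restaurants'])

    # Export everything stored via context.push_data() in one call;
    # ensure_ascii and indent are passed through to the JSON writer.
    await crawler.export_data_json('google_maps_data.json', ensure_ascii=False, indent=2)
```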
First, we need a function that can handle the scrolling and detect when we've hit the bottom. Copy-paste this new function into the `gmap_scraper.py` file:
```python
async def load_more_items(page) -> bool:
    ...
```
- The article does not mention where the function should be called.
- Crawlee already has `context.infinite_scroll()` - does it not work in this case?
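For comparison, a sketch of the handler using the built-in helper instead of a custom scroller; the results-feed selector is an illustrative assumption, not taken from the article:

```python
async def scrape_google_maps(context: PlaywrightCrawlingContext) -> None:
    # Scroll the page until no new content loads; this replaces the
    # hand-rolled load_more_items() loop.
    await context.infinite_scroll()

    # Illustrative selector for the Google Maps results feed (an assumption).
    listings = await context.page.locator('div[role="feed"] > div').all()
    context.log.info(f'Loaded {len(listings)} listings after scrolling.')
```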
Approved by Adam and marketing.