How to use scrapy-playwright with the CrawlSpider?
By specifying a `process_request` method that modifies requests in-place in your crawling rules. For instance:
```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


def set_playwright_true(request, response):
    request.meta["playwright"] = True
    return request


class MyCrawlSpider(CrawlSpider):
    ...
    rules = (
        Rule(
            link_extractor=LinkExtractor(...),
            callback="parse_item",
            follow=False,
            process_request=set_playwright_true,
        ),
    )
```
If you want all requests to be processed by Playwright and don't want to repeat yourself, or you're using a generic spider that doesn't support request customization (e.g. `scrapy.spiders.SitemapSpider`), you can use a middleware to edit the `meta` attribute for all requests.
Depending on your project and the interactions with other components, you might decide to use a spider middleware or a downloader middleware.
Spider middleware example:
```python
import scrapy


class PlaywrightSpiderMiddleware:
    def process_spider_output(self, response, result, spider):
        for obj in result:
            # Only requests get the flag; items pass through unchanged.
            if isinstance(obj, scrapy.Request):
                obj.meta.setdefault("playwright", True)
            yield obj
```
Downloader middleware example:
```python
class PlaywrightDownloaderMiddleware:
    def process_request(self, request, spider):
        request.meta.setdefault("playwright", True)
        return None  # let Scrapy continue processing the request normally
```
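Either middleware needs to be registered in the project settings before it takes effect. A minimal sketch, assuming the classes live in a hypothetical `myproject.middlewares` module (the priority values are likewise illustrative):

```python
# settings.py — enable whichever middleware you chose;
# the module path and priorities below are placeholders.
SPIDER_MIDDLEWARES = {
    "myproject.middlewares.PlaywrightSpiderMiddleware": 100,
}
# or, if you went with the downloader middleware instead:
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.PlaywrightDownloaderMiddleware": 100,
}
```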
If you're seeing messages such as `JavaScript heap out of memory`, there's a chance you're falling into the scope of microsoft/playwright#6319. As a workaround, it's possible to increase the amount of memory allowed for the Node.js process by specifying a value for the `--max-old-space-size` V8 option in the `NODE_OPTIONS` environment variable, e.g.:
```console
$ export NODE_OPTIONS=--max-old-space-size=SIZE  # in megabytes
```
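For example, to allow the Node.js process up to 4 GB while running a crawl (the value 4096 and the spider name `myspider` are purely illustrative, so substitute your own):

```console
$ NODE_OPTIONS=--max-old-space-size=4096 scrapy crawl myspider
```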