I have a spider that crawls only detail pages, and they are never skipped by this middleware.
Good catch; we need to add a `process_start_requests` method as well.
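Roughly along these lines (a sketch only, assuming the middleware keeps a database of request fingerprints; the class name, `self.db`, and the fingerprint handling are placeholders for illustration, not the middleware's actual internals):

```python
# Sketch: applying the crawl-once check to start requests through
# Scrapy's spider-middleware hook. self.db is an assumed placeholder
# for whatever store the middleware actually uses.
from scrapy.utils.request import request_fingerprint


class CrawlOnceSpiderMiddlewareSketch:

    def __init__(self):
        self.db = {}  # fingerprint -> timestamp of previously crawled requests

    def process_start_requests(self, start_requests, spider):
        for request in start_requests:
            if not request.meta.get('crawl_once', False):
                # Request did not opt in: pass it through untouched.
                yield request
            elif request_fingerprint(request) not in self.db:
                # Opted in but not crawled before: let it proceed.
                yield request
            # Otherwise drop it: it was crawled in a previous run.
```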
@bezkos Are you using `meta={'crawl_once': True}`? I tested the middleware with this simple spider, and it works correctly.
```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, meta={'crawl_once': True})

    def parse(self, response):
        yield {
            'title': response.css('h1 a::text').extract_first(),
        }
```
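For reference, I had the middleware enabled in settings.py the way the README describes; as far as I remember it looks like this, but verify the exact order values against the README:

```python
# settings.py: enable scrapy-crawl-once (order values as I recall
# from its README; double-check them there).
SPIDER_MIDDLEWARES = {
    'scrapy_crawl_once.CrawlOnceMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawl_once.CrawlOnceMiddleware': 50,
}
```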
First run: the request is sent and stored.
```
{'crawl_once/initial': 0,
 'crawl_once/stored': 1,
 'downloader/request_bytes': 231,
 'downloader/request_count': 1}
```
Second run: the request is ignored.
```
{'crawl_once/ignored': 1,
 'crawl_once/initial': 1,
 'downloader/exception_count': 1,
 'downloader/exception_type_count/scrapy.exceptions.IgnoreRequest': 1}
```
Note: requests generated from `start_urls` do not have `crawl_once` in their meta dictionary by default. To add it, override the `start_requests` method, as in the spider above.
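Alternatively, if I remember the README correctly, there is a `CRAWL_ONCE_DEFAULT` setting that makes `crawl_once` the default for every request, so `start_urls` would be covered without overriding `start_requests`. Treat the setting name as something to verify in the README:

```python
# settings.py: assumed option from the scrapy-crawl-once README that
# flips the default, tracking every request unless it explicitly sets
# meta={'crawl_once': False}. Verify the name before relying on it.
CRAWL_ONCE_DEFAULT = True
```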
Can you explain what problem you had?