scrapy - How Scrapy Loops (3)

    1. General
    2. Scrapy polling logic
    3. _next_request_from_scheduler

1. General

The core of Scrapy's crawl loop is Twisted's task.LoopingCall running on the reactor: the engine starts it with slot.heartbeat.start(5), so a scheduling tick fires once every 5 seconds.
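
To see the mechanism in isolation, here is a minimal standalone sketch of that Twisted primitive; tick() stands in for Scrapy's nextcall.schedule, and the 16-second stop exists only to end the demo:

from twisted.internet import task, reactor

def tick():
    # In Scrapy this callback is nextcall.schedule, which pokes _next_request
    print('heartbeat: try to schedule the next request')

loop = task.LoopingCall(tick)
loop.start(5)                         # fire immediately, then every 5 seconds
reactor.callLater(16, reactor.stop)   # end the demo after ~3 ticks
reactor.run()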

2. Scrapy polling logic

def _next_request(self, spider):
    slot = self.slot
    if not slot:
        return

    if self.paused:
        return

    # Concurrency throttling: keep fetching while _needs_backout says we are
    # still under the limits (in-flight request count, buffered response size)
    while not self._needs_backout(spider):
        # 1. Pull the next request off the scheduler and download it
        if not self._next_request_from_scheduler(spider):
            break

    if slot.start_requests and not self._needs_backout(spider):
        try:
            # 2. On first startup, consume the spider's own start_requests
            request = next(slot.start_requests)
        except StopIteration:
            slot.start_requests = None
        except Exception:
            slot.start_requests = None
            logger.error('Error while obtaining start requests',
                         exc_info=True, extra={'spider': spider})
        else:
            self.crawl(request, spider)

    if self.spider_is_idle(spider) and slot.close_if_idle:
        # 3. Fire spider_idle so extensions (e.g. scrapy-redis) can schedule more seeds
        self._spider_idle(spider)
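
For reference, the backout check described in the comment above looks roughly like this in the Scrapy 1.x engine (a paraphrase; attribute names may differ between versions):

def _needs_backout(self, spider):
    slot = self.slot
    return (
        not self.running                      # engine is shutting down
        or slot.closing                       # the spider is being closed
        or self.downloader.needs_backout()    # too many in-flight requests (>= CONCURRENT_REQUESTS)
        or self.scraper.slot.needs_backout()  # too much buffered response data (> 5 MB by default)
    )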
  1. _next_request is effectively the body of a "while True" loop
  2. The loop body has three places where seeds (requests) can be obtained:
    • self._next_request_from_scheduler(spider) - the first and most important one
    • request = next(slot.start_requests) - the second source: the start_requests you wrote in your spider
    • self._spider_idle(spider) - fires handlers registered on the signals.spider_idle signal, e.g. scrapy-redis (see the sketch after this list)
  3. External scheduling: self.crawler.engine.crawl passes seeds into the engine from outside
  4. self.heartbeat = task.LoopingCall(nextcall.schedule) plus slot.heartbeat.start(5) - this is the real core of the loop
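
As a concrete example of the third seed source, the sketch below hooks the signals.spider_idle signal to inject new requests, in the spirit of scrapy-redis. pending_seeds is a hypothetical attribute standing in for a Redis queue, and engine.crawl(request, spider) is the signature of the older Scrapy versions quoted in this article:

from scrapy import signals
from scrapy.exceptions import DontCloseSpider

class KeepAliveExtension:
    # Feeds new seeds to the engine whenever the spider goes idle.

    def __init__(self, crawler):
        self.crawler = crawler
        crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def spider_idle(self, spider):
        # Pull the next batch of seeds from an external source
        # (pending_seeds is hypothetical; scrapy-redis pops from Redis here).
        for request in getattr(spider, 'pending_seeds', []):
            # Same entry point as item 3 above: hand the request to the engine.
            self.crawler.engine.crawl(request, spider)
        # Raising DontCloseSpider keeps the spider alive for the next idle tick.
        raise DontCloseSpider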

3. _next_request_from_scheduler