Scrapy Middleware Source Code Analysis (4)

    1. General
    2. MiddlewareManager (parent)
    • 2.1. MiddlewareManager is the abstract base class for middleware; its subclasses and their load order
    • 2.2. MiddlewareManager initialization
    • 2.3. Storage data structure
    3. ExtensionManager (child)
    • 3.1. First, a look at the log output at startup
    • 3.2. Extension middleware is loaded by the parent class
    • 3.3. Middleware key descriptions
    4. DownloaderMiddlewareManager (child)
    • 4.1. First, a look at the log output at startup
    • 4.2. DownloaderMiddleware loading source
    • 4.3. Middleware key descriptions
    • 4.4. Invocation
    5. SpiderMiddlewareManager (child)
    • 5.1. First, a look at the log output at startup
    • 5.2. SpiderMiddleware loading source
    • 5.3. Middleware key descriptions
    • 5.4. Invocation
    6. ItemPipelineManager (child)
    • 6.1. First, a look at the log output at startup
    • 6.2. ItemPipeline loading source
    • 6.3. Middleware key descriptions
    • 6.4. Invocation

1. General

2. MiddlewareManager (parent)

2.1. MiddlewareManager is the abstract base class for middleware; its subclasses and their load order

  • ExtensionManager: extension middleware (loaded 1st)
  • DownloaderMiddlewareManager: downloader middleware (loaded 2nd)
  • SpiderMiddlewareManager: spider middleware (loaded 3rd)
  • ItemPipelineManager: item pipeline middleware (loaded 4th)

2.2. MiddlewareManager initialization

# from scrapy/middleware.py; defaultdict and deque come from the stdlib
# collections module
def __init__(self, *middlewares):
    # all enabled middleware instances
    self.middlewares = middlewares
    # hook storage: a map where each slot is a queue (deque) of bound methods
    self.methods = defaultdict(deque)
    for mw in middlewares:
        self._add_middleware(mw)
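
These managers are instantiated through the classmethod from_settings, which is also where the "Enabled ..." INFO lines quoted in the sections below get logged. A simplified sketch, paraphrased from Scrapy 2.x (exact code varies between versions):

import logging
import pprint

from scrapy.exceptions import NotConfigured
from scrapy.utils.misc import create_instance, load_object

logger = logging.getLogger(__name__)

@classmethod
def from_settings(cls, settings, crawler=None):
    # each subclass resolves its own ordered list of class paths
    mwlist = cls._get_mwlist_from_settings(settings)
    middlewares, enabled = [], []
    for clspath in mwlist:
        try:
            mwcls = load_object(clspath)
            mw = create_instance(mwcls, settings, crawler)
            middlewares.append(mw)
            enabled.append(clspath)
        except NotConfigured:
            pass  # a middleware can opt out by raising NotConfigured
    # the "Enabled extensions / downloader middlewares / ..." log line
    logger.info("Enabled %(componentname)ss:\n%(enabledlist)s",
                {'componentname': cls.component_name,
                 'enabledlist': pprint.pformat(enabled)})
    return cls(*middlewares)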

2.3. Storage data structure

  • Every hook method a middleware overrides is stored in self.methods = defaultdict(deque), a map from hook name to a deque of bound methods
  • The map keys classify the hooks; see the key descriptions under each manager below
  • Call order is fixed at load time: the middlewares are already sorted by priority, and each hook is added with append or appendleft, as the sketch below shows
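
A minimal, self-contained sketch of why append vs. appendleft matters (hypothetical middleware names A and B, not Scrapy code): request-direction hooks run in priority order, response-direction hooks in reverse, producing the familiar "onion" model.

from collections import defaultdict, deque

methods = defaultdict(deque)

# middlewares are added in priority order: A, then B
for name in ('A', 'B'):
    # request-direction hooks keep priority order
    methods['process_request'].append(f'{name}.process_request')
    # response-direction hooks are reversed
    methods['process_response'].appendleft(f'{name}.process_response')

print(list(methods['process_request']))   # ['A.process_request', 'B.process_request']
print(list(methods['process_response']))  # ['B.process_response', 'A.process_response']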

3. ExtensionManager (child)

3.1. First, a look at the log output at startup

2020-08-21 10:26:15 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',           # core stats collector
 'scrapy.extensions.telnet.TelnetConsole',          # starts the telnet console listener
 'scrapy.extensions.memusage.MemoryUsage',          # monitors memory usage, takes emergency action
 'scrapy.extensions.logstats.LogStats',             # periodic stats logging
 'social_spider.middlewares.SpiderCallBackErrCount']  # custom collector

3.2. Extension middleware is loaded by the parent class

# In fact extension middleware is not really stored per hook in the dict; each
# extension does its work in from_crawler, registering handlers via signals.
# The parent class's _add_middleware only picks up open_spider/close_spider:
def _add_middleware(self, mw):
    if hasattr(mw, 'open_spider'):
        self.methods['open_spider'].append(mw.open_spider)
    if hasattr(mw, 'close_spider'):
        self.methods['close_spider'].appendleft(mw.close_spider)
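
Because extensions hook in through signals rather than through self.methods, a custom extension such as the SpiderCallBackErrCount entry in the log above typically looks like the following. A minimal sketch; the class body and the stat key are illustrative, not taken from the original project:

from scrapy import signals

class SpiderCallBackErrCount:
    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler.stats)
        # register a handler for the spider_error signal instead of
        # storing a hook in MiddlewareManager.methods
        crawler.signals.connect(ext.spider_error, signal=signals.spider_error)
        return ext

    def spider_error(self, failure, response, spider):
        # count callback errors per spider (illustrative stat key)
        self.stats.inc_value('callback_error_count', spider=spider)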

3.3. Middleware key descriptions

open_spider
close_spider

4. DownloaderMiddlewareManager (child)

4.1. First, a look at the log output at startup

2020-08-21 10:29:00 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'social_spider.middlewares.ProxyMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'social_spider.middlewares.StatMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']

4.2. DownloaderMiddleware loading source

# Overridden hook collection: every middleware in DOWNLOADER_MIDDLEWARES is
# stored under process_request, process_response, or process_exception,
# depending on which of those hooks it actually implements.
def _add_middleware(self, mw):
    if hasattr(mw, 'process_request'):
        self.methods['process_request'].append(mw.process_request)
    if hasattr(mw, 'process_response'):
        self.methods['process_response'].appendleft(mw.process_response)
    if hasattr(mw, 'process_exception'):
        self.methods['process_exception'].appendleft(mw.process_exception)

4.3. Middleware key descriptions

process_request
process_response
process_exception
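
A minimal downloader middleware implementing two of these hooks (a sketch; the class, header name, and priority number are illustrative):

class TokenHeaderMiddleware:
    """Add an auth header on the way out, log the status on the way back."""

    def process_request(self, request, spider):
        # returning None lets the request continue down the chain
        request.headers.setdefault('X-Auth-Token', 'example-token')
        return None

    def process_response(self, request, response, spider):
        # must return a Response (continue) or a Request (reschedule)
        spider.logger.debug('Got %s for %s', response.status, request.url)
        return response

# settings.py: the priority number decides its position in the chain
# DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.TokenHeaderMiddleware': 543}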

4.4. Invocation

def download(self, download_func, request, spider):
    @defer.inlineCallbacks
    def process_request(request):
        # walk the process_request hooks in priority order
        for method in self.methods['process_request']:
            response = yield deferred_from_coro(method(request=request, spider=spider))
            if response is not None and not isinstance(response, (Response, Request)):
                raise _InvalidOutput(
                    'Middleware %s.process_request must return None, Response or Request, got %s'
                    % (method.__self__.__class__.__name__, response.__class__.__name__))
            if response:
                # a middleware short-circuited the download
                defer.returnValue(response)
        # no middleware answered: perform the actual download
        defer.returnValue((yield download_func(request=request, spider=spider)))

    @defer.inlineCallbacks
    def process_response(response):
        assert response is not None, 'Received None in process_response'
        if isinstance(response, Request):
            # a Request means "reschedule"; skip the remaining hooks
            defer.returnValue(response)

        # walk the process_response hooks in reverse priority order
        for method in self.methods['process_response']:
            response = yield deferred_from_coro(method(request=request, response=response, spider=spider))
            if not isinstance(response, (Response, Request)):
                raise _InvalidOutput(
                    'Middleware %s.process_response must return Response or Request, got %s'
                    % (method.__self__.__class__.__name__, type(response)))
            if isinstance(response, Request):
                defer.returnValue(response)
        defer.returnValue(response)

    @defer.inlineCallbacks
    def process_exception(_failure):
        exception = _failure.value
        for method in self.methods['process_exception']:
            response = yield deferred_from_coro(method(request=request, exception=exception, spider=spider))
            if response is not None and not isinstance(response, (Response, Request)):
                raise _InvalidOutput(
                    'Middleware %s.process_exception must return None, Response or Request, got %s'
                    % (method.__self__.__class__.__name__, type(response)))
            if response:
                # the exception was converted into a Response/Request
                defer.returnValue(response)
        # nobody handled it: propagate the original failure
        defer.returnValue(_failure)

    # chain: process_request -> (errback) process_exception -> (callback) process_response
    deferred = mustbe_deferred(process_request, request)
    deferred.addErrback(process_exception)
    deferred.addCallback(process_response)
    return deferred
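
Note the order at the bottom: addErrback is attached before addCallback, so a failure raised while downloading (or in process_request) is first offered to process_exception, and whatever that returns then flows into process_response. A standalone Twisted sketch of the same shape (illustrative, not Scrapy code):

from twisted.internet import defer

def work(x):
    if x < 0:
        raise ValueError('bad input')
    return x * 2

def on_error(failure):
    # like process_exception: turn a failure into a usable result
    return 0

def on_result(value):
    # like process_response: always sees a value, never a failure
    return value + 1

d = defer.maybeDeferred(work, -1)
d.addErrback(on_error)    # fires only on failure
d.addCallback(on_result)  # fires with work's result, or on_error's recovery value
d.addCallback(print)      # prints 1 (0 from on_error, plus 1)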

5. SpiderMiddlewareManager (child)

5.1. First, a look at the log output at startup

2020-08-21 10:37:21 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']

5.2. SpiderMiddleware loading source

def _add_middleware(self, mw):
    # the parent picks up open_spider/close_spider
    super(SpiderMiddlewareManager, self)._add_middleware(mw)
    if hasattr(mw, 'process_spider_input'):
        self.methods['process_spider_input'].append(mw.process_spider_input)
    if hasattr(mw, 'process_start_requests'):
        self.methods['process_start_requests'].appendleft(mw.process_start_requests)
    # output/exception hooks are stored even when absent (as None), so every
    # middleware keeps its slot when the output chain is built
    process_spider_output = getattr(mw, 'process_spider_output', None)
    self.methods['process_spider_output'].appendleft(process_spider_output)
    process_spider_exception = getattr(mw, 'process_spider_exception', None)
    self.methods['process_spider_exception'].appendleft(process_spider_exception)

5.3. Middleware key descriptions

open_spider
close_spider
process_spider_input
process_start_requests
process_spider_output
process_spider_exception
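
A minimal spider middleware touching two of these hooks (a sketch; the class and meta key are illustrative):

class AnnotateMiddleware:
    def process_start_requests(self, start_requests, spider):
        # generator in, generator out: tag every seed request
        for request in start_requests:
            request.meta['seed'] = True
            yield request

    def process_spider_output(self, response, result, spider):
        # sees everything the spider callback yields (items and requests)
        for item_or_request in result:
            yield item_or_request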

5.4. Invocation

def process_start_requests(self, start_requests, spider):
    return self._process_chain('process_start_requests', start_requests, spider)
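
_process_chain is inherited from MiddlewareManager and builds a Deferred pipeline over the stored hooks. Roughly, paraphrased from scrapy/utils/defer.py (exact code varies by version):

def _process_chain(self, methodname, obj, *args):
    return process_chain(self.methods[methodname], obj, *args)

# scrapy.utils.defer.process_chain, paraphrased
def process_chain(callbacks, input, *a, **kw):
    # each callback's return value becomes the next callback's first argument
    d = defer.Deferred()
    for x in callbacks:
        d.addCallback(x, *a, **kw)
    d.callback(input)
    return d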

6. ItemPipelineManager (child)

6.1. First, a look at the log output at startup

2020-08-21 10:47:28 [scrapy.middleware] INFO: Enabled item pipelines:
['social_spider.pipelines.PipelineKafka']

6.2. ItemPipeline loading source

def _add_middleware(self, pipe):
    # the parent picks up open_spider/close_spider
    super(ItemPipelineManager, self)._add_middleware(pipe)
    if hasattr(pipe, 'process_item'):
        # wrap so that coroutine process_item implementations also return Deferreds
        self.methods['process_item'].append(deferred_f_from_coro_f(pipe.process_item))

6.3. Middleware key descriptions

open_spider
close_spider
process_item
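
A minimal pipeline covering all three keys. The Kafka internals of the PipelineKafka entry above are not shown in this post, so this sketch uses an illustrative file-based pipeline instead:

import json

class JsonLinesPipeline:
    def open_spider(self, spider):
        self.file = open(f'{spider.name}.jl', 'w')

    def process_item(self, item, spider):
        # must return the item (or raise DropItem) so the chain continues
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()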

6.4. Invocation

def process_item(self, item, spider):
    return self._process_chain('process_item', item, spider)
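
On the engine side, the scraper hands each yielded item to this manager and attaches a completion callback. Roughly, paraphrased from scrapy/core/scraper.py (names may differ across versions):

# inside the scraper, once a spider callback yields an item
dfd = self.itemproc.process_item(item, spider)
dfd.addBoth(self._itemproc_finished, item, response, spider)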