Pitfalls when combining Scrapy with IPProxyPool for proxying

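Introduction

Recently at work I was scraping a brand's products, and after too many test runs the target site banned our test environment's IP, so every connection timed out. Our team lead then found IPProxyPool, an open-source project by 七夜 on GitHub. A few things needed polishing and tuning along the way.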

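Adding a Proxy middleware to Scrapy

Create middlewares.py in the Scrapy project and override the process_request() method. Setting request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT" specifies the proxy IP and port. The code looks roughly like this: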

import base64

# Downloader middleware that routes every request through a proxy
class ProxyMiddleware(object):
    # Override process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"
        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up basic authentication for the proxy; b64encode works on
        # bytes and, unlike the removed encodestring, adds no trailing newline
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode('utf-8'))
        request.headers['Proxy-Authorization'] = b'Basic ' + encoded_user_pass

Then add the following to the project's configuration file, settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'pythontab.middlewares.ProxyMiddleware': 100,
}

IPProxyPool

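Drop this into the project; if you need to change the database, the project includes a format you can follow. It runs without much trouble, but there is one thing to watch for when proxying: some proxy IPs can only reach http sites, and likewise some can only reach https sites. So when overriding process_request in the Proxy middleware, check the protocol of request.url before querying the local server. The code looks roughly like this: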

def process_request(self, request, spider):
    # Remember the scheme so the pool can be asked for matching proxies
    if request.url.split(':')[0] == 'http':
        self.protocol = 'http'
    else:
        self.protocol = 'https'

Then send the request to the local IPProxyPool HTTP server:

# protocol=0 asks for http-only proxies, protocol=1 for https-only ones
if self.protocol == 'http':
    r = requests.get(self.ip_pool_href + "?count=10&type=0&protocol=0")
else:
    r = requests.get(self.ip_pool_href + "?count=10&type=0&protocol=1")
new_proxyes = self.byteify(json.loads(r.text))
# protocol=2 proxies support both http and https, so always add them
r_2 = requests.get(self.ip_pool_href + "?count=10&type=0&protocol=2")
new_proxyes += self.byteify(json.loads(r_2.text))
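The scheme check and the two pool queries above can be folded into small helpers that are easy to test without a running pool. This is only a sketch under assumptions: pool_queries and to_proxy_url are hypothetical names, the protocol codes (0 = http, 1 = https, 2 = both) follow the snippets above, and each pool entry is assumed to be a list of the form [ip, port, score].

```python
from urllib.parse import urlencode, urlparse

# Assumed protocol codes of the pool's HTTP API: 0 = http, 1 = https, 2 = both.
PROTOCOL_CODES = {"http": 0, "https": 1}


def pool_queries(pool_url, request_url, count=10):
    """Return the pool URLs to query for proxies matching request_url's scheme."""
    scheme = urlparse(request_url).scheme
    # Anything that is not plain http is treated as https, as in the snippet above.
    code = PROTOCOL_CODES.get(scheme, 1)
    # Ask for scheme-specific proxies first, then for proxies that support both.
    return [
        pool_url + "?" + urlencode({"count": count, "type": 0, "protocol": p})
        for p in (code, 2)
    ]


def to_proxy_url(entry):
    """Turn one pool entry, assumed to be [ip, port, score], into a proxy URL."""
    ip, port = entry[0], entry[1]
    return "http://%s:%s" % (ip, port)
```

In the middleware, the result of to_proxy_url could be assigned to request.meta['proxy'] after picking one entry from the combined list.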

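That takes care of it.
Another problem is that most free proxy IPs are http-only and https ones are scarce, so they get used up quickly. At that point the database holds only http proxies, and no more https proxies get crawled. Later I would like to see whether this can be changed so that the https count on its own decides whether crawling continues. That said, I found that when the crawler runs for a long time, it does collect enough https proxies.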
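The idea of deciding per protocol whether to keep crawling could look roughly like the following. It is a sketch of the proposed change, not IPProxyPool's actual logic: should_crawl is a hypothetical helper, and the counts are assumed to come from a query against the proxy database.

```python
def should_crawl(counts, min_per_protocol):
    """Return True if any protocol has fewer stored proxies than its minimum.

    counts           -- e.g. {"http": 200, "https": 3}, taken from the database
    min_per_protocol -- e.g. {"http": 50, "https": 20}, chosen by the operator
    """
    return any(
        counts.get(protocol, 0) < minimum
        for protocol, minimum in min_per_protocol.items()
    )
```

With this check, a database full of http proxies would still trigger another crawl as long as the https count stays below its own minimum.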

Today I went back over 七夜's IPProxyPool. Overall it is fairly simple; at startup it launches four processes:

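one runs an HTTP server;
one crawls proxy IPs from proxy-listing sites and puts them into queue 1;
one takes proxy IPs from queue 1, detects their protocol and other properties, tests their score, and puts the results into queue 2;
one takes proxy IPs from queue 2 and stores them in the database.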
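The crawl, validate, and store stages listed above can be sketched with the standard multiprocessing module. This is a simplified illustration and not IPProxyPool's code: the HTTP server process is left out, check() is a placeholder instead of a real connectivity test, the "database" is a shared list, and a None sentinel (an assumption of this sketch) tells each stage when to stop.

```python
import multiprocessing as mp

SENTINEL = None  # marks the end of the stream in this sketch


def crawl(raw_q, candidates):
    """Crawler stage: push scraped proxy candidates onto queue 1."""
    for proxy in candidates:
        raw_q.put(proxy)
    raw_q.put(SENTINEL)


def check(proxy):
    """Placeholder validation; a real check would connect through the proxy."""
    ip, port = proxy
    if not 0 < port < 65536:
        return None
    return {"ip": ip, "port": port, "protocol": 0, "score": 10}


def validate(raw_q, checked_q):
    """Validator stage: read queue 1, validate, push results onto queue 2."""
    while True:
        proxy = raw_q.get()
        if proxy is SENTINEL:
            checked_q.put(SENTINEL)
            break
        result = check(proxy)
        if result is not None:
            checked_q.put(result)


def store(checked_q, db):
    """Storage stage: read queue 2 and persist each entry (here: a list)."""
    while True:
        item = checked_q.get()
        if item is SENTINEL:
            break
        db.append(item)


def run_pipeline(candidates):
    """Run the three stages as separate processes and return what was stored."""
    with mp.Manager() as manager:
        db = manager.list()  # list shared across processes
        raw_q, checked_q = mp.Queue(), mp.Queue()
        workers = [
            mp.Process(target=crawl, args=(raw_q, candidates)),
            mp.Process(target=validate, args=(raw_q, checked_q)),
            mp.Process(target=store, args=(checked_q, db)),
        ]
        for worker in workers:
            worker.start()
        for worker in workers:
            worker.join()
        return list(db)
```

Calling run_pipeline([("1.2.3.4", 8080)]) from under an `if __name__ == "__main__":` guard runs the three stages in parallel; the queues are what decouple the slow validation step from crawling and storage.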

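What I still don't really understand is the difference between multiprocessing and multithreading, and why multiple processes are used here. The coroutine part is also beyond me for now; I'll dig into it later.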