Scraping 58同城 second-hand housing listings with Scrapy: problems and solutions
Published: 2019-06-12


Test environment:

Windows 10, single-machine crawl, Scrapy 1.5.0, Python 3.6.4, MongoDB, Robo 3T

Other preparations:

Proxy pool: for this test I did not use my Flask-based live proxy fetcher, because the few free proxy sites I found did not have enough working IPs. Instead I batch-downloaded 800+ free HTTPS proxies from xxx, tested them against 58同城 with a thread pool, and saved the working IPs to a JSON file. A proxy middleware in the Scrapy project then picks a random proxy from that JSON on every request;
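A minimal sketch of what that proxy middleware might look like (the post does not show the original code; the file name proxies.json and its layout are my assumptions):

import json
import random

class MyProxyMiddleWare:
    """Attach a random proxy from the pre-validated pool to every request."""

    def __init__(self, proxy_file='proxies.json'):
        # Assumed layout: a JSON list such as ["https://1.2.3.4:8080", ...]
        with open(proxy_file, encoding='utf-8') as f:
            self.proxies = json.load(f)

    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(self.proxies)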

Request headers: I collected User-Agent strings for all sorts of browsers from around the web and added a User-Agent middleware to the Scrapy project that picks a random one for every request;
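The matching User-Agent middleware can be as simple as this sketch (the two entries stand in for the full collected list):

import random

class MyUserAgentMiddleWare:
    # Placeholder entries; the real list was collected from the web.
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/604.5.6 (KHTML, like Gecko) Version/11.0.3 Safari/604.5.6',
    ]

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)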

settings.py:

BOT_NAME = 'oldHouse'
SPIDER_MODULES = ['oldHouse.spiders']
NEWSPIDER_MODULE = 'oldHouse.spiders'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 1
RETRY_TIMES = 8
MONGO_URI = 'localhost'
MONGO_DATABASE = 'old58House'
ITEM_PIPELINES = {
    'oldHouse.pipelines.MongoPipeline': 300,
}
DOWNLOADER_MIDDLEWARES = {
    'oldHouse.middlewares.OldhouseDownloaderMiddleware': 543,
    'oldHouse.middlewares.MyProxyMiddleWare': 542,
    'oldHouse.middlewares.MyUserAgentMiddleWare': 541,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': None,
    'oldHouse.middlewares.MyRedirectMiddleware': 601,
    'oldHouse.middlewares.MyRetryMiddleware': 551,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
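ITEM_PIPELINES points at a MongoPipeline the post never shows. A sketch following the standard pymongo pipeline pattern from the Scrapy docs, reading MONGO_URI / MONGO_DATABASE from the settings above (the collection name is my assumption):

import pymongo

class MongoPipeline:
    collection_name = 'ershoufang'  # assumed collection name

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Insert each scraped listing as one MongoDB document.
        self.db[self.collection_name].insert_one(dict(item))
        return item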

In all of the analysis below:

real_url means a correct listing URL as given in 58同城's links, e.g. https://bj.58.com/ershoufang/37786966127392x.shtml

fake_url means a 58同城 URL containing 'zd_p'; we need to follow its redirects, hop by hop, until it reaches a real_url, e.g. https://short.58.com/zd_p/887076ce-1bfa-4142-ae0f-59c079a078e9/

jump_url means a URL that a fake_url redirects to; it is the bridge to reaching a real_url

firewall means a verification URL on the 58同城 servers, e.g. GET https://callback.58.com/firewall/verifycode?......
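Based on these patterns, a tiny helper (my own illustration, not from the original post) can classify any URL the crawler meets; the substrings come straight from the examples above and from the middleware code later on:

def classify_url(url):
    """Rough classification of 58同城 URLs by the substrings seen in this post."""
    if 'firewall' in url:
        return 'firewall'   # e.g. https://callback.58.com/firewall/verifycode?...
    if 'zd_p' in url:
        return 'fake_url'   # e.g. https://short.58.com/zd_p/...
    if 'Jump' in url:
        return 'jump_url'   # intermediate hop between fake_url and real_url
    return 'real_url'       # e.g. https://bj.58.com/ershoufang/...x.shtml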

I. The following situations came up during crawling:

 

1) real_url -> firewall -> firewall -> firewall -> too many retries, the request dies. Given a correct URL, the IP has been hitting the site so often that the crawler gets bounced to 58's verification URL; since I wrote no code to solve the verification, the URL is abandoned after a couple of retries. Example:

2019-04-16 14:19:08 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to
from
2019-04-16 14:19:18 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying
(failed 1 times): An error occurred while connecting: [Failure instance: Traceback (failure with no frames):
: Connection to the other side was lost in a non-clean fashion: Connection lost.

2) real_url -> firewall -> firewall, the firewall page content is returned -> because it is the wrong page, data extraction raises a NoneType error. Example:

2019-04-16 14:18:49 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to
from
2019-04-16 14:18:52 [scrapy.core.engine] DEBUG: Crawled (200)
(referer: https://bj.58.com/ershoufang/)
2019-04-16 14:18:52 [scrapy.core.scraper] ERROR: Spider error processing
(referer: https://bj.58.com/ershoufang/)
Traceback (most recent call last):

3) fake_url -> jump_url -> jump_url -> jump_url, the URL is given up. Example:

2019-04-16 16:24:18 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to
from
2019-04-16 16:24:52 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying
(failed 1 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2019-04-16 16:24:59 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying
(failed 2 times): Could not open CONNECT tunnel with proxy 104.236.248.219:3128 [{'status': 503, 'reason': b'Service Unavailable'}]
2019-04-16 16:25:05 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying
(failed 3 times): [
]

4) fake_url -> jump_url -> real_url -> firewall: after finally landing a real_url, the frequent requests slam it into the wall again. Example:

2019-04-16 14:19:01 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to
from
2019-04-16 14:19:12 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to
from
2019-04-16 14:19:18 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to
from

II. Solutions to the above situations

Overall approach:

Since no login is required to crawl this data, the hard-to-crack trajectory verification served by the 58 firewall can simply be dodged by switching to another IP;

The failed requests trace back mainly to redirects and retries. For retries: since the hit rate of my proxy pool is not high and I am not maintaining a live pool with Flask, I give a larger RETRY_TIMES. For redirects: the logs above show redirects between all kinds of URLs, so a redirect middleware is unavoidable, and each type of redirect needs its own handling in process_response. The redirect problem is worked out in detail below.

The fix:

A quick recap of the four redirect patterns:

1. real_url -> firewall -> firewall -> firewall -> too many retries, dies. Cause: requests are too frequent and redirects are enabled, so the crawler follows to firewall instead of re-crawling the real_url.
2. real_url -> firewall -> firewall, firewall page returned -> NoneType error during extraction. Same cause: requests too frequent, redirects enabled, so we land on firewall instead of re-crawling the real_url.
3. fake_url -> jump_url -> jump_url -> jump_url, URL given up. Most likely bad proxies causing endless retries.
4. fake_url -> jump_url -> real_url -> firewall. After finally reaching a real_url, the frequent requests hit the wall again, collapsing into pattern 1.

Working through them one by one. Simply setting REDIRECT_ENABLED = False in settings.py would be convenient, but it will not do: as pattern 4 shows, a fake_url can hop, hop, hop all the way to the real_url we actually want. That is exactly the trap 58同城 has set.

Plan for patterns 1 and 2: both go off the rails the moment real_url redirects to firewall, so just forbid the real_url from redirecting: when the current request is a real_url, set dont_redirect=True in scrapy.Request (default False; see the sketch after this section). But that is not the end of it. real_url complains: you forbid me to redirect, yet you hand me a garbage IP that slams me into the wall, and then you do nothing about the crash. It is as if Xiaoqiang promised to back Xiaoming up in a fight and never showed up, leaving Xiaoming to get beaten black and blue in a 1v5 while Xiaoqiang stayed home having a great time. The result is a redirect with no follow-up handling, and the log shows:

2019-04-17 08:10:06 [scrapy.core.engine] DEBUG: Crawled (302)
(referer: https://bj.58.com/ershoufang/)
2019-04-17 08:10:06 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <302 https://bj.58.com/ershoufang/>: HTTP status code is not handled or not allowed
2019-04-17 08:10:06 [scrapy.core.engine] INFO: Closing spider (finished)

If this real_url only showed up later in the crawl that would be tolerable, but as shown above it was an initial URL, so the spider shut down without crawling a single page. Game over. So what instead? Drop the per-request dont_redirect=True idea and keep redirects globally enabled. Write a custom MyRedirectMiddleware that fully inherits the RedirectMiddleware behaviour and adds a detection hook at the point real_url -> firewall: catch the real_url before the jump actually happens and return Request(real_url, ...). One more thing: this real_url has already been crawled once and is recorded in the fingerprint set, so remember to pass dont_filter=True, and remember to set callback=spider.parse_xxx.

Plan for pattern 3: in the same custom MyRedirectMiddleware, add a detection hook at fake_url -> jump_url: whenever the redirect target is a jump_url, grant extra retries. With the proxy middleware in place, this all but guarantees we eventually obtain the real_url.

Plan for pattern 4: again in MyRedirectMiddleware, detect jump_url -> real_url: if the redirect target is neither a jump_url nor a firewall, it is almost certainly the real_url, so simply let the redirect through. After all this talk, the good news is that pattern 4 needs no work of its own: the jump_url -> real_url leg is covered by globally enabled redirects plus the persistent retries that pattern 3 grants jump_url, so a real_url will always be reached; and the real_url -> firewall leg is exactly what pattern 1 solves. Pattern 4 resolves itself.
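For reference, this is roughly what the abandoned per-request attempt for patterns 1 and 2 would have looked like (a sketch only; extract_listing_urls is a hypothetical helper, and parse_detail matches the callback name used in the middleware below):

# inside the spider: the abandoned attempt for patterns 1 and 2
def parse(self, response):
    for real_url in self.extract_listing_urls(response):  # hypothetical helper
        yield scrapy.Request(
            real_url,
            meta={'dont_redirect': True},  # RedirectMiddleware hands the 302 back untouched
            callback=self.parse_detail,
        )

The returned 302 then falls through to HttpErrorMiddleware, which ignores it; that is exactly the "Ignoring response <302 ...>" line in the log above.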

 

The concrete fix, excerpted from the redirect middleware code:

# -*- coding:utf-8 -*-
# Author: Tarantiner
# @Time : 2019/4/17 18:26
from urllib.parse import urljoin

from w3lib.url import safe_url_string

from scrapy import Request
from scrapy.downloadermiddlewares.redirect import BaseRedirectMiddleware


class MyRedirectMiddleware(BaseRedirectMiddleware):

    def process_response(self, request, response, spider):
        # Standard RedirectMiddleware checks: honour dont_redirect and
        # any explicitly handled status codes.
        if (request.meta.get('dont_redirect', False) or
                response.status in getattr(spider, 'handle_httpstatus_list', []) or
                response.status in request.meta.get('handle_httpstatus_list', []) or
                request.meta.get('handle_httpstatus_all', False)):
            return response

        allowed_status = (301, 302, 303, 307, 308)
        if 'Location' not in response.headers or response.status not in allowed_status:
            return response

        location = safe_url_string(response.headers['location'])
        redirected_url = urljoin(request.url, location)

        if response.status in (301, 307, 308) or request.method == 'HEAD':
            redirected = request.replace(url=redirected_url)
            return self._redirect(redirected, request, spider, response.status)

        if 'firewall' in redirected_url:
            # Guard against patterns 1 and 2: real_url -> firewall.
            # Instead of following the redirect, re-request the real_url itself.
            return Request(response.url, callback=spider.parse_detail, dont_filter=True)

        if 'Jump' in redirected_url:
            # Guard against pattern 3: fake_url -> jump_url -> jump_url -> ... gives up.
            # Every time this jump URL comes around, a fresh retry budget is granted,
            # which in effect means unlimited retries.
            new_request = request.replace(url=redirected_url, method='GET', body='',
                                          meta={'max_retry_times': 12})
        else:
            new_request = self._redirect_request_using_get(request, redirected_url)
        return self._redirect(new_request, request, spider, response.status)
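Why does meta={'max_retry_times': 12} amount to effectively unlimited retries? request.replace(..., meta=...) discards the old meta, including the retry_times counter that the retry middleware increments. The relevant check inside Scrapy 1.5's stock RetryMiddleware (which MyRetryMiddleware presumably builds on) works roughly like this paraphrased sketch, so every pass through the 'Jump' branch restarts the count from zero:

# paraphrase of the check inside scrapy.downloadermiddlewares.retry.RetryMiddleware
retries = request.meta.get('retry_times', 0) + 1
max_retry_times = request.meta.get('max_retry_times', self.max_retry_times)
if retries <= max_retry_times:
    retryreq = request.copy()              # reschedule a copy of the failed request
    retryreq.meta['retry_times'] = retries
    retryreq.dont_filter = True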
Crawl behaviour after the fix:

Effect for patterns 1 and 2: real_url -> real_url -> 200, as below:

redirected_url: https://callback.58.com/firewall/verifycode?serialId=8b8b4a1ead5a3ded505d96dcc8e42004_21b60bb0e6194aeea99c0b42f0f99c2f&code=22&sign=f86bd444c70b93fc537503ef857276ec&namespace=ershoufangphp&url=https%3A%2F%2Fbj.58.com%2Fershoufang%2F37688560543505x.shtml
response_url: https://bj.58.com/ershoufang/37688560543505x.shtml
2019-04-17 18:49:02 [scrapy.core.engine] DEBUG: Crawled (200)
(referer: https://bj.58.com/ershoufang/)
2019-04-17 18:49:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bj.58.com/ershoufang/37610200172685x.shtml?adtype=3>

As you can see, after real_url -> firewall the crawler never actually fetched the firewall page; it went back to crawling the real_url and got a 200.

Effect for pattern 3: fake_url -> jump_url -> jump_url -> real_url -> 200, as below:

2019-04-16 22:43:27 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to
from
2019-04-16 22:43:59 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying
(failed 1 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2019-04-16 22:44:17 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to
from
2019-04-16 22:44:38 [scrapy.core.engine] DEBUG: Crawled (200)
(referer: https://bj.58.com/ershoufang/)
2019-04-16 22:44:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bj.58.com/ershoufang/37587624035103x.shtml?adtype=3>

Effect for pattern 4: fake_url -> jump_url -> real_url -> 200, as below:

2019-04-16 22:43:33 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to
from
2019-04-16 22:43:40 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to
from
2019-04-16 22:43:53 [scrapy.core.engine] DEBUG: Crawled (200)
(referer: https://bj.58.com/ershoufang/)
2019-04-16 22:43:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bj.58.com/ershoufang/37756045083030x.shtml?adtype=3>

Better still, I pulled out one perfect log; watching the URL redirect exactly as expected is deeply satisfying:

fake_url -> jump_url -> real_url -> retry 1 time -> retry 2 times -> firewall (never actually fetched; re-Requested instead) -> real_url -> 200

2019-04-17 16:26:52 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to
from
2019-04-17 16:27:04 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to
from
2019-04-17 16:27:26 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying
(failed 1 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2019-04-17 16:28:00 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying
(failed 2 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
redirect_url: https://callback.58.com/firewall/verifycode?serialId=70e3ea25cb505bc3d0746bb61d508d53_6da701bcb6ca44fd92bbe820a73dca84&code=22&sign=cc2a1d287fa102f0f21d33d91b3c51ea&namespace=ershoufangphp&url=https%3A%2F%2Fbj.58.com%2Fershoufang%2F37688797098651x.shtml%3Fadtype%3D3
response_url: https://bj.58.com/ershoufang/37688797098651x.shtml?adtype=3
2019-04-17 16:28:50 [scrapy.core.engine] DEBUG: Crawled (200)
(referer: https://bj.58.com/ershoufang/)
2019-04-17 16:28:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bj.58.com/ershoufang/37688797098651x.shtml?adtype=3>

This one crawl path walks through all four patterns and still lands the data in the end, which should make it fairly representative.

Stats from one of the runs:

{'downloader/exception_count': 136,
 'downloader/exception_type_count/scrapy.core.downloader.handlers.http11.TunnelError': 16,
 'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 11,
 'downloader/exception_type_count/twisted.internet.error.TCPTimedOutError': 76,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 30,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 3,
 'downloader/request_bytes': 384128,
 'downloader/request_count': 750,
 'downloader/request_method_count/GET': 750,
 'downloader/response_bytes': 2385832,
 'downloader/response_count': 614,
 'downloader/response_status_count/200': 123,
 'downloader/response_status_count/302': 490,
 'downloader/response_status_count/504': 1,
 'dupefilter/filtered': 122,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 4, 17, 10, 52, 26, 392186),
 'item_scraped_count': 122,
 'log_count/DEBUG': 500,
 'log_count/INFO': 27,
 'log_count/WARNING': 2,
 'request_depth_max': 1,
 'response_received_count': 123,
 'retry/count': 137,
 'retry/reason_count/504 Gateway Time-out': 1,
 'retry/reason_count/scrapy.core.downloader.handlers.http11.TunnelError': 16,
 'retry/reason_count/twisted.internet.error.ConnectionRefusedError': 11,
 'retry/reason_count/twisted.internet.error.TCPTimedOutError': 76,
 'retry/reason_count/twisted.internet.error.TimeoutError': 30,
 'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 3,
 'scheduler/dequeued': 750,
 'scheduler/dequeued/memory': 750,
 'scheduler/enqueued': 750,
 'scheduler/enqueued/memory': 750,
 'start_time': datetime.datetime(2019, 4, 17, 10, 33, 6, 936247)}
2019-04-17 18:52:26 [scrapy.core.engine] INFO: Spider closed (finished)

Clearly there is still plenty of room for improvement; I will share my optimization ideas in a follow-up O(∩_∩)O

 

Reposted from: https://www.cnblogs.com/tarantino/p/10725656.html
