Python Crawler Power Tools | Integrating pyppeteer with Scrapy

Published: 2020-02-22

Scrapy is a fast, high-level screen scraping and web crawling framework written in Python, used to crawl websites and extract structured data from their pages. It has a wide range of applications, including data mining, monitoring, and automated testing.

What makes Scrapy attractive is that it is a framework anyone can easily adapt to their own needs. It also ships base classes for several kinds of spiders, such as BaseSpider and sitemap spiders, and recent versions add support for crawling web 2.0 sites.
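
As a quick illustration, a minimal Spider subclass looks like the sketch below; the spider name and target site are placeholders, not part of the original article:

import scrapy


class QuotesSpider(scrapy.Spider):
    # Hypothetical example spider, not taken from the original article.
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Pull structured data out of the page with CSS selectors.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }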


A conventional pyppeteer middleware

In the conventional pyppeteer middleware, pyppeteer is called synchronously even though it is an asyncio-based asynchronous framework, so its asynchronous strengths are lost and Scrapy is blocked: total concurrency effectively drops to 1. See the GitHub project (https://github.com/Python3WebSpider/ScrapyPyppeteer.git).

import websockets
from scrapy.http import HtmlResponse
from logging import getLogger
import asyncio
import pyppeteer
import logging
from pyppeteer.errors import TimeoutError


class PyppeteerMiddleware():
    def render(self, url, timeout=8.0, **kwargs):
        async def async_render(url, **kwargs):
            page = None
            try:
                page = await self.browser.newPage()
                response = await page.goto(url, options={'timeout': int(timeout * 1000)})
                content = await page.content()
                return content, response.status
            except TimeoutError:
                return None, 500
            finally:
                # Always release the page, even when navigation failed or timed out.
                if page is not None and not page.isClosed():
                    await page.close()

        # Run the coroutine to completion on the event loop and return the result.
        loop = asyncio.get_event_loop()
        content, status = loop.run_until_complete(async_render(url, **kwargs))
        return content, status
    
    def process_request(self, request, spider):
        if request.meta.get('render') == 'pyppeteer':
            try:
                html, status = self.render(request.url)
                return HtmlResponse(url=request.url, body=html, request=request, encoding='utf-8',
                                    status=status)
            except websockets.exceptions.ConnectionClosed:
                pass
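
To actually use this middleware, register it in settings.py and flag the requests that should go through pyppeteer; the project path, priority value, and spider below are placeholder sketches, not code from the referenced repository:

# settings.py -- the module path and priority are placeholders
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.PyppeteerMiddleware': 543,
}

# spider: only requests carrying meta={'render': 'pyppeteer'} are rendered
import scrapy


class RenderSpider(scrapy.Spider):
    name = 'render_demo'

    def start_requests(self):
        yield scrapy.Request('https://example.com',
                             meta={'render': 'pyppeteer'},
                             callback=self.parse)

    def parse(self, response):
        self.logger.info('page title: %s', response.css('title::text').get())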
    

An asynchronous pyppeteer middleware

Making the pyppeteer middleware asynchronous takes two steps:

  1. In process_request, call the pyppeteer request coroutine asynchronously and use Deferred.fromFuture to wrap the resulting asyncio future in a Twisted Deferred
import asyncio

from twisted.internet.defer import Deferred
from scrapy.http import HtmlResponse


def as_deferred(f):
    """Wrap an asyncio coroutine or future in a Twisted Deferred"""

    return Deferred.fromFuture(asyncio.ensure_future(f))


class PuppeteerMiddleware:
    async def _process_request(self, request, spider):
        """Handle the request using Puppeteer"""

        page = await self.browser.newPage()

        ......

        return HtmlResponse(
            page.url,
            status=response.status,
            headers=response.headers,
            body=body,
            encoding='utf-8',
            request=request
        )

    def process_request(self, request, spider):
        """Check if the Request should be handled by Puppeteer"""

        if request.meta.get('render') == 'pyppeteer':
            return as_deferred(self._process_request(request, spider))
        
        return None
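
The middleware above still needs a browser to create pages from. Below is a minimal sketch of one way to manage the browser lifecycle, showing only the lifecycle methods of the same class; the method names, headless flag, and signal wiring are assumptions, not the exact code of the referenced project:

import pyppeteer
from scrapy import signals


class PuppeteerMiddleware:
    # Lifecycle part of the PuppeteerMiddleware class above (sketch only).

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        # Launch one browser per crawl and tear it down when the spider closes.
        crawler.signals.connect(middleware.spider_opened, signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware

    def spider_opened(self, spider):
        # as_deferred is the helper defined earlier in this article.
        return as_deferred(self._start_browser())

    async def _start_browser(self):
        self.browser = await pyppeteer.launch(headless=True)

    def spider_closed(self, spider):
        return as_deferred(self.browser.close())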


  2. Since Scrapy is built on Twisted while pyppeteer is built on asyncio, the two reactors have to be made to work together.

Twisted already has a solution for running on top of asyncio: asyncioreactor. The catch is that it must be installed before Scrapy is imported or anything else happens, so sort out the reactor before importing execute:

import asyncio
from twisted.internet import asyncioreactor

asyncioreactor.install(asyncio.get_event_loop())

'''
The three lines above must come before any Scrapy import,
otherwise asyncio cannot be hooked into the Twisted reactor.
'''

from scrapy.cmdline import execute


execute("scrapy crawl spider_name".split())


See the GitHub project (https://github.com/clemfromspace/scrapy-puppeteer.git).

With this in place, the middleware is compatible with Scrapy's concurrency settings.
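
For example, the asynchronous middleware can be registered like any other downloader middleware, and the usual concurrency settings then apply to the rendered requests as well; the module path, priority, and limits below are placeholder values:

# settings.py -- module path, priority and limits are placeholders
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.PuppeteerMiddleware': 800,
}

# Because the middleware no longer blocks the Twisted reactor,
# these limits are honoured for pyppeteer-rendered requests too.
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8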


Integration with Scrapy

Add the downloader middleware:

from scrapy import signals
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
import random
import pyppeteer
import asyncio
import os
from scrapy.http import HtmlResponse
 
pyppeteer.DEBUG = False 
 
class FundscrapyDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.
    def __init__(self):
        print("Init downloader middleware using pyppeteer.")
        os.environ['PYPPETEER_CHROMIUM_REVISION'] ='588429'
        # pyppeteer.DEBUG = False
        print(os.environ.get('PYPPETEER_CHROMIUM_REVISION'))
        loop = asyncio.get_event_loop()
        task = asyncio.ensure_future(self.getbrowser())
        loop.run_until_complete(task)
 
        #self.browser = task.result()
        print(self.browser)
        print(self.page)
        # self.page = await browser.newPage()
    async def getbrowser(self):
        self.browser = await pyppeteer.launch()
        self.page = await self.browser.newPage()
        # return await pyppeteer.launch()
    async def getnewpage(self): 
        return  await self.browser.newPage()
 
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s
 
    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
 
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        loop = asyncio.get_event_loop()
        task = asyncio.ensure_future(self.usePypuppeteer(request))
        loop.run_until_complete(task)
        # return task.result()
        return HtmlResponse(url=request.url, body=task.result(), encoding="utf-8",request=request)
 
    async def usePypuppeteer(self, request):
        print(request.url)
        # page = await self.browser.newPage()
        await self.page.goto(request.url)
        content = await self.page.content()
        return content 
 
    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
 
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response
 
    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
 
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass
 
    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
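
To wire this middleware into a project, enable it in settings.py; the module path and priority below are placeholders inferred from the class name. Note that this version does not check a meta flag, so every request is rendered through the single shared page:

# settings.py -- module path and priority are placeholders
DOWNLOADER_MIDDLEWARES = {
    'fundscrapy.middlewares.FundscrapyDownloaderMiddleware': 543,
}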
