Scraping Grouped Data with Python Scrapy

Published: 2020-10-15

In earlier posts we scraped content one item at a time. Today the target is a group of HTML tags: for example, a ul tag that contains many li tags, and we iterate over each li in turn.
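
To make the pattern concrete, here is a minimal sketch, run on a made-up HTML snippet rather than the real page, of selecting the group once and then querying each member with a relative XPath:

from scrapy import Selector

# Made-up snippet mimicking the structure scraped below
html = '''
<ul class="News_list">
  <li><a href="/news/1.html">News one</a></li>
  <li><a href="/news/2.html">News two</a></li>
</ul>
'''

sel = Selector(text=html)
# The leading "./" anchors each inner XPath at the current <li>
for li in sel.xpath('//ul[@class="News_list"]/li'):
    print(li.xpath('./a/text()').get(), li.xpath('./a/@href').get())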

Today's target is a news listing page. Each entry leads to a detail page containing a download link, so we extract the news title and that download link; other sites should work much the same way. The code is as follows:

1. The spider file:

import scrapy

class File01Spider(scrapy.Spider):
    name = 'file01'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['http://smzt.gd.gov.cn/zwgk/rsrm/']

    def parse(self, response):
        # Each <li> under <ul class="News_list"> is one news entry
        news_list = response.xpath('//ul[@class="News_list"]/li')
        for news in news_list:
            newsname = news.xpath('./a/text()').extract_first()
            newsurl = news.xpath('./a/@href').extract_first()
            item = {'newsname': newsname, 'newsurl': newsurl}
            # Follow the detail page, carrying the partial item along in meta
            yield scrapy.Request(item['newsurl'], callback=self.newsfile, meta={'item': item})

    def newsfile(self, response):
        item = response.meta['item']
        item['downlink'] = response.xpath('//div[@class="info_cont"]//a/@href').extract()
        if item['downlink']:
            # Keep the first download link found on the detail page
            item['downlink'] = item['downlink'][0]
        else:
            item['downlink'] = 'No download link'
        yield item
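
One caveat: if the list page emits relative hrefs, scrapy.Request will refuse them. A hedged variant of the two callbacks using response.follow, which resolves relative URLs against the current page, and cb_kwargs (available since Scrapy 1.7) in place of meta:

    def parse(self, response):
        for news in response.xpath('//ul[@class="News_list"]/li'):
            item = {'newsname': news.xpath('./a/text()').get()}
            # response.follow joins a relative href with response.url for us
            yield response.follow(news.xpath('./a/@href').get(),
                                  callback=self.newsfile,
                                  cb_kwargs={'item': item})

    def newsfile(self, response, item):
        # The item arrives as a keyword argument instead of via response.meta
        item['downlink'] = response.xpath('//div[@class="info_cont"]//a/@href').get() or 'No download link'
        yield item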

2. The items file:

import scrapy

class YxqItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    newsname = scrapy.Field()
    downlink = scrapy.Field()
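
The spider above yields a plain dict, so this item class is never actually instantiated; declaring fields still pays off, though, because assigning an undeclared key on an Item raises KeyError and catches typos early. A small sketch (the yxq package name is only inferred from the class names in this post):

from yxq.items import YxqItem  # assumed project package name

item = YxqItem()
item['newsname'] = 'Some news title'
item['downlink'] = 'http://example.com/file.pdf'
# item['newsurl'] = '...' would raise KeyError: 'newsurl' is not a declared field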

3. The pipelines file:

class YxqPipeline:

    def process_item(self, item, spider):
        # Print each scraped entry; a real pipeline would persist it instead
        newsname = item['newsname']
        downlink = item['downlink']
        print(newsname, downlink)
        return item
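
Note that Scrapy only runs this pipeline if it is enabled in settings.py. A sketch, again assuming the project package is named yxq (300 is the conventional mid-range priority; lower numbers run first):

# settings.py
ITEM_PIPELINES = {
    'yxq.pipelines.YxqPipeline': 300,
}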

Result: the pipeline prints each news title together with its download link.
