Scrapy Framework - CrawlSpider Example: Crawling Mini Program Community Articles

Introduction to CrawlSpider

A plain Spider can already handle most crawling jobs, but CrawlSpider is built for whole-site crawling.
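
For contrast, a plain Spider has to build its follow-up requests by hand. The sketch below is not part of this project, and the next-page selector is only an assumption; it just illustrates the boilerplate that CrawlSpider removes.

import scrapy

class ManualPagingSpider(scrapy.Spider):
    # hypothetical plain Spider: pagination has to be followed manually
    name = 'manual_paging_example'
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=1&page=1']

    def parse(self, response):
        # ...extract data from the listing page here...
        next_page = response.xpath('//a[@class="nxt"]/@href').get()  # assumed next-page selector
        if next_page:
            yield response.follow(next_page, callback=self.parse)

With CrawlSpider, the link following is declared once in rules and handled by the framework.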

Create the CrawlSpider Project

scrapy startproject wxapp

Create the CrawlSpider Spider

scrapy genspider -t crawl [spider name] [domain of the target site]

scrapy genspider -t crawl wxapp_spider wxapp-union.com
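
After running the command, Scrapy creates wxapp_spider.py under the project's spiders/ directory. The generated file looks roughly like this (exact placeholder values can differ between Scrapy versions):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class WxappSpiderSpider(CrawlSpider):
    name = 'wxapp_spider'
    allowed_domains = ['wxapp-union.com']
    start_urls = ['http://wxapp-union.com/']

    rules = (
        # placeholder rule from the template; it gets replaced below
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        return item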

Adjust settings.py

# Set the user agent
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'  # any common browser UA string works here

# Disable robots.txt compliance
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Set a delay between requests
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1

# Set default request headers
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
}

# Enable the item pipeline so items get written to a file
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'wxapp.pipelines.WxappPipeline': 300,
}
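
If you are unsure whether a setting is being picked up, Scrapy can print the effective value from inside the project directory, for example:

scrapy settings --get ROBOTSTXT_OBEY
scrapy settings --get DOWNLOAD_DELAY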

Define the Item Fields

Open items.py

class WxappItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # article title
    title = scrapy.Field()
    # article content
    article = scrapy.Field()
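
WxappItem instances behave like dictionaries with a fixed set of keys; assigning a field that was not declared raises a KeyError. A quick interactive check from the project directory (for example inside scrapy shell):

>>> from wxapp.items import WxappItem
>>> item = WxappItem(title='demo', article='demo body')
>>> item['title']
'demo'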

Write the Rules for Extracting Page and Article Links

Go into the /spiders directory and confirm that the spider class in wxapp_spider.py inherits from CrawlSpider.

Multiple Rule() rules can be defined inside rules = ().

Rule() parameters:
  • LinkExtractor(allow=r'pattern'): defines which links to match on the page; regular expressions are supported
  • callback='function name': the function called for each response matched by the link extractor
  • follow=True: whether to keep applying the rules to pages reached through matched links (this is what makes pagination work)
name = 'wxapp_spider'
allowed_domains = ['www.wxapp-union.com']
start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=1&page=1']

rules = (
    # Rule for the pagination links (multiple Rule() entries can be defined)
    Rule(
        # regex that matches the listing-page links (regular expressions are supported)
        LinkExtractor(allow=r'.+mod=list&catid=1&page=\d+'),
        # no callback: these pages are only followed so more links can be extracted from them
        follow=True
    ),

    # Rule for the article pages
    Rule(
        # regex that matches article links
        LinkExtractor(allow=r'http://www.wxapp-union.com/article-(.*)-(.*)\.html'),
        # callback that parses each matched article page
        callback='parse_article',
        follow=False,
    )
)
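
The regexes above can be checked against a live listing page in scrapy shell before running the full crawl; a quick session might look like this:

scrapy shell "http://www.wxapp-union.com/portal.php?mod=list&catid=1&page=1"
>>> from scrapy.linkextractors import LinkExtractor
>>> links = LinkExtractor(allow=r'.+mod=list&catid=1&page=\d+').extract_links(response)
>>> [link.url for link in links][:5]   # first few pagination URLs found on the page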

Add the Article Parsing Function parse_article()

Import the item class: from wxapp.items import WxappItem

    def parse_item(self, response):
        # default callback generated by the template; not used in this example
        pass

    def parse_article(self, response):
        title = response.xpath('//div[@class="h hm cl"]/div/h1/text()').get()
        articleList = response.xpath('//td[@id="article_content"]//text()').getall()
        article = "".join(articleList).strip()  # join the returned list and strip surrounding whitespace
        yield WxappItem(title=title, article=article)
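
The two XPath expressions can be verified the same way by opening any article page in scrapy shell; the article id below is made up, substitute a real one taken from a listing page:

scrapy shell "http://www.wxapp-union.com/article-1234-1.html"   # 1234 is a hypothetical article id
>>> response.xpath('//div[@class="h hm cl"]/div/h1/text()').get()
>>> len(response.xpath('//td[@id="article_content"]//text()').getall())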

Store the Crawled Data

Open the pipelines.py file

Import the exporter: from scrapy.exporters import JsonLinesItemExporter

class WxappPipeline(object):

    def open_spider(self, spider):
        # open the output file in binary mode (the exporter writes bytes)
        self.ft = open('articleInfo.json', 'wb')
        self.exporter = JsonLinesItemExporter(self.ft, ensure_ascii=False, encoding='utf-8')

    def close_spider(self, spider):
        self.ft.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
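
JsonLinesItemExporter writes one JSON object per line and needs no setup or teardown calls. If a single JSON array is preferred instead, JsonItemExporter can be swapped in, but it requires explicit start/finish calls; a sketch of that variant:

from scrapy.exporters import JsonItemExporter

class WxappJsonArrayPipeline(object):
    # hypothetical variant pipeline that writes one JSON array instead of JSON lines

    def open_spider(self, spider):
        self.ft = open('articleInfo.json', 'wb')
        self.exporter = JsonItemExporter(self.ft, ensure_ascii=False, encoding='utf-8')
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.ft.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

If you use this variant, point ITEM_PIPELINES at it instead of WxappPipeline.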

Run the Spider

scrapy crawl wxapp_spider
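
For a quick export without a custom pipeline, Scrapy's feed exports can also write the items directly from the command line; setting FEED_EXPORT_ENCODING = 'utf-8' in settings.py keeps Chinese text unescaped in the output file:

scrapy crawl wxapp_spider -o articleInfo.jl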
