Introduction to CrawlSpider
The plain Spider class can already handle many crawling jobs, but CrawlSpider is built for whole-site crawling: it follows links automatically according to a set of rules.
Create the CrawlSpider project
scrapy startproject wxapp
Create the CrawlSpider spider
scrapy genspider -t crawl [spider name] [domain of the target site]
scrapy genspider -t crawl wxapp_spider wxapp-union.com
Modify settings.py
# Set the user-agent (fill in a real browser User-Agent string)
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = ''

# Disable the robots protocol
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Set a download delay
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1

# Set default request headers
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable the item pipeline so items can be written to a file
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'wxapp.pipelines.WxappPipeline': 300,
}
Define the Item fields
Open items.py
class WxappItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # article title
    title = scrapy.Field()
    # article content
    article = scrapy.Field()
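A WxappItem behaves like a dictionary restricted to the declared fields. A minimal standalone sketch, just for illustration:

from wxapp.items import WxappItem

item = WxappItem(title='a title', article='some text')
print(item['title'])   # declared fields support dict-style access
print(dict(item))      # an Item converts cleanly to a plain dict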
Write the rules that match list pages and article links
Go into the /spiders directory and open wxapp_spider.py; notice that the generated spider class inherits from CrawlSpider instead of scrapy.Spider.
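For reference, the file genspider creates looks roughly like the sketch below (the exact boilerplate and the placeholder rule differ slightly between Scrapy versions):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class WxappSpiderSpider(CrawlSpider):
    name = 'wxapp_spider'
    allowed_domains = ['wxapp-union.com']
    start_urls = ['http://wxapp-union.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        return item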
Multiple Rule() rules can be defined inside rules = ().
The Rule() parameters:
- LinkExtractor(allow=r'pattern'): the rule used to match links on the page; regular expressions are supported
- callback='function name': the function that is called with the response of each matched link
- follow=True: after a matched page has been crawled, whether to keep applying the rules to the links found on that page (this is what makes pagination work)
# in wxapp_spider.py, inside the spider class:
name = 'wxapp_spider'
allowed_domains = ['www.wxapp-union.com']
start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=1&page=1']

rules = (
    # rule that matches the list pages (several link-matching Rule() objects can be defined)
    Rule(
        # link pattern for the list pages (regular expressions are supported)
        LinkExtractor(allow=r'.+mod=list&catid=1&page=\d+'),
        # no callback here: the list pages are only used to discover more links
        follow=True
    ),
    # rule that matches the article pages
    Rule(
        # link pattern for articles
        LinkExtractor(allow=r'http://www.wxapp-union.com/article-(.*)-(.*)\.html'),
        # parse function called for each matched article link
        callback='parse_article',
        follow=False,
    ),
)
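Before wiring an allow pattern into a Rule, it can be sanity-checked on its own. The snippet below is only a sketch of a scrapy shell session (the shell provides the response object for the fetched page):

# scrapy shell "http://www.wxapp-union.com/portal.php?mod=list&catid=1&page=1"
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(allow=r'.+mod=list&catid=1&page=\d+')
for link in le.extract_links(response):  # Link objects for every matching anchor
    print(link.url)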
Add parse_article(), the function that parses an article page
Import the item class: from wxapp.items import WxappItem
def parse_item(self, response):
    # default callback from the template; not used here
    # print(response.url)
    pass

def parse_article(self, response):
    title = response.xpath('//div[@class="h hm cl"]/div/h1/text()').get()
    articleList = response.xpath('//td[@id="article_content"]//text()').getall()
    # join the returned list of text nodes and strip leading/trailing whitespace
    article = "".join(articleList).strip()
    yield WxappItem(title=title, article=article)
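To check the two XPath expressions, they can be tried in scrapy shell against any article URL that matches the rule (the URL below is only a placeholder):

scrapy shell "http://www.wxapp-union.com/article-1234-1.html"
>>> response.xpath('//div[@class="h hm cl"]/div/h1/text()').get()
>>> "".join(response.xpath('//td[@id="article_content"]//text()').getall()).strip()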
Store the scraped data
Open pipelines.py
Import the exporter module: from scrapy.exporters import JsonLinesItemExporter
class WxappPipeline(object):
    def open_spider(self, spider):
        # open the output file in binary mode
        self.ft = open('articleInfo.json', 'wb')
        self.exporter = JsonLinesItemExporter(self.ft, ensure_ascii=False, encoding='utf-8')

    def close_spider(self, spider):
        self.ft.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
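JsonLinesItemExporter writes one JSON object per line (JSON Lines), so articleInfo.json ends up with lines shaped like the following (the values are placeholders):

{"title": "some article title", "article": "the article text ..."}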
Run the spider
scrapy crawl wxapp_spider
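As a side note, Scrapy's built-in feed exports can produce the same kind of JSON Lines file without a custom pipeline; a minimal sketch, assuming the default feed settings:

# write items to a JSON Lines file directly from the command line
scrapy crawl wxapp_spider -o articleInfo.jl

Setting FEED_EXPORT_ENCODING = 'utf-8' in settings.py keeps non-ASCII text (such as the Chinese article content) unescaped in the exported file.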