A simple Scrapy spider example: crawling the links on this site's home page

0x00 Create the Scrapy project

Change into the directory where the project should be created and run the following command:

scrapy startproject bigdataboy

0x01 Create the spider

Change into the project directory and run the following command to create the first spider:

scrapy genspider bigdataboy_spider bigdataboy.cn

0x02 Configure the project

Open settings.py and uncomment the following settings:

# Enable the item pipeline
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'bigdataboy.pipelines.BigdataboyPipeline': 300,
}

# Set a 'User-Agent'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = ''  # fill in your own User-Agent string
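The number 300 in ITEM_PIPELINES is a priority: Scrapy passes each item through the enabled pipelines in ascending order of these integers, which are customarily chosen in the 0-1000 range. The sketch below only illustrates that ordering rule with plain Python and is not Scrapy's own code; HypotheticalCleanupPipeline and the helper pipeline_run_order are made-up names used for illustration.

```python
# A plain-Python illustration of how the priority numbers order the pipelines.
# HypotheticalCleanupPipeline does not exist in this project; it is only here
# so that there are two entries to order.
ITEM_PIPELINES = {
    "bigdataboy.pipelines.BigdataboyPipeline": 300,
    "bigdataboy.pipelines.HypotheticalCleanupPipeline": 100,
}


def pipeline_run_order(pipelines):
    """Return the pipeline paths sorted from lowest to highest priority value."""
    return [path for path, _priority in sorted(pipelines.items(), key=lambda kv: kv[1])]


if __name__ == "__main__":
    for path in pipeline_run_order(ITEM_PIPELINES):
        print(path)
```

With a single pipeline, as in this tutorial, the exact value does not matter; it only becomes relevant once a second pipeline is enabled.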

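0x03 Write the page parser

Open bigdataboy_spider.py.

The file contains a def parse(self, response): method. Its response argument is the response Scrapy received after fetching the page. It is a scrapy.http.response.html.HtmlResponse object, so data can be extracted from it with XPath or CSS selectors (response.xpath, response.css).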

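import scrapy

# Import the item model
from bigdataboy.items import BigdataboyItem

# parse is a method of the spider class generated by genspider
def parse(self, response):
    # Extract the article links with XPath (a list of href strings)
    articleUrl = response.xpath('//div[@class="item-desc"]/a/@href').extract()
    item = BigdataboyItem(url=articleUrl)  # pass the links on through the url field defined on the item
    yield item
    # Get the link to the next page
    nextUrl = response.xpath('//*[@id="pagenavi"]/div/ol//*[@class="next"]/a/@href').get()
    # print(nextUrl)
    if nextUrl:
        # Request the next page; with no callback given, Scrapy calls parse again.
        # response.urljoin also copes with a relative href.
        yield scrapy.Request(response.urljoin(nextUrl))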
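scrapy.Request needs an absolute URL, so the next-page href has to be resolved against the current page whenever the site emits a relative link. Scrapy's response.urljoin does this using the response's own URL as the base. The sketch below shows the same resolution with the standard library only, so it can be tried without a running spider; resolve_next_page and the /page/2/ path are assumptions made up for illustration, not taken from the site.

```python
from urllib.parse import urljoin


def resolve_next_page(page_url, href):
    """Turn a next-page href into an absolute URL, or None when there is no next page."""
    if not href:
        # The XPath .get() call returns None on the last page.
        return None
    # urljoin leaves absolute links untouched and resolves relative ones
    # against the URL of the page they were found on.
    return urljoin(page_url, href)


if __name__ == "__main__":
    print(resolve_next_page("https://bigdataboy.cn/", "/page/2/"))
    print(resolve_next_page("https://bigdataboy.cn/", "https://bigdataboy.cn/page/2/"))
    print(resolve_next_page("https://bigdataboy.cn/", None))
```

If the site already emits absolute links, the resolution step is a no-op, so it is safe to keep either way.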

0x04 Define the item model

Open items.py.

import scrapy

class BigdataboyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()  # custom field that holds the extracted links

0x05 Export the data

Open pipelines.py.

from scrapy.exporters import JsonItemExporter

class BigdataboyPipeline(object):
    def __init__(self):
        # Open the file in binary mode, because the exporter writes bytes
        self.openFile = open("url.json", "wb")
        # Pass the open file object to the exporter
        self.exporter = JsonItemExporter(self.openFile)
        # Get ready to export
        self.exporter.start_exporting()

    # Called when the spider starts
    def open_spider(self, spider):
        print("Spider started")

    def process_item(self, item, spider):
        # Export the item
        self.exporter.export_item(item)
        return item

    # Called when the spider finishes
    def close_spider(self, spider):
        # Finish the export
        self.exporter.finish_exporting()
        # Close the file
        self.openFile.close()
        print("Spider finished")

0x06 Run the spider

Run the command

Run it in PyCharm's Terminal:

scrapy crawl bigdataboy_spider

View the results

The exported links are written to url.json in the directory where the command was run.
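One way to inspect the output is to load it back with the standard json module. The pipeline above exports one item per crawled page, and each item's url field holds a list of links, so the file should contain a JSON array of objects shaped like {"url": [...]}. The helper below is a minimal sketch under that assumption; load_urls is a name made up for illustration.

```python
import json


def load_urls(path="url.json"):
    """Return every link stored in the exported file as one flat list."""
    with open(path, encoding="utf-8") as f:
        # Expected shape: [{"url": ["...", "..."]}, {"url": ["..."]}]
        items = json.load(f)
    # Flatten the per-page lists into a single list of links.
    return [link for item in items for link in item.get("url", [])]


if __name__ == "__main__":
    for link in load_urls():
        print(link)
```

Run it from the same directory as url.json after the crawl has finished, since the file is only valid JSON once the exporter has written its closing bracket.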


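Copyright notice: "A simple Scrapy spider example: crawling the links on this site's home page" is an original article by 明非. Please credit the source when reposting.
Last edited: 2019-11-3 13:11:31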
