Start Crawling
Create the spider. In the previous post we already created ImoocSpider; here we modify it so it can follow the next-page link and keep crawling. The ImoocSpider class under the scrapyDemo/spiders directory:
```python
# -*- coding: utf-8 -*-
import scrapy
from urllib import parse as urlparse

from scrapyDemo.ImoocCourseItem import ImoocCourseItem


# Spider for imooc.com
class ImoocSpider(scrapy.Spider):
    # The name defines how Scrapy locates (and initializes) the spider, so it must be unique
    name = "imooc"
    # List of start URLs
    start_urls = ['http://www.imooc.com/course/list']
    # URLs whose domain is not in this list will not be crawled
    allowed_domains = ['www.imooc.com']

    def parse(self, response):
        learn_nodes = response.css('a.course-card')
        # Iterate over every course card on this page
        for learn_node in learn_nodes:
            # Create a fresh item per course so concurrent requests don't share state
            item = ImoocCourseItem()
            course_url = learn_node.css("::attr(href)").extract_first()
            # Build the absolute URL of the course detail page
            course_url = urlparse.urljoin(response.url, course_url)
            # Course URL
            item['course_url'] = course_url
            # Course image
            item['image'] = learn_node.css(
                "img.course-banner::attr(src)").extract_first()
            # Follow the link to the course detail page
            yield scrapy.Request(
                url=course_url, callback=self.parse_learn, meta={'item': item})
        # URL of the next page ("下一页" is the next-page link text on the site)
        next_page_url = response.css(
            u'div.page a:contains("下一页")::attr(href)').extract_first()
        if next_page_url:
            yield scrapy.Request(
                url=urlparse.urljoin(response.url, next_page_url),
                callback=self.parse)

    def parse_learn(self, response):
        item = response.meta['item']
        # Course title
        item['title'] = response.xpath(
            '//h2[@class="l"]/text()').extract_first()
        # Course brief
        item['brief'] = response.xpath(
            '//div[@class="course-brief"]/p/text()').extract_first()
        yield item
```
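The spider relies on urljoin to turn the relative hrefs extracted from the course cards into absolute URLs before requesting them. A quick stdlib-only illustration (the `/learn/1014` path is made up for the example):

```python
from urllib import parse as urlparse

base = 'http://www.imooc.com/course/list'

# A relative href is resolved against the current page's URL
print(urlparse.urljoin(base, '/learn/1014'))
# → http://www.imooc.com/learn/1014

# An already-absolute href is returned unchanged
print(urlparse.urljoin(base, 'http://www.imooc.com/learn/1014'))
# → http://www.imooc.com/learn/1014
```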
This uses the ImoocCourseItem class from the scrapyDemo directory, which is described next.
The Item Data Container
Create ImoocCourseItem.py under the scrapyDemo directory. This class is the container that holds the scraped data; it defines fields for the title, image, brief, and URL. The ImoocCourseItem class under scrapyDemo:
```python
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class ImoocCourseItem(scrapy.Item):
    # Define the fields for your item here
    title = scrapy.Field()
    image = scrapy.Field()
    brief = scrapy.Field()
    course_url = scrapy.Field()
    # Filled in by DBHelper when the row is inserted
    created_at = scrapy.Field()
```
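A scrapy.Item behaves like a dict that only accepts its declared fields: assigning to an undeclared key raises a KeyError, which is why `created_at` must be declared before DBHelper can set it later. A stdlib-only sketch of that behavior (RestrictedItem is a hypothetical stand-in, not Scrapy's actual implementation):

```python
class RestrictedItem(dict):
    """Minimal stand-in for scrapy.Item: only declared fields may be set."""
    fields = {'title', 'image', 'brief', 'course_url', 'created_at'}

    def __setitem__(self, key, value):
        if key not in self.fields:
            # Mirrors scrapy.Item rejecting undeclared fields
            raise KeyError('%s does not support field: %s'
                           % (self.__class__.__name__, key))
        super().__setitem__(key, value)


item = RestrictedItem()
item['title'] = 'Intro to Python'   # declared field: accepted
try:
    item['cate'] = 'backend'        # undeclared field: rejected
except KeyError as e:
    print('rejected:', e)
```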
The Item Pipeline
The pipeline is where scraped items get processed. Add a ScrapydemoPipeline class to the pipelines.py file under the scrapyDemo directory:
```python
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapyDemo.db.dbhelper import DBHelper


class ScrapydemoPipeline(object):
    # Open the database connection pool
    def __init__(self):
        self.db = DBHelper()

    def process_item(self, item, spider):
        # Insert the item into the database
        self.db.insert(item)
        return item
```
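Scrapy passes every yielded item through each enabled pipeline's process_item in priority order, and whatever the method returns is handed to the next pipeline. A simplified stdlib-only sketch of that flow (the two pipeline classes here are hypothetical, for illustration only):

```python
class TimestampPipeline:
    """Hypothetical pipeline that enriches the item."""
    def process_item(self, item, spider):
        item['created_at'] = '2018-01-01 00:00:00'
        return item


class CapturePipeline:
    """Hypothetical pipeline that records what it receives."""
    def __init__(self):
        self.saved = []

    def process_item(self, item, spider):
        self.saved.append(dict(item))
        return item


# Lower priority numbers run earlier, mirroring the ITEM_PIPELINES setting
pipelines = [(300, CapturePipeline()), (100, TimestampPipeline())]
item = {'title': 'Intro to Python'}
for _, pipeline in sorted(pipelines, key=lambda pair: pair[0]):
    item = pipeline.process_item(item, spider=None)

print(item)  # the capture pipeline saw the created_at added upstream
```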
Don't forget to enable the pipeline in the configuration: in settings.py under the scrapyDemo directory, find ITEM_PIPELINES and change it to:
```python
ITEM_PIPELINES = {
    'scrapyDemo.pipelines.ScrapydemoPipeline': 300,
}
```
Database Operations
The pipeline uses a DBHelper class for the database work, so create a dbhelper.py module under the scrapyDemo/db directory, and remember to create an __init__.py there as well so the directory is importable as a package.
```python
# -*- coding: utf-8 -*-
import time

import pymysql
from twisted.enterprise import adbapi
from scrapy.utils.project import get_project_settings  # to read settings.py


class DBHelper():
    '''Reads the database configuration from settings; adjust as needed.'''

    def __init__(self):
        settings = get_project_settings()  # load the project settings
        dbparams = dict(
            host=settings['MYSQL_HOST'],  # values read from settings.py
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWD'],
            port=settings['MYSQL_PORT'],
            charset='utf8',  # set the charset, or Chinese text may be garbled
            cursorclass=pymysql.cursors.DictCursor,
            use_unicode=False,
        )
        # ** expands the dict into keyword arguments: host=xxx, db=yyy, ...
        dbpool = adbapi.ConnectionPool('pymysql', **dbparams)
        self.dbpool = dbpool

    def connect(self):
        return self.dbpool

    # Insert an item into the database
    def insert(self, item):
        sql = ("insert into tech_courses"
               "(title,image,brief,course_url,created_at) "
               "values(%s,%s,%s,%s,%s)")
        # Run the insert through the connection pool
        query = self.dbpool.runInteraction(self._conditional_insert, sql, item)
        # Attach the error handler
        query.addErrback(self._handle_error)
        return item

    # Executed inside a pool transaction
    def _conditional_insert(self, tx, sql, item):
        item['created_at'] = time.strftime('%Y-%m-%d %H:%M:%S',
                                           time.localtime(time.time()))
        params = (item["title"], item['image'], item['brief'],
                  item['course_url'], item['created_at'])
        tx.execute(sql, params)

    # Error handler
    def _handle_error(self, failure):
        print('-------------- database operation exception --------------')
        print(failure)
```
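_conditional_insert stamps each row with the current local time, formatted so MySQL can store it in a TIMESTAMP column. The format string can be checked in isolation:

```python
import time

# The same format string DBHelper uses for the created_at column
created_at = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
print(created_at)
# Always 19 characters in 'YYYY-MM-DD HH:MM:SS' form
assert len(created_at) == 19
```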
This code uses pymysql and adbapi. adbapi is Twisted's asynchronous database API, which provides a database connection pool; both dependencies can be installed with pip:
```shell
pip install pymysql
pip install Twisted
```
The code also calls get_project_settings, which reads the database configuration from settings.py. Append the database settings to the end of settings.py under the scrapyDemo directory:
```python
# MySQL database configuration
MYSQL_HOST = '192.168.6.1'
MYSQL_DBNAME = 'scrapy_demo_db'  # database name, change to yours
MYSQL_USER = 'root'              # database user, change to yours
MYSQL_PASSWD = 'abc-123'         # database password, change to yours
MYSQL_PORT = 3306                # database port, used in dbhelper
```
The table can be created with the following statement:
```sql
DROP TABLE IF EXISTS `tech_courses`;
CREATE TABLE `tech_courses` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `title` varchar(255) DEFAULT NULL,
  `image` varchar(255) DEFAULT NULL,
  `brief` varchar(255) DEFAULT NULL,
  `course_url` varchar(255) DEFAULT NULL,
  `created_at` timestamp NULL DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=0 DEFAULT CHARSET=utf8mb4;
```
All done!
Reposted from: http://nmuws.baihongyu.com/