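Scrapy爬虫框架.pdf (Scrapy crawler framework): by explaining the main components of the Scrapy framework in plain, detailed terms, it lets readers clearly ...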

File name: Scrapy爬虫框架.pdf

Category: Python

Development tool:

File size: 1014 KB

Downloads: 0

Upload date: 2019-07-01

Uploader: yan****


Detailed description: By explaining the main components of the Scrapy framework in plain, detailed terms, this document helps readers clearly understand Scrapy's overall workflow.

[Figure: Scrapy architecture diagram. Labelled parts: Scrapy Engine, Scheduler, Downloader, Spiders, Item Pipeline, Downloader Middlewares, Spider Middlewares, Internet. Labelled data flows: Requests, Responses, Items.]

Role of each module in Scrapy

· Engine: passes data and signals between the other modules (it does no data processing itself; it only moves messages around).
· Scheduler: implements a queue that holds the request objects sent over by the engine (puts requests into the queue).
· Downloader: sends the requests handed over by the engine, gets the responses, and passes them back to the engine (issues requests, obtains responses).
· Spider: handles the responses coming from the engine, extracts data, extracts URLs (builds them into new requests), and hands both back to the engine (takes the response and works on its content).
· Pipeline: processes the data passed on by the engine, for example storing it (saves the data in whatever form you need).
· Downloader middleware: a customizable download extension, for example for setting a proxy IP (customizes the requests you send).
· Spider middleware: can customize requests and filter responses (filters requests and responses).

III. How to use Scrapy

1. Install Scrapy:

    pip install scrapy

2. Create a Scrapy project:

    scrapy startproject myspider  # myspider is a project name of your choice

[Screenshot: the generated project in the IDE: myspider/ containing spiders/, __init__.py, items.py, middlewares.py, pipelines.py and settings.py, plus scrapy.cfg.]

3. Create a spider. In the project directory, run:

    scrapy genspider itcast itcast.cn

[Screenshot: the project tree now contains spiders/itcast.py.]

4. Write the spider by hand to extract the data (spiders/itcast.py):

    import scrapy
    # Import MyspiderItem
    from myspider.items import MyspiderItem

    class ItcastSpider(scrapy.Spider):
        # Spider name; it is needed to run the spider and must be unique
        name = 'itcast'
        # Domains the spider may crawl, so it does not wander onto other sites
        allowed_domains = ['itcast.cn']
        # List of start URLs; the spider starts crawling from these
        start_urls = ['http://www.itcast.cn/channel/teacher.shtml']

        # A spider must have a parse method
        def parse(self, response):
            # Data can be extracted from response directly with xpath()
            # names = response.xpath('//div[@class="li_txt"]/h3/text()')
            # print(names)

            # Group first: get the list of divs that hold the teacher information
            divs = response.xpath('//div[@class="li_txt"]')
            # Iterate over divs; each one is one teacher's information
            for div in divs:
                # Create a MyspiderItem object
                item = MyspiderItem()
                # extract_first() returns the first string in the list,
                # or None if the list is empty
                item['name'] = div.xpath('./h3/text()').extract_first()
                # (the title and desc assignments are not legible in the extracted text)
                yield item

5. Write pipelines.py and items.py by hand to process the data passed on by the engine.

pipelines.py:

    import json

    class MyspiderPipeline(object):
        def process_item(self, item, spider):
            return item

    class ItcastPipeline(object):
        def __init__(self):
            self.f = open("itcast_pipeline.json", "wb")

        # Runs once for every item yielded by the data-extraction method in the spider file
        def process_item(self, item, spider):
            # json.dumps converts a dict into a string
            content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
            self.f.write(content.encode("utf-8"))
            return item

        def close_spider(self, spider):
            self.f.close()

items.py:

    # -*- coding: utf-8 -*-
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html

    import scrapy

    class MyspiderItem(scrapy.Item):
        # define the fields for your item here like:
        name = scrapy.Field()
        title = scrapy.Field()
        desc = scrapy.Field()

6. Enable the pipeline in settings.py:

    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
        'myspider.pipelines.MyspiderPipeline': 300,
    }

In ITEM_PIPELINES, each key is the full class path (module.ClassName). The value is the priority: by convention an integer from 0 to 1000, and pipelines with smaller values run first. Only the pipelines listed here are run, so ItcastPipeline needs its own entry to take effect.

7. Run the spider. In the project directory, run:

    scrapy crawl itcast

· To write the output to a file in a specific format, add the -o option after the crawl command:

    # CSV output: comma-separated values, can be opened in Excel
    scrapy crawl itcast -o teachers.csv
    # XML output
    scrapy crawl itcast -o teachers.xml

Practice exercise: scrape the Heima (黑马) teaching staff (qualifications, name, title).
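The division of labour between engine, scheduler, downloader, spider and pipeline described above can be sketched as a toy model in plain Python. This is only an illustration under stated assumptions, not Scrapy's implementation or API: every name in it (FAKE_WEB, Scheduler, downloader, spider_parse, Pipeline, engine) and the example.test URLs are hypothetical, a request is reduced to a URL string, a dictionary stands in for the network, and both kinds of middleware are left out.

```python
from collections import deque

# Toy stand-ins for Scrapy's components. All names are hypothetical;
# none of this is Scrapy's real API.

# A fake "internet": URL -> page body, so no network access is needed.
FAKE_WEB = {
    "http://example.test/teachers": "Alice,Bob",
    "http://example.test/teachers?page=2": "Carol",
}


class Scheduler:
    """Queue of pending requests (a request is just a URL string here)."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def enqueue(self, url):
        if url not in self.seen:  # skip URLs that were already scheduled
            self.seen.add(url)
            self.queue.append(url)

    def next_request(self):
        return self.queue.popleft() if self.queue else None


def downloader(url):
    """'Send' the request and return a (url, body) response."""
    return url, FAKE_WEB.get(url, "")


def spider_parse(response):
    """Extract items (dicts) and follow-up URLs (strings) from a response."""
    url, body = response
    for name in body.split(","):
        if name:
            yield {"name": name}   # an item
    if "page=2" not in url:
        yield url + "?page=2"      # a new request


class Pipeline:
    """Receive each item; here it is simply kept in a list."""

    def __init__(self):
        self.stored = []

    def process_item(self, item):
        self.stored.append(item)
        return item


def engine(start_urls):
    """Move requests, responses and items between the other components."""
    scheduler, pipeline = Scheduler(), Pipeline()
    for url in start_urls:
        scheduler.enqueue(url)
    while True:
        url = scheduler.next_request()
        if url is None:            # queue is empty: crawl finished
            break
        response = downloader(url)
        for result in spider_parse(response):
            if isinstance(result, dict):
                pipeline.process_item(result)  # items go to the pipeline
            else:
                scheduler.enqueue(result)      # URLs go back to the scheduler
    return pipeline.stored


items = engine(["http://example.test/teachers"])
print(items)
```

Note that the engine function never inspects page content or stores anything itself; it only routes objects, which mirrors the "moves messages around" role in the component list. In real Scrapy these hand-offs are asynchronous and pass through the middleware layers.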

(Generated automatically by the system; the file contents can be previewed here before downloading.)