Main features of Jupyter Notebook
- Syntax highlighting, auto-indentation, and tab completion while writing code.
- Runs code directly in the browser and shows the output below each code cell.
- Supports Markdown for documentation and notes written alongside the code.
Starting Jupyter
jupyter notebook
Keyboard shortcuts
- Insert a cell above: a
- Insert a cell below: b
- Delete the cell: x
- Switch a code cell to markdown: m
- Switch a markdown cell to code: y
- Run the cell: shift+enter
- Show help/docstring: shift+tab
- Autocomplete: tab
The requests module
pip install requests
Purpose: simulates a browser issuing requests to a website.
Strengths: simple and efficient.
Predecessor: urllib.
Workflow for using requests:
1. Specify the URL
2. Send the request
3. Retrieve the response data
4. Persist the data
```python
# Scrape the Sogou homepage
import requests

# 1. Specify the URL
url = 'https://www.sogou.com/'
# 2. Send the request
response = requests.get(url=url)
# 3. Retrieve the response data (.text returns a string)
page_text = response.text
# 4. Persist the data
with open('./sogou.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)
print('over!')
```
- Handling GET request parameters
- Goal: a simple web-page collector
- Anti-scraping mechanism: User-Agent (UA) checking
- Countermeasure: UA spoofing
```python
import requests

wd = input('enter a word:')
url = 'https://www.sogou.com/web'
# Package the query parameters
param = {
    'query': wd
}
# UA spoofing
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
response = requests.get(url=url, params=param, headers=headers)
# Manually set the response encoding
response.encoding = 'utf-8'
page_text = response.text
fileName = wd + '.html'
with open(fileName, 'w', encoding='utf-8') as fp:
    fp.write(page_text)
print(fileName, 'scraped successfully!')
```
```python
# Scrape Baidu Translate's suggestion endpoint
import requests

url = 'https://fanyi.baidu.com/sug'
word = input('enter an English word:')
# Package the request parameters
data = {
    'kw': word
}
# UA spoofing
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
response = requests.post(url=url, data=data, headers=headers)
# .text returns a string; .json() returns a parsed Python object
obj_json = response.json()
print(obj_json)
```
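The printed object can be unpacked further. Assuming the sug endpoint returns JSON shaped like `{"errno": 0, "data": [{"k": ..., "v": ...}]}` (an assumption based on observed responses, not a documented API), the suggestion strings can be pulled out like this, shown here on a sample payload so it runs offline:

```python
import json

# Sample payload in the shape the sug endpoint is assumed to return
sample = '{"errno": 0, "data": [{"k": "dog", "v": "n. dog; hound"}, {"k": "dogma", "v": "n. dogma"}]}'

obj = json.loads(sample)
# Pull out just the suggestion string ("v") for each entry
suggestions = [item["v"] for item in obj.get("data", [])]
print(suggestions)
```

With a live response, replace `json.loads(sample)` with `response.json()`.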
```python
# Scrape KFC restaurant locations for an arbitrary city
# (the page loads this data dynamically via an AJAX POST)
import requests

city = input('enter a cityName:')
url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
data = {
    "cname": "",
    "pid": "",
    "keyword": city,
    "pageIndex": "2",
    "pageSize": "10",
}
# UA spoofing
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
response = requests.post(url=url, headers=headers, data=data)
json_text = response.text
print(json_text)
```
Scrape the locations of all KFC restaurants in Beijing (pages 1-8)
http://www.kfc.com.cn/kfccda/storelist/index.aspx
```python
import requests

city = input("enter a cityName:")
url = "http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword"
# UA spoofing (same headers for every page, so build them once)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
}
for i in range(1, 9):  # pages 1-8
    data = {
        "cname": "",
        "pid": "",
        "keyword": city,
        "pageIndex": i,
        "pageSize": "10",
    }
    response = requests.post(url=url, headers=headers, data=data)
    json_text = response.text
    print(json_text)
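Rather than printing each page, the results can be accumulated and persisted in one file. The sketch below assumes each page's JSON keeps its store list under a "Table1" key (an assumption; inspect a real response to confirm) and runs on sample data so it works offline:

```python
import json

# Two sample pages in the assumed response shape
pages = [
    {"Table1": [{"storeName": "Qianmen", "addressDetail": "Qianmen St."}]},
    {"Table1": [{"storeName": "Xidan", "addressDetail": "Xidan North St."}]},
]

# Flatten all pages into a single list of stores
stores = []
for page in pages:
    stores.extend(page.get("Table1", []))

# Persist as readable JSON
with open("kfc_stores.json", "w", encoding="utf-8") as fp:
    json.dump(stores, fp, ensure_ascii=False, indent=2)
print(len(stores), "stores saved")
```

In the live scraper, each `response.json()` from the page loop would be appended to `pages` before flattening.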
Scrape more movie detail data from Douban Movies
https://movie.douban.com/typerank?type_name=%E5%8A%A8%E4%BD%9C&type=5&interval_id=100:90&action=
```python
import requests

url = "https://movie.douban.com/j/chart/top_list"
# UA spoofing
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
}
# start advances by limit (20) so consecutive requests don't overlap
for start in range(0, 100, 20):
    # GET query parameters belong in params=, not data=
    params = {
        "type": "5",
        "interval_id": "100:90",
        "action": "",
        "start": start,
        "limit": "20",
    }
    response = requests.get(url=url, params=params, headers=headers)
    json_text = response.json()
    print(json_text)
```
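The start/limit pair is plain offset pagination: page n begins at n × limit, so stepping start by anything other than limit either skips or re-fetches entries. A quick sanity check of the offsets:

```python
limit = 20
# Offsets for the first five pages of an offset-paginated endpoint
starts = [page * limit for page in range(5)]
print(starts)  # → [0, 20, 40, 60, 80]
```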
```python
# Scrape license details from the NMPA (drug administration) portal
import requests

list_url = "http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList"
detail_url = "http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById"
# UA spoofing
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
}
for page in range(1, 329):  # pages 1-328
    data = {
        "on": "true",
        "page": page,
        "pageSize": "15",
        "productName": "",
        "conditionType": "1",
        "applyname": "",
        "applysn": ""
    }
    response = requests.post(url=list_url, data=data, headers=headers)
    page_json = response.json()
    # Each list entry carries an ID; the detail data itself is served by
    # getXkzsById, so the dzpz.jsp detail page doesn't need to be fetched
    for company in page_json["list"]:
        detail_data = {
            "id": company["ID"]
        }
        detail_response = requests.post(url=detail_url, data=detail_data, headers=headers)
        print(detail_response.json())
```