Beautiful Soup
https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/ Beautiful Soup 4.2.0 documentation
http://www.imooc.com/learn/712 Video course: Python meets data collection
https://segmentfault.com/a/1190000005182997 How to use PyQuery
import bs4
print(bs4.__version__)  # currently 4.5.3, 2017-4-6
Installing the third-party libraries
C:\Python3\scripts\> pip install bs4 (installs the third-party library bs4, i.e. BeautifulSoup)
C:\Python3\scripts\> pip install html5lib (installs the third-party library html5lib, an HTML5 parser that BeautifulSoup can use)
Open the local file zzzzz.html and parse it with BeautifulSoup
from urllib import request
from bs4 import BeautifulSoup
import html5lib  # HTML5 parser

url = 'file:///C:/Python3/zz/zzzzz.html'
resp = request.urlopen(url)
html_doc = resp.read()
# parse the page with BeautifulSoup; 'lxml' is the parser, alternatives include 'html.parser', 'xml' and 'html5lib'
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.prettify())  # print the tree with standard indentation
print(soup.title)                          # the <title> tag
print(soup.title.string)                   # the text of the <title> tag
print(soup.find(id="div111"))              # find by id
print(soup.find(id="div111").get_text())   # all the text inside the tag
print(soup.find("p", {"class": "p444"}))   # find the <p class="p444"> tag (type: bs4.element.Tag)
print(soup.select('.p444'))                # CSS selector! (type: list)
for tag1 in soup.select('.p444'):
    print(tag1.string)
print(soup.select('.div2 .p222'))          # CSS selector
print(soup.findAll("a"))                   # all <a> tags
for link in soup.findAll("a"):
    print(link.get("href"))
    print(link.string)
Using regular expressions
import re

data = soup.findAll("a", href=re.compile(r"baidu\.com"))
for tag22 in data:
    print(tag22.get("href"))
Exercise 1: parse a web page
The encoding/decoding problems on Win7 defeated me, so I fell back to a page in standard HTML5. As practice, I took Liao Xuefeng's Python tutorial page and scraped the table of contents on the left.
# -*- coding: utf-8 -*-
from urllib import request
from bs4 import BeautifulSoup
import html5lib  # HTML5 parser

url = "http://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000"
resp = request.urlopen(url)
html_doc = resp.read()
# parse with BeautifulSoup; 'html.parser' here, alternatives include 'lxml', 'xml' and 'html5lib'
soup = BeautifulSoup(html_doc, 'html.parser')
#soup = BeautifulSoup(html_doc, 'lxml')
#print(soup.prettify())  # print the tree with standard indentation

f = open("c:\\Python3\\zz\\0.txt", "w+")
for tag1 in soup.select('.x-sidebar-left-content li a'):
    #ss = tag1.get_text()
    ss = tag1.string
    ss2 = tag1.get("href")
    print(ss, " --- ", "http://www.liaoxuefeng.com", ss2)
    f.writelines(ss + " --- http://www.liaoxuefeng.com" + ss2 + "\n")  # write one line per entry
f.close()
2017-10-18:
http://www.cnblogs.com/zhaof/p/6930955.html Notes on the available parsers (Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers; if no third-party parser is installed, it falls back to Python's default parser. The lxml parser is more powerful and faster, so installing it is recommended.)
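A quick way to see how the parsers differ is to feed each one the same broken fragment; a minimal sketch (the fragment is made up for illustration):

from bs4 import BeautifulSoup

broken = "<a><p>text"  # deliberately unclosed tags
for parser in ("html.parser", "lxml", "html5lib"):
    soup = BeautifulSoup(broken, parser)
    print(parser, "->", soup)
# each parser repairs invalid markup differently; html5lib is the most
# lenient (it builds the same tree a browser would), lxml is the fastest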
Scrape http://www.bootcdn.cn, build a dict of {package name: star count}, and save it to a text file. (An idea: scrape it periodically, say every 3 months, and compare against the previous dict to see which projects' stars are rising fastest, i.e. which are drawing the most attention; a sketch of the comparison follows the code below.)
# python 3.6.0
import requests  # 2.18.4
import bs4       # 4.6.0
import html5lib

url = "http://www.bootcdn.cn/"
#url = "http://www.bootcdn.cn/all/"
headers = {'User-Agent': 'mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/61.0.3163.100 safari/537.36'}
r = requests.get(url, headers=headers)
#r = requests.get(url)
print(r.encoding)     # the response encoding
print(r.status_code)  # the HTTP status code
# 'lxml' is the parser; alternatives include 'html.parser', 'xml' and 'html5lib'
soup = bs4.BeautifulSoup(r.content.decode("utf-8"), "lxml")
#soup = bs4.BeautifulSoup(r.content, "html5lib")
#aa = soup.decode("UTF-8", "ignore")
#print(soup.prettify())  # print the tree with standard indentation

# parse the rows into a dict
element = soup.select('.packages-list-container .row')
starsList = {}
for item in element:
    # print(item.select("h4.package-name"))
    # print(item.select(".package-extra-info span"))
    # print(item.h4.text)
    # print(item.span.text)
    starsList[item.h4.text] = item.span.text
print(starsList)

# append the dict to a text file with a timestamp
import time
from datetime import datetime
f = None  # so the finally clause is safe even if open() fails
try:
    f = open('1.txt', 'a+')
    t2 = datetime.fromtimestamp(float(time.time()))
    f.write('\n' + str(t2))
    f.write('\n' + str(starsList))
finally:
    if f:
        f.close()
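Following the idea above, a minimal sketch of the dict comparison (the two snapshots are hypothetical data; it assumes the star counts are digit strings with optional thousands separators):

# two snapshots of starsList taken ~3 months apart (hypothetical data)
old = {'jquery': '39,000', 'vue': '58,000'}
new = {'jquery': '39,500', 'vue': '64,000', 'react': '78,000'}

def stars(s):
    return int(s.replace(',', ''))  # "58,000" -> 58000

# delta for every package present in both snapshots
deltas = {name: stars(new[name]) - stars(old[name])
          for name in new if name in old}
for name, delta in sorted(deltas.items(), key=lambda kv: -kv[1]):
    print(name, '+%d' % delta)  # fastest-rising packages first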
Scrape Liao Xuefeng's Python tutorial. (First parse the table of contents on the left with bs4, collect the links into a dict, save it to a text file, then fetch the pages.) There are 123 entries in total, but I only got 28 files down.
import requests
import bs4
import urllib.request

url = "http://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000"
#r = requests.get(url)  # without a User-Agent header the site refuses the request
headers = {'User-Agent': 'mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/61.0.3163.100 safari/537.36'}
r = requests.get(url, headers=headers)
soup = bs4.BeautifulSoup(r.content.decode("utf-8"), "lxml")

# build the link dict and save it to a text file
f = open("c:\\Python3\\zz\\liaoxuefeng\\a.txt", "w+")
mylist = soup.select('#x-wiki-index .x-wiki-index-item')
myhrefdict = {}
for item in mylist:
    myhrefdict[item.text] = "https://www.liaoxuefeng.com" + item["href"]
    #print(item.text, item["href"])  # item.text / tag1.string, item["href"] / item.get("href")
    #f.writelines(item.text + " --- http://www.liaoxuefeng.com" + item["href"] + "\n")
f.write(str(myhrefdict))
f.close()

# fetch the pages
i = 0
for key, val in myhrefdict.items():
    i += 1
    name = str(i) + '_' + key + '.html'
    link = val
    print(link, name)
    urllib.request.urlretrieve(link, 'liaoxuefeng\\' + name)  # the folder must be created beforehand
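One likely reason only 28 of 123 files came down: some entry titles contain characters that Windows forbids in file names (\ / : * ? " < > |), so urlretrieve fails partway through the loop. A sketch of a fix; the sanitize helper is my own addition, not part of the original script:

import re

def sanitize(name):
    # replace characters Windows forbids in file names with '_'
    return re.sub(r'[\\/:*?"<>|]', '_', name)

i = 0
for key, val in myhrefdict.items():
    i += 1
    name = str(i) + '_' + sanitize(key) + '.html'
    urllib.request.urlretrieve(val, 'liaoxuefeng\\' + name)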
The Requests library: 2017-10-30
http://www.python-requests.org/en/master/api/ Requests API documentation
http://www.cnblogs.com/yan-lei/p/7445460.html Python web crawling and information extraction
requests.request() constructs a request; the base method underpinning all the methods below (a short usage sketch follows this list)
requests.get(url, params=None, **kwargs) the main method for fetching an HTML page, corresponding to HTTP GET
requests.head(url, **kwargs) fetches an HTML page's headers, corresponding to HTTP HEAD
requests.post(url, data=None, json=None, **kwargs) submits a POST request to an HTML page, corresponding to HTTP POST
requests.put(url, data=None, **kwargs) submits a PUT request to an HTML page, corresponding to HTTP PUT
requests.patch(url, data=None, **kwargs) submits a partial-modification request, corresponding to HTTP PATCH
requests.delete(url, **kwargs) submits a delete request to an HTML page, corresponding to HTTP DELETE
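A minimal sketch exercising a few of these methods (http://httpbin.org is a public echo service, handy for testing):

import requests

r = requests.get('http://httpbin.org/get', params={'q': 'bs4'})  # GET with a query string
print(r.status_code, r.url)       # 200 http://httpbin.org/get?q=bs4

r = requests.head('http://httpbin.org/get')
print(r.headers['Content-Type'])  # headers only, no body is fetched

r = requests.post('http://httpbin.org/post', data={'name': 'zzz'})  # form-encoded POST
print(r.json()['form'])           # {'name': 'zzz'}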
Proxies: 2018-2-5
import requests

proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}
requests.get("http://aaa.com", proxies=proxies)
https://www.v2ex.com/t/364904#reply0 A hands-on data-collection exercise (concise version)
http://www.xicidaili.com/nn/ Free high-anonymity proxies
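Free proxies from lists like the one above die quickly, so it is worth checking each one before use. A minimal sketch of such a check (http://httpbin.org/ip simply echoes the caller's IP; the address below is the placeholder from the example above):

import requests

def proxy_alive(proxy):
    # returns True if the proxy answers within 5 seconds
    try:
        r = requests.get('http://httpbin.org/ip',
                         proxies={'http': proxy}, timeout=5)
        return r.status_code == 200
    except requests.RequestException:
        return False

print(proxy_alive('http://10.10.1.10:3128'))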
...