Beautiful Soup
https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/ Beautiful Soup 4.2.0 documentation
http://www.imooc.com/learn/712 Video course: Python meets data collection
https://segmentfault.com/a/1190000005182997 How to use PyQuery
import bs4
print(bs4.__version__)  # currently 4.5.3, 2017-4-6
Installing the third-party libraries
C:\Python3\scripts\> pip install bs4 (installs the third-party library bs4, i.e. BeautifulSoup)
C:\Python3\scripts\> pip install html5lib (installs the third-party library html5lib, an HTML5 parser that BeautifulSoup can use)
Open the local file zzzzz.html and parse it with BeautifulSoup
from urllib import request
from bs4 import BeautifulSoup
import html5lib  # HTML5 parser

url = 'file:///C:/Python3/zz/zzzzz.html'
resp = request.urlopen(url)
html_doc = resp.read()
# parse the page with BeautifulSoup; 'lxml' is the parser, alternatives include 'html.parser', 'xml' and 'html5lib'
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.prettify())  # print the tree with standard indentation
print(soup.title)                          # the <title> tag
print(soup.title.string)                   # the text of the <title> tag
print(soup.find(id="div111"))              # find by id
print(soup.find(id="div111").get_text())   # all the text inside the tag
print(soup.find("p", {"class": "p444"}))   # find the <p class="p444"> tag (type: bs4.element.Tag)
print(soup.select('.p444'))                # CSS selector! (type: list)
for tag1 in soup.select('.p444'):
    print(tag1.string)
print(soup.select('.div2 .p222'))          # CSS selector
print(soup.findAll("a"))                   # all <a> tags
for link in soup.findAll("a"):
    print(link.get("href"))
    print(link.string)
Using regular expressions
import re

data = soup.findAll("a", href=re.compile(r"baidu\.com"))
for tag22 in data:
    print(tag22.get("href"))
Exercise 1: parse a web page
The encoding/decoding problems on Win7 defeated me, so I fell back to a page in standard HTML5. As practice, I took Liao Xuefeng's Python tutorial page and scraped the table of contents on the left.
# -*- coding: utf-8 -*-
from urllib import request
from bs4 import BeautifulSoup
import html5lib  # HTML5 parser

url = "http://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000"
resp = request.urlopen(url)
html_doc = resp.read()
# parse with BeautifulSoup; 'html.parser' here, alternatives include 'lxml', 'xml' and 'html5lib'
soup = BeautifulSoup(html_doc, 'html.parser')
#soup = BeautifulSoup(html_doc, 'lxml')
#print(soup.prettify())  # print the tree with standard indentation

f = open("c:\\Python3\\zz\\0.txt", "w+")
for tag1 in soup.select('.x-sidebar-left-content li a'):
    #ss = tag1.get_text()
    ss = tag1.string
    ss2 = tag1.get("href")
    print(ss, " --- ", "http://www.liaoxuefeng.com", ss2)
    f.writelines(ss + " --- http://www.liaoxuefeng.com" + ss2 + "\n")  # write one line per entry
f.close()
2017-10-18:
http://www.cnblogs.com/zhaof/p/6930955.html Notes on the available parsers (Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers; if no third-party parser is installed, it falls back to Python's default parser. The lxml parser is more powerful and faster, so installing it is recommended.)
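A quick way to see how the parsers differ is to feed each one the same broken fragment; a minimal sketch (the fragment is made up for illustration):

from bs4 import BeautifulSoup

broken = "<a><p>text"  # deliberately unclosed tags
for parser in ("html.parser", "lxml", "html5lib"):
    soup = BeautifulSoup(broken, parser)
    print(parser, "->", soup)
# each parser repairs invalid markup differently; html5lib is the most
# lenient (it builds the same tree a browser would), lxml is the fastest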
Scrape http://www.bootcdn.cn, build a dict of {package name: star count}, and save it to a text file. (An idea: scrape it periodically, say every 3 months, and compare against the previous dict to see which projects' stars are rising fastest, i.e. which are drawing the most attention; a sketch of the comparison follows the code below.)
# python 3.6.0
import requests  # 2.18.4
import bs4       # 4.6.0
import html5lib

url = "http://www.bootcdn.cn/"
#url = "http://www.bootcdn.cn/all/"
headers = {'User-Agent': 'mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/61.0.3163.100 safari/537.36'}
r = requests.get(url, headers=headers)
#r = requests.get(url)
print(r.encoding)     # the response encoding
print(r.status_code)  # the HTTP status code
# 'lxml' is the parser; alternatives include 'html.parser', 'xml' and 'html5lib'
soup = bs4.BeautifulSoup(r.content.decode("utf-8"), "lxml")
#soup = bs4.BeautifulSoup(r.content, "html5lib")
#aa = soup.decode("UTF-8", "ignore")
#print(soup.prettify())  # print the tree with standard indentation

# parse the rows into a dict
element = soup.select('.packages-list-container .row')
starsList = {}
for item in element:
    # print(item.select("h4.package-name"))
    # print(item.select(".package-extra-info span"))
    # print(item.h4.text)
    # print(item.span.text)
    starsList[item.h4.text] = item.span.text
print(starsList)

# append the dict to a text file with a timestamp
import time
from datetime import datetime
f = None  # so the finally clause is safe even if open() fails
try:
    f = open('1.txt', 'a+')
    t2 = datetime.fromtimestamp(float(time.time()))
    f.write('\n' + str(t2))
    f.write('\n' + str(starsList))
finally:
    if f:
        f.close()
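Following the idea above, a minimal sketch of the dict comparison (the two snapshots are hypothetical data; it assumes the star counts are digit strings with optional thousands separators):

# two snapshots of starsList taken ~3 months apart (hypothetical data)
old = {'jquery': '39,000', 'vue': '58,000'}
new = {'jquery': '39,500', 'vue': '64,000', 'react': '78,000'}

def stars(s):
    return int(s.replace(',', ''))  # "58,000" -> 58000

# delta for every package present in both snapshots
deltas = {name: stars(new[name]) - stars(old[name])
          for name in new if name in old}
for name, delta in sorted(deltas.items(), key=lambda kv: -kv[1]):
    print(name, '+%d' % delta)  # fastest-rising packages first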
Scrape Liao Xuefeng's Python tutorial. (First parse the table of contents on the left with bs4, collect the links into a dict, save it to a text file, then fetch the pages.) There are 123 entries in total, but I only got 28 files down.
import requests
import bs4
import urllib.request

url = "http://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000"
#r = requests.get(url)  # without a User-Agent header the site refuses the request
headers = {'User-Agent': 'mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/61.0.3163.100 safari/537.36'}
r = requests.get(url, headers=headers)
soup = bs4.BeautifulSoup(r.content.decode("utf-8"), "lxml")

# build the link dict and save it to a text file
f = open("c:\\Python3\\zz\\liaoxuefeng\\a.txt", "w+")
mylist = soup.select('#x-wiki-index .x-wiki-index-item')
myhrefdict = {}
for item in mylist:
    myhrefdict[item.text] = "https://www.liaoxuefeng.com" + item["href"]
    #print(item.text, item["href"])  # item.text / tag1.string, item["href"] / item.get("href")
    #f.writelines(item.text + " --- http://www.liaoxuefeng.com" + item["href"] + "\n")
f.write(str(myhrefdict))
f.close()

# fetch the pages
i = 0
for key, val in myhrefdict.items():
    i += 1
    name = str(i) + '_' + key + '.html'
    link = val
    print(link, name)
    urllib.request.urlretrieve(link, 'liaoxuefeng\\' + name)  # the folder must be created beforehand
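One likely reason only 28 of 123 files came down: some entry titles contain characters that Windows forbids in file names (\ / : * ? " < > |), so urlretrieve fails partway through the loop. A sketch of a fix; the sanitize helper is my own addition, not part of the original script:

import re

def sanitize(name):
    # replace characters Windows forbids in file names with '_'
    return re.sub(r'[\\/:*?"<>|]', '_', name)

i = 0
for key, val in myhrefdict.items():
    i += 1
    name = str(i) + '_' + sanitize(key) + '.html'
    urllib.request.urlretrieve(val, 'liaoxuefeng\\' + name)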
The Requests library: 2017-10-30
http://www.python-requests.org/en/master/api/ Requests API documentation
http://www.cnblogs.com/yan-lei/p/7445460.html Python web crawling and information extraction
requests.request() constructs a request; the base method underpinning all the methods below (a short usage sketch follows this list)
requests.get(url, params=None, **kwargs) the main method for fetching an HTML page, corresponding to HTTP GET
requests.head(url, **kwargs) fetches an HTML page's headers, corresponding to HTTP HEAD
requests.post(url, data=None, json=None, **kwargs) submits a POST request to an HTML page, corresponding to HTTP POST
requests.put(url, data=None, **kwargs) submits a PUT request to an HTML page, corresponding to HTTP PUT
requests.patch(url, data=None, **kwargs) submits a partial-modification request, corresponding to HTTP PATCH
requests.delete(url, **kwargs) submits a delete request to an HTML page, corresponding to HTTP DELETE
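A minimal sketch exercising a few of these methods (http://httpbin.org is a public echo service, handy for testing):

import requests

r = requests.get('http://httpbin.org/get', params={'q': 'bs4'})  # GET with a query string
print(r.status_code, r.url)       # 200 http://httpbin.org/get?q=bs4

r = requests.head('http://httpbin.org/get')
print(r.headers['Content-Type'])  # headers only, no body is fetched

r = requests.post('http://httpbin.org/post', data={'name': 'zzz'})  # form-encoded POST
print(r.json()['form'])           # {'name': 'zzz'}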
Proxies: 2018-2-5
import requests

proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}
requests.get("http://aaa.com", proxies=proxies)
https://www.v2ex.com/t/364904#reply0 A hands-on data-collection exercise (concise version)
http://www.xicidaili.com/nn/ Free high-anonymity proxies
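Free proxies from lists like the one above die quickly, so it is worth checking each one before use. A minimal sketch of such a check (http://httpbin.org/ip simply echoes the caller's IP; the address below is the placeholder from the example above):

import requests

def proxy_alive(proxy):
    # returns True if the proxy answers within 5 seconds
    try:
        r = requests.get('http://httpbin.org/ip',
                         proxies={'http': proxy}, timeout=5)
        return r.status_code == 200
    except requests.RequestException:
        return False

print(proxy_alive('http://10.10.1.10:3128'))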
...