Python Web Scraping with urllib, urllib2, lxml, and Selenium + PhantomJS

Reposted article, 2019-07-09 17:19

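1. I have recently been learning about web crawlers. To be honest, I have not written many, and I have barely used the crawler libraries available in Java. While learning Python I came to appreciate how powerful Python crawlers are and how concise the code is. This article starts from the basics and focuses on the general development approach. The scrapy framework will be covered later; this article is only an introduction to crawling with Python.

2. urllib and urllib2. Both modules handle URL requests, so we start by using them directly. Here is an example: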

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import urllib
import urllib2

# URL to fetch
url = "http://www.baidu.com"
# Browser User-Agent to imitate
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}
# Form data
form_data = {
        "start": "0",
        "end": "20"
    }
# URL-encode the form data
data = urllib.urlencode(form_data)
# Build the request
request = urllib2.Request(url, data=data, headers=headers)
# Send the request and read the response
html = urllib2.urlopen(request).read()
print html

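Note: the code gets its result by imitating a browser. The form_data here only imitates the payload of an AJAX request and serves no real purpose for this URL.

Note: if urllib2.Request is given data, the request is sent as POST; otherwise it is sent as GET.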
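
The GET/POST rule can be checked without sending anything over the network, by building the Request objects and asking which method they would use. The following is a minimal offline sketch. The try/except import is an assumption added so that it also runs on Python 3, where urllib2 was merged into urllib.request; the article's own code targets Python 2.

```python
try:  # Python 2, as used in this article
    from urllib import urlencode
    import urllib2
except ImportError:  # Python 3: urllib2 became urllib.request
    from urllib.parse import urlencode
    import urllib.request as urllib2

url = "http://www.baidu.com"
form_data = {"start": "0", "end": "20"}

# urlencode() turns the dict into "key=value&key=value" form data
encoded = urlencode(form_data)
# Python 3 requires the request body to be bytes
body = encoded.encode("ascii")

get_request = urllib2.Request(url)              # no data  -> GET
post_request = urllib2.Request(url, data=body)  # has data -> POST

print(get_request.get_method())   # GET
print(post_request.get_method())  # POST
```

Nothing is sent until urlopen() is called, so this is a safe way to inspect a request before running it against a real site.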

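3. Now that we have the response, the next step is parsing it. lxml is one common choice. Other approaches exist, such as regular expressions with re, but they are not covered in detail here.

1) Before turning to parsing, a few words about urllib2 handlers:

a. Kinds of handler: there are many, for example cookie, proxy, auth, and http handlers.

b. Why use handlers: the cookie handler keeps session state across requests; the proxy handler routes requests through a proxy IP, which helps when your own IP has been blocked; the auth handlers deal with authentication; the http handlers deal with the HTTP protocol itself.

c. Handlers are used all the time, and they work well for both sessions and proxies: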

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import urllib
import urllib2
import cookielib

# Create a CookieJar object to store cookie values
cookie = cookielib.CookieJar()
# Create an HTTPCookieProcessor handler to manage cookies;
# its argument is the CookieJar created above
cookie_handler = urllib2.HTTPCookieProcessor(cookie)
# Proxy handler
# httpproxy_handler = urllib2.ProxyHandler({"http" : "ip:port"})
# Proxy handler with credentials
# authproxy_handler = urllib2.ProxyHandler({"http" : "username:password@ip:port"})
# Authentication
# passwordMgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
# passwordMgr.add_password(None, ip, username, password)
# httpauth_handler = urllib2.HTTPBasicAuthHandler(passwordMgr)
# proxyauth_handler = urllib2.ProxyBasicAuthHandler(passwordMgr)
# opener = urllib2.build_opener(httpauth_handler, proxyauth_handler)
# Build a custom opener
opener = urllib2.build_opener(cookie_handler)
# Extra HTTP headers can be set through the opener's addheaders attribute
opener.addheaders = [("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36")]
# Login endpoint of renren.com
url = "http://www.renren.com/PLogin.do"
# Account credentials for the login (replace the placeholders)
data = {"email":"your-email", "password":"your-password"}
# URL-encode the form data
data = urllib.urlencode(data)
# The first request is a POST that sends the login parameters to obtain the cookie
request = urllib2.Request(url, data = data)
# Send the POST; on a successful login the cookie is stored in the jar
response = opener.open(request)
#print response.read()
# The second request can be a GET; the opener sends the stored cookie along,
# and the server accepts the request once it validates the cookie
response_deng = opener.open("http://www.renren.com/410043129/profile")
# Read a page that is only accessible after logging in
html = response_deng.read()
print html
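
The proxy handler that is commented out in the listing can be combined with the cookie handler in a single opener. The sketch below only builds the opener and inspects the cookie jar, so it runs without network access. The proxy address 127.0.0.1:8080 is a hypothetical placeholder, and the try/except import is an assumption added so the code also runs on Python 3.

```python
try:  # Python 2, as used in this article
    import urllib2
    import cookielib
except ImportError:  # Python 3 equivalents
    import urllib.request as urllib2
    import http.cookiejar as cookielib

# Hypothetical proxy address, for illustration only
proxy_handler = urllib2.ProxyHandler({"http": "127.0.0.1:8080"})
cookie_jar = cookielib.CookieJar()
cookie_handler = urllib2.HTTPCookieProcessor(cookie_jar)

# build_opener() accepts any number of handlers and chains them
opener = urllib2.build_opener(proxy_handler, cookie_handler)

# Nothing has been requested yet, so the jar is still empty
print(len(cookie_jar))

# After opener.open(...) has run, iterating the jar lists the stored cookies
for stored in cookie_jar:
    print("%s=%s" % (stored.name, stored.value))
```

Printing the jar after the login POST is a simple way to tell whether the server actually set a session cookie.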

2) Parsing (lxml):

#!/usr/bin/python
# -*- coding: UTF-8 -*-
import urllib2
from lxml import etree

if __name__ == '__main__':
    # url = raw_input("Enter the URL of the page to scrape images from: ")
    url = "https://www.baidu.com"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"}
    # Fetch the page, sending the browser-like headers
    request = urllib2.Request(url, headers=headers)
    html = urllib2.urlopen(request).read()
    # Parse the page with lxml's etree
    content = etree.HTML(html)
    # Select the src of every img element with XPath
    img_list = content.xpath("//img/@src")
    # Download each image
    for link in img_list:
        try:
            # Protocol-relative links ("//host/path") need a scheme
            if link.startswith("//"):
                link = "https:" + link
            img = urllib2.urlopen(link).read()
            # Use the last path segment as the file name
            file_name = link.split("/")[-1]
            with open(file_name, "wb") as f:
                f.write(img)
        except Exception as err:
            print err

XPath syntax reference: http://www.w3school.com.cn/xpath/xpath_syntax.asp

3) An alternative parser (BeautifulSoup):

#!/usr/bin/python
# -*- coding: UTF-8 -*-
import urllib2

from bs4 import BeautifulSoup

if __name__ == '__main__':
    url = "https://tieba.baidu.com/index.html"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"}
    # Fetch the page, sending the browser-like headers
    request = urllib2.Request(url, headers=headers)
    html = urllib2.urlopen(request).read()
    # Parse with BeautifulSoup, using lxml as the underlying parser
    bs = BeautifulSoup(html, "lxml")
    img_list = bs.find_all("img", attrs={"class": ""})
    # Download each image
    for img in img_list:
        try:
            link = img.get("src")
            # Protocol-relative links ("//host/path") need a scheme
            if link.startswith("//"):
                link = "https:" + link
            img_data = urllib2.urlopen(link).read()
            # Use the last path segment as the file name
            file_name = link.split("/")[-1]
            with open(file_name, "wb") as f:
                f.write(img_data)
        except Exception as err:
            print err
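
The src values found on a page may be protocol-relative (//host/path), site-relative (/path), or already absolute. One way to handle all three is the standard library's urljoin, sketched below together with a file name taken from the URL path. The helper name resolve_image and the sample src values are hypothetical, and the try/except import is an assumption added so the code also runs on Python 3.

```python
try:  # Python 2, as used in this article
    from urlparse import urljoin, urlparse
except ImportError:  # Python 3
    from urllib.parse import urljoin, urlparse
import posixpath


def resolve_image(page_url, src):
    """Return (absolute_url, file_name) for an img src found on page_url."""
    absolute = urljoin(page_url, src)
    # Use only the path part so a query string does not end up in the name
    file_name = posixpath.basename(urlparse(absolute).path)
    return absolute, file_name


page = "https://tieba.baidu.com/index.html"
# Hypothetical src values covering the three common forms
samples = ["//example.com/a/logo.png", "/static/pic.jpg?v=2", "http://example.com/b.gif"]
for src in samples:
    print(resolve_image(page, src))
```

The returned URL can be passed to urllib2.urlopen() and the file name to open(), exactly as in the download loops above.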

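4. Selenium + PhantomJS. Everything above crawls static HTML, but most sites today load their data dynamically through AJAX. That creates a problem: the JavaScript has to finish running before the page can be scraped. Frameworks such as Selenium exist to handle exactly this kind of crawling.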

from selenium import webdriver
from bs4 import BeautifulSoup

if __name__ == '__main__':
    # Use the Chrome browser
    driver = webdriver.Chrome()
    # Load the page
    driver.get("https://www.douyu.com/directory/all")
    while True:
        # Parse the rendered page source
        bs = BeautifulSoup(driver.page_source, "lxml")
        # Find the stream titles
        title_list = bs.find_all("h3", attrs={"class": "DyListCover-intro"})
        # Find the stream popularity values
        hot_list = bs.find_all("span", attrs={"class": "DyListCover-hot"})
        # Walk through titles and popularity values in pairs
        for title, hot in zip(title_list, hot_list):
            print title.text, hot.text
        # Stop when there is no next-page button; otherwise click it
        if driver.page_source.find("dy-Pagination-next") == -1:
            break
        driver.find_element_by_class_name("dy-Pagination-next").click()
    # Quit the browser
    driver.quit()

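Note: I use Chrome here because PhantomJS has been abandoned and is no longer recommended.

chromedriver.exe is downloaded from Google's site, which requires getting past the firewall. I used version 75, which is also available from a mirror.

Taobao mirror: https://npm.taobao.org/mirrors/chromedriver/

Below is one way to handle images that are lazy-loaded: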

import urllib2

import time
from selenium import webdriver

if __name__ == '__main__':
    # Use the Chrome driver; chromedriver.exe must be available (same directory as this file)
    # Keep the browser window full-screen while this runs
    driver = webdriver.Chrome()
    # Page to crawl
    driver.get("https://www.douyu.com/directory/all")
    while True:
        # Wait for the page to load after moving to the next page
        time.sleep(2)
        # Scroll down step by step (tune these values for the page)
        for i in xrange(1, 11):
            js = "document.documentElement.scrollTop=%d" % (i * 1000)
            driver.execute_script(js)
            # Wait for the newly visible images to load
            time.sleep(1)
        # Select the images with XPath
        img_list = driver.find_elements_by_xpath("//div[@class='LazyLoad is-visible DyImg DyListCover-pic']/img")
        # Download each image
        for img in img_list:
            try:
                link = img.get_attribute("src")
                # The second-to-last path segment serves as the file name
                file_name = link.split("/")[-2]
                print file_name
                # Download the image and write it to disk (the img/ directory must already exist)
                img_data = urllib2.urlopen(link).read()
                with open("img/" + file_name, "wb") as f:
                    f.write(img_data)
            except Exception as err:
                print err
        # Stop when there is no next-page button
        if driver.page_source.find("dy-Pagination-next") == -1:
            break
        # Otherwise go to the next page
        driver.find_element_by_class_name("dy-Pagination-next").click()
    # Quit the browser and end the driver session
    driver.quit()

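Note: this crawler only targets the images on Douyu; other pages will need their own adjustments. That covers the basics of handling JavaScript-rendered pages and downloading their images.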
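
The fixed time.sleep() pauses in the listing wait the same amount whether the page is ready or not. An alternative, which is not what the listing does, is to poll a condition until it holds or a timeout expires. The sketch below shows the idea in plain Python; the helper wait_until and the stand-in fake_page_ready are hypothetical names, used so the example runs without a browser.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Poll predicate() until it returns a truthy value or timeout seconds pass.

    Returns the truthy value, or None if the timeout expires first.
    """
    deadline = time.time() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.time() >= deadline:
            return None
        time.sleep(interval)


# Stand-in for a page that becomes ready on the third check (no browser needed)
checks = {"count": 0}


def fake_page_ready():
    checks["count"] += 1
    return checks["count"] >= 3


print(wait_until(fake_page_ready, timeout=5.0, interval=0.01))  # True
```

With a real driver the predicate could be something like `lambda: "dy-Pagination-next" in driver.page_source`. Selenium also ships its own WebDriverWait class for this purpose.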

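This code is for learning and experimentation only and must not be used for any other purpose.

5. That is about enough to count as an introduction. Crawling something complete still takes a lot more work. I have only covered some commonly used libraries, the code I tested myself, and the points I do not yet fully understand. This is meant for study only. If you spot any mistakes, please point them out so I can fix them promptly!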
