Python Crawler 02: Scraping a Girl-Picture Site with the requests Module

Reprinted article. 2019-07-04 10:24. Category: Python crawler

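Introduction

#Overview: requests lets you simulate the requests a browser makes. Compared with the urllib we used earlier, the requests API is more convenient (under the hood it wraps urllib3).

#Note: after requests downloads a page, it does not execute any JavaScript. We have to analyze the target site ourselves and issue further requests for whatever the scripts would have loaded.

#Install: pip3 install requests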

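In PyCharm:

import requests   # import the module

def run():        # define a function named run
    print("Running the script")    # print a message

if __name__ == "__main__":   # program entry point
    run()    # call the run function defined above

If you see the output below, the script ran without errors, which also confirms that requests can be imported:

Running the script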

Next, let's test whether the requests module actually works.

Change the code above to:

import requests

def run():
    response = requests.get("http://www.baidu.com")
    print(response.text)

if __name__ == "__main__":
    run()

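Output (if you see HTML like the following, the request succeeded). The Chinese text looks garbled most likely because the response did not declare a charset in its Content-Type header, so requests decoded it as ISO-8859-1; setting response.encoding = "utf-8" before reading response.text gives readable text: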

<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>ç¾åº¦ä¸ä¸ï¼ä½ å°±ç¥é</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=ç¾åº¦ä¸ä¸ class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>æ°é»</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>å°å¾</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>è§é¢</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>è´´å§</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>ç»å½</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">ç»å½</a>');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">æ´å¤äº§å</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>å³äºç¾åº¦</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>ä½¿ç¨ç¾åº¦åå¿è¯»</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>æè§åé¦</a>&nbsp;äº¬ICPè¯030173å·&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>
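The garbled characters in output like this are what UTF-8 bytes look like when they are decoded with the wrong codec. A minimal sketch of the effect, using a hard-coded byte string as a hypothetical stand-in for response.content, so no network access is needed:

```python
# Hypothetical stand-in for response.content: the UTF-8 bytes of a page title.
raw = "百度一下，你就知道".encode("utf-8")

# Decoding UTF-8 bytes as ISO-8859-1 never fails, but it yields mojibake.
garbled = raw.decode("iso-8859-1")

# Decoding with the right codec recovers the text.
fixed = raw.decode("utf-8")

# The damage is reversible as long as no characters were dropped.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled))
print(fixed)
```

With requests, the equivalent is to set response.encoding = "utf-8" (or response.encoding = response.apparent_encoding) before reading response.text.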

Next, let's actually download an image, for example the one at this address:

Image URL: https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1562215040437&di=aa3fc27e7acb5ded2643b315497cfce2&imgtype=0&src=http%3A%2F%2Fimg.9ku.com%2Fgeshoutuji%2Fsingertuji%2F4%2F4779%2F4779_9.jpg

import requests

def run():
    response = requests.get("https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1562215040437&di=aa3fc27e7acb5ded2643b315497cfce2&imgtype=0&src=http%3A%2F%2Fimg.9ku.com%2Fgeshoutuji%2Fsingertuji%2F4%2F4779%2F4779_9.jpg")
    # the with block closes the file automatically, so no explicit close() is needed
    with open("Alizee.jpg", "wb") as f:
        f.write(response.content)

if __name__ == "__main__":
    run()

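After running the code, a new file, Alizee.jpg, appears in the working directory.

Opening it shows that the image displays normally, so the download succeeded.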
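A note on the file handling: a with open(...) block closes the file when the block exits, even if the write raises, so no explicit close call is needed. A minimal sketch, in which save_bytes and the payload are hypothetical stand-ins for writing response.content:

```python
import os
import tempfile

def save_bytes(path, data):
    """Write bytes to path; the with block closes the file on exit."""
    with open(path, "wb") as f:
        f.write(data)
    return f.closed  # True: the file was closed when the block ended

# Hypothetical payload standing in for response.content.
payload = b"\xff\xd8\xff\xe0 fake image bytes"
target = os.path.join(tempfile.mkdtemp(), "demo.jpg")

closed = save_bytes(target, payload)
with open(target, "rb") as f:
    round_trip = f.read()
print(closed, round_trip == payload)
```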

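Let's keep modifying the code. Some servers restrict access to their images: the image opens fine in a browser, but plain Python code cannot download it completely.

Modify the code to add request headers: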

import requests

def run():
    # request headers; headers is a dict
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.5383.400 QQBrowser/10.0.1313.400"
    }
    response = requests.get("https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1562215040437&di=aa3fc27e7acb5ded2643b315497cfce2&imgtype=0&src=http%3A%2F%2Fimg.9ku.com%2Fgeshoutuji%2Fsingertuji%2F4%2F4779%2F4779_9.jpg", headers=headers)
    # the with block closes the file automatically
    with open("Alizee.jpg", "wb") as f:
        f.write(response.content)

if __name__ == "__main__":
    run()

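Look closely at the requests.get call in the code above: it now passes a headers argument, so the request carries a browser-like User-Agent. With that, the program downloads the complete image.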

Analyzing the Pages

The site we are scraping today is http://www.umei.cc/bizhitupian/meinvbizhi

Be warned: some of the images are rather risqué, so use your own discretion.

import requests

all_urls = []  # the list-page URLs we build

class Spider():
    # constructor: stores the initial data
    def __init__(self, target_url, headers):
        self.target_url = target_url
        self.headers = headers

    # build every URL we want to crawl
    def getUrls(self, start_page, page_num):
        global all_urls
        # loop over the page numbers and build each URL
        for i in range(start_page, page_num + 1):
            url = self.target_url % i
            all_urls.append(url)


if __name__ == "__main__":
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
        'HOST': 'www.umei.cc',
    }
    target_url = "http://www.umei.cc/bizhitupian/meinvbizhi/%d.htm"  # URL pattern of the list pages

    spider = Spider(target_url, headers)
    spider.getUrls(1, 16)
    print(all_urls)

All the URLs are now stored in the all_urls list:

['http://www.umei.cc/bizhitupian/meinvbizhi/1.htm', 'http://www.umei.cc/bizhitupian/meinvbizhi/2.htm', 'http://www.umei.cc/bizhitupian/meinvbizhi/3.htm', 'http://www.umei.cc/bizhitupian/meinvbizhi/4.htm', 'http://www.umei.cc/bizhitupian/meinvbizhi/5.htm', 'http://www.umei.cc/bizhitupian/meinvbizhi/6.htm', 'http://www.umei.cc/bizhitupian/meinvbizhi/7.htm', 'http://www.umei.cc/bizhitupian/meinvbizhi/8.htm', 'http://www.umei.cc/bizhitupian/meinvbizhi/9.htm', 'http://www.umei.cc/bizhitupian/meinvbizhi/10.htm', 'http://www.umei.cc/bizhitupian/meinvbizhi/11.htm', 'http://www.umei.cc/bizhitupian/meinvbizhi/12.htm', 'http://www.umei.cc/bizhitupian/meinvbizhi/13.htm', 'http://www.umei.cc/bizhitupian/meinvbizhi/14.htm', 'http://www.umei.cc/bizhitupian/meinvbizhi/15.htm', 'http://www.umei.cc/bizhitupian/meinvbizhi/16.htm']

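The code above takes a little Python background to follow, but if you read it carefully there are only a few key points.

The first is class Spider(): we declare a class, then use def __init__ to define its constructor. With a tutorial you can pick this up in about 30 minutes.

There are many ways to build the URLs. I use the most direct one, string formatting: the % operator substitutes the page number for %d in the pattern.

Note the global variable all_urls in the code above. It stores the URLs of all the list pages, which are the pages we download next.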
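The URL-building step can also be written without a global variable. A minimal sketch, in which the helper name build_page_urls is hypothetical; it returns the same 16 addresses that getUrls(1, 16) collects:

```python
def build_page_urls(pattern, start_page, end_page):
    """Return one URL per page number, inclusive of both ends."""
    return [pattern % page for page in range(start_page, end_page + 1)]

pattern = "http://www.umei.cc/bizhitupian/meinvbizhi/%d.htm"
urls = build_page_urls(pattern, 1, 16)
print(len(urls), urls[0], urls[-1])
```

Returning the list, rather than appending to a module-level one, makes the function easier to test and reuse.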

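Next comes the core of the crawler.

We need to analyze the structure of the page. First open http://www.umei.cc/bizhitupian/meinvbizhi/1.htm , then right-click and choose Inspect.

The page source shows that every image entry sits inside an li tag.

Next we collect the title and the image link for each entry.

We crawl with multiple threads here, using a design I'll loosely call the observer pattern; strictly speaking, it is the producer-consumer pattern.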

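import threading   # threading module
from lxml import etree # lxml, used for XPath parsing
import time # time module

We add a new global variable, and because several threads will touch it, we also need a thread lock.

all_img_urls = []       # list of image detail-page URLs
g_lock = threading.Lock()  # create a lock

Declare a producer class that keeps fetching image detail-page addresses and adds them to the global all_img_urls.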

# Producer: extracts the detail-page links from each list page
class Producer(threading.Thread):

    def run(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
            'HOST': 'www.umei.cc'
        }
        global all_urls, all_img_urls
        while True:
            # check and pop under the lock so two threads never race for the last URL
            with g_lock:
                if not all_urls:
                    break
                page_url = all_urls.pop()  # pop removes and returns the last element
            try:
                print("Analyzing " + page_url)
                response = requests.get(page_url, headers=headers, timeout=2)
                html_data = etree.HTML(response.text)
                # the href of each a.TypeBigPics link is a detail-page address
                all_pic_link = html_data.xpath("//a[@class='TypeBigPics']/@href")
                print(all_pic_link)
                with g_lock:  # the lock is released even if an exception is raised
                    all_img_urls += all_pic_link  # += extends the list in place, like list.extend
                    print(all_img_urls)
                time.sleep(0.5)
            except Exception as e:
                # report the failure instead of silently swallowing it
                print("Failed to process", page_url, e)

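The code above uses inheritance: I derived a subclass from threading.Thread. For the basics of inheritance, the Runoob (菜鳥教程) tutorial at http://www.runoob.com/python3/python3-class.html is enough.

About the thread lock: while one thread is calling all_urls.pop(), we do not want other threads modifying the list at the same time, or unexpected things can happen. So we lock the resource with g_lock.acquire() and, once we are done, must release it right away with g_lock.release(); otherwise the lock stays held and the program cannot make progress. Writing the block as with g_lock: performs the acquire and the release automatically, even when an exception is raised.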
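The locking rule can be tried without any network access. A minimal sketch, assuming a hypothetical list of 100 work items drained by four threads; the emptiness check and the pop sit inside the same with block, so no thread can empty the list between another thread's check and its pop:

```python
import threading

lock = threading.Lock()
pending = list(range(100))   # hypothetical work items
done = []

def worker():
    while True:
        # Test and pop inside one critical section.
        with lock:
            if not pending:
                return
            item = pending.pop()
        # Real work would happen here, outside the lock;
        # only access to the shared lists is guarded.
        with lock:
            done.append(item)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(done), len(set(done)))
```

Every item ends up in done exactly once, whichever thread happened to pick it up.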

To pick the URLs out of the page, I parse it with XPath.
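To see what such an expression selects, here is a minimal sketch that swaps in the standard library's xml.etree.ElementTree for lxml, so it runs without extra packages. The sample markup and its URLs are hypothetical, and ElementTree understands only a subset of XPath:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample of a list page. Real pages are messier
# HTML, which is why the article uses lxml's forgiving etree.HTML parser.
sample = """
<ul>
  <li><a class="TypeBigPics" href="http://www.umei.cc/a/1.htm"><img src="t1.jpg"/></a></li>
  <li><a class="TypeBigPics" href="http://www.umei.cc/a/2.htm"><img src="t2.jpg"/></a></li>
  <li><a class="Other" href="http://www.umei.cc/skip.htm">skip</a></li>
</ul>
"""

root = ET.fromstring(sample)
# ElementTree can select elements by attribute value but cannot return
# attributes directly (no trailing /@href), so href is read from each match.
links = [a.get("href") for a in root.findall(".//a[@class='TypeBigPics']")]
print(links)
```

With lxml, the expression //a[@class='TypeBigPics']/@href returns the attribute values directly as a list of strings.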

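I wrapped the error-prone code in try: ... except:, so that one failed page does not kill the whole thread. You can of course define your own exceptions as well.

If everything above works, we can write the following at the program entry point: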

for x in range(2):
    t = Producer()
    t.start()

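Run the program. Because Producer inherits from threading.Thread, the one method you must provide is def run, which you have already seen in the code above. Now we can run it.

After it finishes, the list of image detail pages is stored in all_img_urls.

Next, I want to wait until every image detail-page URL has been collected before moving on to the next analysis step.

Add the following code: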

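threads = []
# start two threads to fetch the pages
for x in range(2):
    t = Producer()
    t.start()
    #threads.append(t)

# for tt in threads:
#     tt.join()

print("Reached this point")

Now uncomment the threads.append(t) and tt.join() lines above.

You will see an essential difference. Because the program is multithreaded, without join the line print("Reached this point") runs as soon as the threads have been started, without waiting for them to finish. Once the key call tt.join() is added, the main thread waits until all child threads have finished before it continues. That satisfies the condition I just described: collect the full set of image detail-page URLs first.

What join does is thread synchronization: when the main thread reaches join it blocks, waits for the child thread to finish, and only then continues. You will run into this often.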
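The effect of join is easy to observe in isolation. A minimal sketch, in which slow_task is a hypothetical stand-in for a thread that fetches a page:

```python
import threading
import time

results = []

def slow_task(n):
    time.sleep(0.05)      # pretend to do some work
    results.append(n)     # list.append is thread-safe in CPython

threads = [threading.Thread(target=slow_task, args=(i,)) for i in range(3)]
for t in threads:
    t.start()

before_join = len(results)   # usually 0: the workers are still sleeping

for t in threads:
    t.join()                 # block until this worker has finished

after_join = len(results)    # every worker is done by now
print(before_join, after_join)
```

The count taken before join depends on timing, while the count taken after join is always the number of threads.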

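Next we write a consumer (what I called the observer), which keeps watching the list of image detail pages we just collected.

Add a global variable to store the image links we extract:

pic_links = []            # list of image addresses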

# Consumer
class Consumer(threading.Thread):
    def run(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
            'HOST': 'www.umei.cc'
        }
        global all_img_urls, pic_links  # the shared lists
        print("%s is running " % threading.current_thread().name)
        while True:
            # check and pop under the lock to avoid racing for the last URL
            with g_lock:
                if not all_img_urls:
                    break
                img_url = all_img_urls.pop()
            try:
                response = requests.get(img_url, headers=headers)
                html_data = etree.HTML(response.content.decode())
                title = html_data.xpath("//div[@class='ArticleTitle']/strong/text()")
                all_pic_src = html_data.xpath("//div[@class='ImageBody']/p/a/img/@src")
                pic_dict = {title[0]: all_pic_src[0]}  # dict mapping title -> image URL
                with g_lock:
                    pic_links.append(pic_dict)  # a list of dicts
                    print(pic_links)
                    #print(title[0] + " fetched")
            except Exception as e:
                print("Something went wrong:", img_url, e)
            time.sleep(0.5)

# start 10 threads to fetch the image links
for x in range(10):
    ta = Consumer()
    ta.start()

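Run the program; it prints a list of dictionaries.

Next comes the image-saving step we covered at the start. Same procedure as before: write a custom class.

Once we have the image links, we need to download them. The code below first creates a directory named after the title we extracted, and then creates the image file inside that directory.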
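The directory-then-file step can be sketched on its own. This is a minimal illustration, not the article's code: target_path is a hypothetical helper, the character-replacement rule is an assumption, and a temporary directory is used so that nothing is written next to the script:

```python
import os
import re
import tempfile

def target_path(base_dir, title):
    """Create base_dir/<title>/ if needed and return the .jpg path inside it."""
    # Strip whitespace and replace characters that are not allowed in file
    # names on common platforms (assumed rule, adjust as needed).
    name = re.sub(r'[\\/:*?"<>|]', "_", title.strip())
    folder = os.path.join(base_dir, name)
    # exist_ok=True keeps this safe when several threads create the same folder.
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, name + ".jpg")

base = tempfile.mkdtemp()
first = target_path(base, "  sample/title  ")
second = target_path(base, "  sample/title  ")   # a second call must not raise
print(os.path.basename(first), os.path.isdir(os.path.dirname(first)))
```

Cleaning the title matters because it comes from the scraped page and may contain characters such as / that would otherwise be read as path separators.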

File operations are involved, so we import one more module:

import os  # directory and file-system operations

class DownPic(threading.Thread):

    def run(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
        }
        global pic_links
        # an endless loop that keeps watching the image-link list for new entries;
        # it never exits, so stop the program manually once the downloads finish
        while True:
            with g_lock:  # the lock is released on every path out of this block
                pic = pic_links.pop() if pic_links else None
            if pic is None:
                time.sleep(0.1)  # nothing to download yet; wait briefly instead of spinning
                continue
            # iterate over the title -> URL pairs in the dict
            for key, value in pic.items():
                print("==================", key, value)
                path = key.strip()
                if not os.path.exists(path):
                    # exist_ok avoids an error if another thread creates the directory first
                    os.makedirs(path, exist_ok=True)
                    print(path + ' directory created')
                else:
                    print(path + ' directory already exists')
                filename = os.path.join(path, path + ".jpg")
                if os.path.exists(filename):
                    continue
                response = requests.get(url=value, headers=headers)
                with open(filename, 'wb') as f:
                    f.write(response.content)

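Then add the following to the main program:

# start 10 threads to save the images
for x in range(10):
    down = DownPic()
    down.start()

Run the program; the downloaded images and their folders appear in the working directory.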
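Shared lists plus a lock work, but a download loop that polls the list has no way of knowing when to stop. One common alternative (not what the code in this article does) is the standard library's queue.Queue, which handles the locking internally and lets a sentinel value tell each consumer to exit. A minimal sketch, with hypothetical names and fake work in place of real downloads:

```python
import queue
import threading

STOP = object()              # sentinel: tells a consumer to exit
tasks = queue.Queue()
saved = []
saved_lock = threading.Lock()

def consumer():
    while True:
        item = tasks.get()   # blocks until an item is available; no polling
        if item is STOP:
            break
        title, url = item
        with saved_lock:
            saved.append((title, url))   # stand-in for downloading and saving

NUM_CONSUMERS = 3
workers = [threading.Thread(target=consumer) for _ in range(NUM_CONSUMERS)]
for w in workers:
    w.start()

# Producer side: hypothetical (title, url) pairs.
for i in range(10):
    tasks.put(("title-%d" % i, "http://example.com/%d.jpg" % i))

# One sentinel per consumer, queued after all the real work.
for _ in workers:
    tasks.put(STOP)
for w in workers:
    w.join()

print(len(saved))
```

Because the queue is first-in, first-out, every real item is taken before any sentinel, so all consumers finish their work and then exit cleanly.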

Complete code:

import requests

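import threading   # threading module
from lxml import etree # lxml, used for XPath parsing
import time # time module

import os

all_img_urls = []       # list of image detail-page URLs
g_lock = threading.Lock()  # create a lock

pic_links = []            # list of image addresses

all_urls = []  # the list-page URLs we build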

class Spider():
    # constructor: stores the initial data
    def __init__(self, target_url, headers):
        self.target_url = target_url
        self.headers = headers

    # build every URL we want to crawl
    def getUrls(self, start_page, page_num):
        global all_urls
        # loop over the page numbers and build each URL
        for i in range(start_page, page_num + 1):
            url = self.target_url % i
            all_urls.append(url)


# Producer: extracts the detail-page links from each list page
class Producer(threading.Thread):

    def run(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
            'HOST': 'www.umei.cc'
        }
        global all_urls, all_img_urls
        while True:
            # check and pop under the lock so two threads never race for the last URL
            with g_lock:
                if not all_urls:
                    break
                page_url = all_urls.pop()  # pop removes and returns the last element
            try:
                print("Analyzing " + page_url)
                response = requests.get(page_url, headers=headers, timeout=2)
                html_data = etree.HTML(response.text)
                all_pic_link = html_data.xpath("//a[@class='TypeBigPics']/@href")
                print(all_pic_link)
                with g_lock:  # the lock is released even if an exception is raised
                    all_img_urls += all_pic_link  # += extends the list in place, like list.extend
                    #print(all_img_urls)
                time.sleep(0.5)
            except Exception as e:
                # report the failure instead of silently swallowing it
                print("Failed to process", page_url, e)


# Consumer
class Consumer(threading.Thread):
    def run(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
            'HOST': 'www.umei.cc'
        }
        global all_img_urls, pic_links  # the shared lists
        print("%s is running " % threading.current_thread().name)
        while True:
            # check and pop under the lock to avoid racing for the last URL
            with g_lock:
                if not all_img_urls:
                    break
                img_url = all_img_urls.pop()
            try:
                response = requests.get(img_url, headers=headers)
                html_data = etree.HTML(response.content.decode())
                title = html_data.xpath("//div[@class='ArticleTitle']/strong/text()")
                all_pic_src = html_data.xpath("//div[@class='ImageBody']/p/a/img/@src")
                pic_dict = {title[0]: all_pic_src[0]}  # dict mapping title -> image URL
                with g_lock:
                    pic_links.append(pic_dict)  # a list of dicts
                    print(pic_links)
                    #print(title[0] + " fetched")
            except Exception as e:
                print("Something went wrong:", img_url, e)
            time.sleep(0.5)


class DownPic(threading.Thread):

    def run(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
        }
        global pic_links
        # an endless loop that keeps watching the image-link list for new entries;
        # it never exits, so stop the program manually once the downloads finish
        while True:
            with g_lock:  # the lock is released on every path out of this block
                pic = pic_links.pop() if pic_links else None
            if pic is None:
                time.sleep(0.1)  # nothing to download yet; wait briefly instead of spinning
                continue
            # iterate over the title -> URL pairs in the dict
            for key, value in pic.items():
                print("==================", key, value)
                path = key.strip()
                if not os.path.exists(path):
                    # exist_ok avoids an error if another thread creates the directory first
                    os.makedirs(path, exist_ok=True)
                    print(path + ' directory created')
                else:
                    print(path + ' directory already exists')
                filename = os.path.join(path, path + ".jpg")
                if os.path.exists(filename):
                    continue
                response = requests.get(url=value, headers=headers)
                with open(filename, 'wb') as f:
                    f.write(response.content)



if __name__ == "__main__":
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
        'HOST': 'www.umei.cc',
    }
    target_url = "http://www.umei.cc/bizhitupian/meinvbizhi/%d.htm"  # URL pattern of the list pages

    spider = Spider(target_url, headers)
    spider.getUrls(1, 16)

    threads = []
    # start two threads to fetch the list pages
    for x in range(2):
        t = Producer()
        t.start()
        threads.append(t)

    # wait until every producer has finished
    for tt in threads:
        tt.join()

    print("Reached this point")

    # start 10 threads to extract the image links
    for x in range(10):
        ta = Consumer()
        ta.start()

    # start 10 threads to save the images
    for x in range(10):
        down = DownPic()
        down.start()
