Python Web Crawler: Scraping the Girl Pictures on Jandan.net

This article is a repost. 2017-12-22 19:08. Categories: Web Crawlers / Programming

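A few days ago I finished learning Python network programming. The book had no hands-on projects, so I had to look online for something to build.
I have long been curious about crawlers, so I might as well start there.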

Python version: 3.6

This is the tutorial I followed: Python - Jack -Cui -CSDN

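I learned the basics of two libraries, urllib and beautifulsoup, and read parts of the official documentation, enough to get a general feel for how both are used.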

urllib fetches the content behind a URL, such as an HTML document. beautifulsoup parses HTML documents, much like DOM manipulation in JS. The basic flow looks like this:

from urllib import request
from bs4 import BeautifulSoup
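# urllib part
url = 'http://blog.csdn.net/'
# Set the request headers. At minimum set User-Agent; otherwise urllib sends its own default (Python-urllib/x.y), which sites with anti-crawler checks recognize immediately.
# Many sites respond with Content-Encoding: gzip. In that case the body must be gunzipped before it is printed or parsed; for how to do that, see: https://www.jianshu.com/p/2c2781462902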
headers = {
    'Connection' : 'keep-alive',
    'Cache-Control' : 'max-age=0',
    'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36'
}
req = request.Request(url, headers=headers)
response = request.urlopen(req)
html = response.read()
# Print the HTML content (raw bytes)
print(html)
# beautifulsoup part
soup = BeautifulSoup(html, 'lxml')
# Select all <a> tags
a_tags = soup.find_all('a')
# Print the text of each <a> tag (its string attribute)
for item in a_tags:
    print(item.string)
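
The comment above mentions gzip-compressed responses without showing the decompression step. Here is a minimal sketch of one way to handle it, assuming the body arrives as bytes and the Content-Encoding value is already known; the helper name decode_body and the sample markup are made up for illustration, and the response is simulated so nothing touches the network.

```python
import gzip


def decode_body(body, content_encoding=None, charset='utf-8'):
    """Return the response body as text, gunzipping it first when needed."""
    if content_encoding == 'gzip':
        body = gzip.decompress(body)
    return body.decode(charset)


# Simulate a gzip-compressed response instead of making a real request.
sample = gzip.compress('<a href="/">home</a>'.encode('utf-8'))
print(decode_body(sample, 'gzip'))
```

With urllib, the header value would come from response.getheader('Content-Encoding') and the bytes from response.read().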

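That covers the basics, so now let's crawl Jandan.
Start with the page source: the raw HTML contains no image links at all, only an img tag with no URL and a span tag marked img-hash, whose text holds the hash.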
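
To make that page structure concrete, here is a small sketch that pulls the hash strings out of such markup. It uses the standard library's html.parser instead of BeautifulSoup only so that it runs without third-party packages; the sample HTML and the hash values in it are invented for illustration and are not copied from the real site.

```python
from html.parser import HTMLParser


class ImgHashParser(HTMLParser):
    """Collect the text inside every <span class="img-hash"> element."""

    def __init__(self):
        super().__init__()
        self.hashes = []
        self._in_hash_span = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'span' and 'img-hash' in classes:
            self._in_hash_span = True

    def handle_endtag(self, tag):
        if tag == 'span':
            self._in_hash_span = False

    def handle_data(self, data):
        if self._in_hash_span and data.strip():
            self.hashes.append(data.strip())


# Made-up markup imitating the structure described above.
sample_html = '''
<li><img src="" /><span class="img-hash">aGFzaC1vbmU</span></li>
<li><img src="" /><span class="img-hash">aGFzaC10d28</span></li>
'''
parser = ImgHashParser()
parser.feed(sample_html)
print(parser.hashes)
```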

Next, look through the JS file for the jandan_load_img() function.

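The analysis shows that the span's img-hash value and a fixed hash are passed as arguments to a function whose name starts with f_, and the return value is inserted directly into the href of an a tag in the HTML document, so the return value is evidently the real image URL. That gives the plan: mimic the JS function in Python by first taking the img-hash value and then passing it to a decoding function (the counterpart of the f_ function) to get the image URL. Looking at jandan_load_img() again, once the URL has been recovered, if the file extension is .gif the string 'thumb180' is put into it, which is easy to do.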
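
The .gif step can be sketched on its own. The version below assumes the URL has the shape host/size-segment/file.gif on a sinaimg.cn host and swaps the size segment for 'thumb180'; the helper name to_thumb180 and the sample URLs are hypothetical.

```python
import re

# Matches //<host>.sinaimg.cn/<size segment>/<file>.gif at the end of a URL.
GIF_URL = re.compile(r'(//\w+\.sinaimg\.cn/)(\w+)(/.+\.gif)$')


def to_thumb180(url):
    """Replace the size segment of a sinaimg .gif URL with 'thumb180'."""
    match = GIF_URL.search(url)
    if not match:
        return url  # not a .gif on sinaimg.cn, leave it untouched
    return url[:match.start()] + match.group(1) + 'thumb180' + match.group(3)


print(to_thumb180('http://wx1.sinaimg.cn/mw600/abc123.gif'))
print(to_thumb180('http://wx1.sinaimg.cn/mw600/abc123.jpg'))
```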
Here is the full source:

#!/usr/bin/env python3
from bs4 import BeautifulSoup
from urllib import request
import argparse
import hashlib
import base64
import gzip
import time
import os
import io
import re

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36',
    'Accept-Encoding':'gzip, deflate',
    'Accept-Language':'zh-CN,zh;q=0.9',
    }

def md5(src):
    m = hashlib.md5()
    m.update(src.encode('utf-8'))
    return m.hexdigest()

def decode_base64(data):
    # Pad to a multiple of 4 characters; -len(data) % 4 is 0 when no padding is needed
    missing_padding = -len(data) % 4
    if missing_padding:
        data += '=' * missing_padding
    return base64.b64decode(data)

def calculate_url(img_hash, constant):
    k = 'DECODE'
    q = 4
    constant = md5(constant)
    o = md5(constant[0:16])
    #n = md5(constant[16:16])
    n = md5(constant[16:32])
    
    l = img_hash[0:q]
    c = o+md5(o + l)

    img_hash = img_hash[q:]
    k = decode_base64(img_hash)
    h = []
    for g in range(256):
        h.append(g)

    b = []
    for g in range(256):
        b.append(ord(c[g % len(c)]))

    f = 0
    for g in range(256):
        f = (f + h[g] + b[g]) % 256
        tmp = h[g]
        h[g] = h[f]
        h[f] = tmp

    t = ""
    f = 0
    p = 0
    for g in range(len(k)):
        p = (p + 1) % 256
        f = (f + h[p]) % 256
        tmp = h[p]
        h[p] = h[f]
        h[f] = tmp
        t += chr(k[g] ^ (h[(h[p] + h[f]) % 256]))
    t = t[26:]

    return t

def get_raw_html(url):
    req = request.Request(url=url, headers=headers)
    response = request.urlopen(req)
    text = response.read()
    encoding = response.getheader('Content-Encoding')
    if encoding == 'gzip':
        buf = io.BytesIO(text)
        translated_raw = gzip.GzipFile(fileobj=buf)
        text = translated_raw.read()

    text = text.decode('utf-8')
    return text

def get_soup(html):
    soup = BeautifulSoup(html, 'lxml')
    return soup

def get_preurl(soup):
    preurl = 'http:'+soup.find(class_='previous-comment-page').get('href')
    return preurl

def get_hashesAndConstant(soup, html):
    hashes = []
    for each in soup.find_all(class_='img-hash'):
        hashes.append(each.string)

    js = re.search(r'<script\ssrc=\"\/\/(cdn.jandan.net\/static\/min\/.*?)\">.*?<\/script>', html)
    jsFileURL = 'http://'+js.group(1)
    jsFile = get_raw_html(jsFileURL)

    target_func = re.search(r'f_\w*?\(e,\"(\w*?)\"\)', jsFile)
    constant_hash = target_func.group(1)

    return hashes, constant_hash

def download_images(urls):
    if not os.path.exists('downloads'):
        os.makedirs('downloads')
    for url in urls:
        filename = ''
        file_suffix = re.match(r'.*(\.\w+)', url).group(1)
        filename = md5(str(time.time()))+file_suffix
        request.urlretrieve(url, 'downloads/'+filename)
        time.sleep(3)

def spider(url, page):
    #get hashes, constant-hash, previous page's url
    html = get_raw_html(url)
    soup = get_soup(html)

    params = get_hashesAndConstant(soup, html)
    hashes = params[0]
    constant_hash = params[1]

    preurl = get_preurl(soup)
    
    urls = []
    index = 1
    for each in hashes:
        real_url = 'http:'+calculate_url(each, constant_hash)
        # real_url starts with 'http:', so the pattern must include the scheme for re.match to succeed
        replace = re.match(r'(http://\w+\.sinaimg\.cn/)(\w+)(/.+\.gif)', real_url)
        if replace:
            real_url = replace.group(1)+'thumb180'+replace.group(3)
        urls.append(real_url)
        index += 1

    download_images(urls)

    page -= 1
    if page > 0:
        spider(preurl, page)


if __name__ == '__main__':
    #user interface
    parser = argparse.ArgumentParser(description='download images from Jandan.net')
    parser.add_argument('-p', metavar='PAGE', default=1, type=int, help='the number of pages you want to download (default 1)')
    args = parser.parse_args()
    
    #start crawling
    url = 'http://jandan.net/ooxx/'
    spider(url, args.p)
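
A note on calculate_url(): its three loops (fill a 256-entry table, shuffle it with the key, then XOR each input byte with a byte drawn from the table) look like the classic RC4 stream cipher. The sketch below is a plain RC4 written for comparison, not the site's own code; it leaves out the key derivation with md5 and the step that drops the first 26 characters of the result.

```python
def rc4(key, data):
    """Encrypt or decrypt bytes with RC4 (the operation is symmetric)."""
    # Key scheduling: start from the identity table and shuffle it with the key.
    state = list(range(256))
    j = 0
    for i in range(256):
        j = (j + state[i] + key[i % len(key)]) % 256
        state[i], state[j] = state[j], state[i]

    # Keystream generation: keep swapping and XOR one table byte per input byte.
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + state[i]) % 256
        state[i], state[j] = state[j], state[i]
        out.append(byte ^ state[(state[i] + state[j]) % 256])
    return bytes(out)


secret = rc4(b'Key', b'Plaintext')
print(secret.hex())
print(rc4(b'Key', secret))  # applying RC4 again with the same key decrypts
```

Because encryption and decryption are the same operation, a Python port only needs the right key and the base64-decoded bytes to recover the URL.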

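To avoid being flagged as a crawler, the script sends a browser-like User-Agent header and pauses for 3 s after each image. If that is too slow for you, shorten the pause.

When running the script you can pass the -p option, which is the number of pages to download (default 1).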

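Downloaded images are saved in a downloads folder under the current directory (it is created if it does not exist).
PS: Windows users, take note! "Current directory" here does not mean the location of this Python file; it means the current working directory in cmd. I wrote this on Linux first, and when I later tested on Windows I could not find the downloads folder. I went over the source for a long time before realizing it was a path issue.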
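
If you would rather have the folder sit next to the script regardless of where the command is run from, one option is to build the path from the script's own location. This is a sketch of that idea and not what the script above does; downloads_dir_for is a hypothetical helper, and the example path is a placeholder.

```python
import os


def downloads_dir_for(script_path, name='downloads'):
    """Return an absolute path to a folder located beside the given script."""
    script_dir = os.path.dirname(os.path.abspath(script_path))
    return os.path.join(script_dir, name)


# Inside the script itself you would pass __file__ as script_path.
print(downloads_dir_for(os.path.join('some', 'folder', 'spider.py')))
```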

PPS: This project is also on my Github (there is even an .exe file waiting for you to discover, heh).

References:

http://blog.csdn.net/c406495762/article/details/71158264

http://blog.csdn.net/van_brilliant/article/details/78723878
