Scraping Images from Toutiao (今日頭條)

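Note: this post is mainly my notes from watching Jingmi's (靜覓) tutorial videos. Original tutorial: https://cuiqingcai.com/

I'm very much a beginner and learning slowly. I've only been at this for 2 days, so corrections are welcome.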

I. Workflow overview

1. Analyze the Toutiao site

2. Fetch the index pages

3. Fetch the detail pages

4. Download the images and save the results to the database

II. Implementation

2.1 Analyzing the Toutiao site

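1. First, visit Toutiao and enter a keyword to reach the index page. We need to analyze the site to get the URLs that lead to the detail pages.

2. Inspecting the contents of data in the response shows the URL of each detail page, so this is the information we will need to extract.

3. As you scroll down and more gallery entries load, many new Ajax requests appear. This tells us that changing the offset parameter returns a different index page, and therefore a different set of gallery detail-page URLs.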

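4. Next, analyze the source of a gallery detail page to find the image URLs. I ran into a pitfall here: when I used the "Inspect" tool in Google Chrome to analyze the page, the address changed from

　　https://m.toutiao.com/a6511830952644182542/

to

　　https://m.toutiao.com/a6511830952644182542/

and the image information no longer appeared under Doc. Being a beginner, I searched for a long time without finding it. I then tried another browser and found that Firefox does not behave this way, so I used Firefox for the rest of the analysis.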


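Further analysis of the page source shows where the image URLs are: in gallery. That essentially completes the page analysis; what remains is to implement it in code.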
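The index-page steps above can be sketched offline as follows. This is only a minimal illustration under assumptions: the helper names `build_index_url` and `extract_article_urls` and the sample response body are hypothetical, and the endpoint and parameters are the ones used in this post, which may no longer match the live site.

```python
import json
from urllib.parse import urlencode

# Endpoint as used in this post; the live site may have changed since.
INDEX_URL = 'https://www.toutiao.com/search_content/'


def build_index_url(offset, keyword, count=20):
    """Return the Ajax index URL for one page of search results."""
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': count,
        'cur_tab': 1,
    }
    return INDEX_URL + '?' + urlencode(params)


def extract_article_urls(index_json):
    """Collect the article_url of every entry under the top-level "data" key."""
    payload = json.loads(index_json)
    return [item['article_url'] for item in payload.get('data') or []
            if item.get('article_url')]


# Hypothetical response body, reduced to the one field we need.
SAMPLE_INDEX = json.dumps({
    'data': [
        {'article_url': 'https://example.com/a1/'},
        {'title': 'entry without a detail link'},
        {'article_url': 'https://example.com/a2/'},
    ]
})

if __name__ == '__main__':
    for offset in (0, 20, 40):  # each scroll requests the next 20 results
        print(build_index_url(offset, '街拍'))
    print(extract_article_urls(SAMPLE_INDEX))
```

Only offset changes between requests, which is why the crawler later in this post can walk through the index pages by stepping it in multiples of 20.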

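2.2 Code implementation

I'll only cover the code briefly. After two days of study, the hard part still seems to be analyzing the site; the rest is just using tools to do the scraping.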


import json
import os
import re
from hashlib import md5  # public API; _md5 is a private CPython module
from json import JSONDecodeError
from multiprocessing import Pool

import pymongo
import requests
from bs4 import BeautifulSoup
from requests import RequestException

from config import *

# connect=False defers connecting until the first operation
client = pymongo.MongoClient(MONGO_URL, connect=False)
db = client[MONGO_DB]


def get_page_index(offset, keyword):
    """Request one index page (the Ajax search endpoint) and return its JSON text."""
    data = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': '20',
        'cur_tab': 1
    }
    headers = {'User-Agent': 'Mozilla/5.0'}
    url = 'https://www.toutiao.com/search_content/?'
    try:
        response = requests.get(url, params=data, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('error requesting index page')
        return None


def get_page_detail(url):
    """Request a gallery detail page and return its HTML, or None on failure."""
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('error requesting detail page', url)
        return None


def parse_page_detail(html, url):
    """Extract the title and image URLs from a gallery detail page."""
    soup = BeautifulSoup(html, 'lxml')
    title_tags = soup.select('title')
    title = title_tags[0].get_text() if title_tags else ''
    # Raw string so that \. \( \) reach the regex engine unchanged
    pattern = re.compile(r'gallery: JSON\.parse\("(.*?)"\),', re.S)
    gallery = re.search(pattern, html)
    if gallery:
        # The JSON sits inside a JS string literal; drop the escaping backslashes
        gallery = re.sub(r'\\', '', gallery.group(1))
        data = json.loads(gallery)
        if data and 'sub_images' in data:
            sub_images = data.get('sub_images')
            images = [item.get('url') for item in sub_images]
            for image in images:
                download_image(image)
            return {
                'title': title,
                'url': url,
                'images': images
            }
    return None


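def parse_page_index(html):
    """Yield the article URL of every entry in an index-page JSON response."""
    if not html:  # get_page_index returns None when the request fails
        return
    try:
        data = json.loads(html)
    except JSONDecodeError:
        return
    if data and 'data' in data:
        for item in data.get('data') or []:
            yield item.get('article_url')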


def save_to_mongo(result):
    # insert_one replaces Collection.insert, which was removed in PyMongo 4
    if db[MONGO_TABLE].insert_one(result):
        print('saved to MongoDB successfully', result)
        return True
    return False

def download_image(url):
    """Download one image and hand its bytes to save_image."""
    print('downloading', url)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            save_image(response.content)
    except RequestException:
        print('error downloading image', url)


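def save_image(content):
    # Name the file after the MD5 of its bytes so identical images are saved only once
    file_path = '{0}/{1}.{2}'.format(os.getcwd(), md5(content).hexdigest(), 'jpg')
    if not os.path.exists(file_path):
        with open(file_path, 'wb') as f:
            f.write(content)  # the with block closes the file on exit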


def main(offset):
    index_html = get_page_index(offset, KEYWORD)
    for url in parse_page_index(index_html):
        if url:
            detail_html = get_page_detail(url)
            if detail_html:
                result = parse_page_detail(detail_html, url)
                if result:
                    save_to_mongo(result)


if __name__ == '__main__':
    # One index-page offset per group, 20 results per page
    groups = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]
    pool = Pool()
    pool.map(main, groups)

config.py

MONGO_URL = 'localhost'
MONGO_DB = 'toutiao'
MONGO_TABLE = 'toutiao'
GROUP_START = 1
GROUP_END = 20
KEYWORD = '街拍'
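The save_image function in the listing above names each file after the MD5 digest of its bytes, so an image that has already been saved is skipped. A minimal standalone sketch of that idea follows. It assumes a variant signature that is not in the original code: a hypothetical `directory` parameter and a `(path, written)` return value, added here only so the behavior is easy to observe.

```python
import os
import tempfile
from hashlib import md5


def save_image(content, directory):
    """Write content to <directory>/<md5>.jpg unless that file already exists.

    Returns (path, written) so the caller can tell whether a new file was created.
    """
    path = os.path.join(directory, md5(content).hexdigest() + '.jpg')
    if os.path.exists(path):
        return path, False
    with open(path, 'wb') as f:
        f.write(content)
    return path, True


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        print(save_image(b'fake image bytes', tmp))  # new file is written
        print(save_image(b'fake image bytes', tmp))  # same bytes, skipped
```

Because the name depends only on the content, the same picture reached through two different galleries ends up as a single file on disk.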

Problems encountered:

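1. When matching with a regular expression, if the text you want to match contains characters such as '(', ')' or '.', you must put a '\' in front of each of them in the pattern, and it is safest to write the pattern as a raw string:

　　　　　　 pattern = re.compile(r'gallery: JSON\.parse\("(.*?)"\),', re.S)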

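2. db = client[MONGO_DB] must use square brackets, not parentheses. Parentheses would try to call the client object as a function, so the database cannot be accessed.

3. I could not find the image URLs in Google Chrome; switching to Firefox found them right away (lol).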
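Problem 1 above can be reproduced on a small sample. The sketch below is illustrative only: `SAMPLE_HTML` is a hypothetical fragment shaped like the gallery line described in section 2.1, and `extract_image_urls` is a name introduced here. It compares the hand-escaped pattern, the same pattern built with `re.escape`, and the unescaped version that silently fails to match.

```python
import json
import re

# Hand-escaped, written as a raw string so the backslashes reach the regex engine.
GALLERY_RE = re.compile(r'gallery: JSON\.parse\("(.*?)"\),', re.S)

# Same pattern, with the literal parts escaped by re.escape instead of by hand.
GALLERY_RE_AUTO = re.compile(
    re.escape('gallery: JSON.parse("') + r'(.*?)' + re.escape('"),'), re.S)

# Unescaped: ( and ) become a capture group and . matches any character.
UNESCAPED = re.compile('gallery: JSON.parse("(.*?)"),', re.S)

# Hypothetical page fragment (raw string, so the backslashes are literal).
SAMPLE_HTML = r'''
    gallery: JSON.parse("{\"sub_images\":[{\"url\":\"http:\\/\\/example.com\\/1.jpg\"}]}"),
    siblingList: [],
'''


def extract_image_urls(html, pattern=GALLERY_RE):
    """Return the image URLs found in the gallery JSON, or [] if none is found."""
    match = pattern.search(html)
    if match is None:
        return []
    # Same decoding as the parser in this post: drop the escaping backslashes.
    data = json.loads(match.group(1).replace('\\', ''))
    return [item['url'] for item in data.get('sub_images', [])]


if __name__ == '__main__':
    print(extract_image_urls(SAMPLE_HTML))
    print(extract_image_urls(SAMPLE_HTML, GALLERY_RE_AUTO))
    print(UNESCAPED.search(SAMPLE_HTML))
```

The unescaped pattern compiles without error, which is what makes this mistake easy to miss: it simply never matches, because it expects a quote where the page has an opening parenthesis.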

After running the script the images are downloaded, and then you can look at them... emmmm, I'm here to learn the technique, not to look at the pictures.
