Crawlers for Google Images and NASA Website Images

Reposted article · 2019-04-19 15:08 · Deep Learning

1. Crawling images from the NASA website by keyword

First, analyze the website to be crawled: enter a keyword and search for the content you need.

Each keyword request makes the page load 20 thumbnails at a time, and the thumbnail URLs are easy to find in the page source.
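As a rough illustration of this step, the sketch below pulls the per-result addresses out of a results page. The HTML sample and the names `UrlDivCollector` and `detail_page_urls` are invented for illustration; only the `div` with class `url` comes from the code later in this post, and the real page markup may differ. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages.

```python
from html.parser import HTMLParser


class UrlDivCollector(HTMLParser):
    """Collect the text of every <div class="url"> element."""

    def __init__(self):
        super().__init__()
        self._depth = 0   # > 0 while inside a matching div
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            # Nested div inside a matching div: track it so we close correctly.
            self._depth += 1
        elif "url" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self.links.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.links[-1] += data


def detail_page_urls(html_text):
    """Return one https:// URL per non-empty <div class="url"> on the page."""
    collector = UrlDivCollector()
    collector.feed(html_text)
    collector.close()
    return ["https://" + text.strip() for text in collector.links if text.strip()]


# Hypothetical results page, not copied from the real site.
SAMPLE_RESULTS = """
<div class="result"><img src="https://example.com/thumb/1.jpg">
  <div class="url">example.com/detail/1</div></div>
<div class="result"><img src="https://example.com/thumb/2.jpg">
  <div class="url"> example.com/detail/2 </div></div>
"""

print(detail_page_urls(SAMPLE_RESULTS))
```

Collecting the links into a list first, rather than downloading inside the loop, makes it easy to check how many results a page really returned before any images are fetched.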

Clicking a thumbnail then opens another page, from which the URL of the higher-resolution image can be read.
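The code later in this post reads that URL from a JSON-LD `script` block on the detail page. A minimal sketch of the lookup follows, assuming the same `@graph` / `image` / `url` layout as that code. The sample page and the names `JsonLdCollector` and `extract_image_url` are hypothetical, and the standard library's `html.parser` is used here in place of BeautifulSoup.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False


def extract_image_url(html_text):
    """Return the first image URL found under '@graph', or None if absent."""
    collector = JsonLdCollector()
    collector.feed(html_text)
    collector.close()
    for block in collector.blocks:
        try:
            doc = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed JSON instead of crashing the crawl
        if not isinstance(doc, dict):
            continue
        for node in doc.get("@graph", []):
            image = node.get("image") if isinstance(node, dict) else None
            if isinstance(image, dict) and image.get("url"):
                return image["url"]
    return None


# Hypothetical detail page, invented for illustration only.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@graph": [{"@type": "Article",
             "image": {"@type": "ImageObject",
                       "url": "https://example.com/images/sample_full.jpg"}}]}
</script>
</head><body></body></html>
"""

print(extract_image_url(SAMPLE_PAGE))
```

Returning `None` for pages without the expected block lets the caller skip a result rather than stop on the first page that is laid out differently.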

Finally, download the original image from the URL obtained. The source code is attached below.


# -*- coding: utf-8 -*-
import json
import os

import requests
from bs4 import BeautifulSoup

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:53.0) Gecko/20100101 Firefox/53.0'
HEADERS = {'User-Agent': USER_AGENT}


def getUrl(keyword):
    results = requests.get("https://nasasearch.nasa.gov/search/images",
                           params={'affiliate': 'nasa', 'query': keyword},
                           headers=HEADERS, timeout=9)
    results.encoding = 'utf-8'
    soup = BeautifulSoup(results.text, 'lxml')
    # Each div with class "url" holds the address of one detail page
    for link in soup.find_all('div', class_='url'):
        # Build the detail-page URL and fetch it
        try:
            html = requests.get('https://' + link.text.strip(),
                                headers=HEADERS, timeout=9)
        except requests.exceptions.RequestException:
            print('Network error')
            continue
        soup1 = BeautifulSoup(html.text, 'lxml')
        # The image metadata lives in a JSON-LD script block
        data = soup1.find('script', attrs={"type": "application/ld+json"})
        if data is None:
            continue
        try:
            # Convert the JSON string to a dict and read the image URL from it
            jsonobj = json.loads(data.string or '')
            imageUrl = jsonobj['@graph'][0]['image']['url']
        except (ValueError, KeyError, IndexError, TypeError):
            continue
        # Use the file name without its extension as the image name
        name = imageUrl.split('/')[-1].split('.')[0]
        downloadImage(imageUrl, name)


def downloadImage(imageUrl, name):
    path = 'D:/space/'
    # Create the output directory if it does not exist yet
    os.makedirs(path, exist_ok=True)
    print(name)
    if imageUrl is not None:
        try:
            image_file = requests.get(imageUrl, headers=HEADERS, timeout=9)
        except requests.exceptions.RequestException:
            print('Network error')
        else:
            if image_file.status_code != requests.codes.ok:
                print('Could not fetch {}'.format(imageUrl))
                return
            image_file_path = '{}{}.jpg'.format(path, name)
            print('Downloading: ' + '{}.jpg'.format(name))
            with open(image_file_path, 'wb') as f:
                f.write(image_file.content)
            print('Download complete!')


if __name__ == "__main__":
    keyword = input()
    getUrl(keyword)
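The script above saves every file with a `.jpg` suffix and loads each image fully into memory. As one possible variation, the sketch below keeps the extension found in the URL and streams the response to disk. It is not the original author's method: the helper names `filename_from_url` and `download_image` are hypothetical, and the standard library's `urllib.request` is swapped in for `requests` so that the example has no third-party dependencies.

```python
import os
import posixpath
import shutil
import urllib.request
from urllib.parse import unquote, urlsplit


def filename_from_url(image_url, default_ext=".jpg"):
    """Derive a local file name from the URL path, keeping its extension."""
    # urlsplit().path drops the ?query and #fragment parts of the URL.
    path = unquote(urlsplit(image_url).path)
    base = posixpath.basename(path) or "image"
    stem, ext = posixpath.splitext(base)
    return stem + (ext.lower() or default_ext)


def download_image(image_url, out_dir, timeout=9):
    """Stream image_url into out_dir and return the path of the saved file."""
    os.makedirs(out_dir, exist_ok=True)
    target = os.path.join(out_dir, filename_from_url(image_url))
    request = urllib.request.Request(image_url,
                                     headers={"User-Agent": "Mozilla/5.0"})
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses,
    # so the target file is only created after a successful response.
    with urllib.request.urlopen(request, timeout=timeout) as response, \
            open(target, "wb") as f:
        shutil.copyfileobj(response, f)
    return target


print(filename_from_url("https://example.com/a/b/earth_view.PNG?size=large"))
print(filename_from_url("https://example.com/a/b/mars"))
```

A call such as `download_image(imageUrl, 'D:/space/')` could then replace the body of `downloadImage`, with the caller catching `urllib.error.URLError` for network failures.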

2. Crawling Google Images

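This part mainly relies on an open-source crawler, available from its author's GitHub repository: https://github.com/YoongiKim/AutoCrawler
The crawler works quite well, and the author explains how to use it in detail on the project's home page.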
