A summary of image-scraping methods for the Google, Baidu, Yahoo, and Bing search engines

This article is a repost. 2020-04-27 17:25 · Crawlers / Python development

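Basic usage of icrawler

Built-in crawlers

The framework ships with six built-in image crawlers:

Google
Bing
Baidu
Flickr
Generic website image crawler (greedy)
UrlList (downloads images from a given list of URLs)

Below are examples of using the built-in crawlers. The search-engine crawlers share a similar interface.

Step 1: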

pip install icrawler

Step 2:

from icrawler.builtin import BaiduImageCrawler 
from icrawler.builtin import BingImageCrawler 
from icrawler.builtin import GoogleImageCrawler 
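"""
parser_threads: number of parser threads, at most the number of CPUs
downloader_threads: number of downloader threads, at most the number of CPUs
storage: where to save the images, given as a dict whose key is root_dir
keyword: the keyword you would type into the search box
max_num: maximum number of images to download
"""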

# Google image crawler
google_storage = {'root_dir': '/Users/suosuo/Desktop/icrawler學習/google'}
google_crawler = GoogleImageCrawler(parser_threads=4, 
                                   downloader_threads=4, 
                                   storage=google_storage)
google_crawler.crawl(keyword='beauty', 
                     max_num=10)


# Bing image crawler
bing_storage = {'root_dir': '/Users/suosuo/Desktop/icrawler學習/bing'}
bing_crawler = BingImageCrawler(parser_threads=2,
                                downloader_threads=4, 
                                storage=bing_storage)
bing_crawler.crawl(keyword='beauty',
                   max_num=10)


# Baidu image crawler
baidu_storage = {'root_dir': '/Users/suosuo/Desktop/icrawler學習/baidu'}

baidu_crawler = BaiduImageCrawler(parser_threads=2,
                                  downloader_threads=4,
                                  storage=baidu_storage)
baidu_crawler.crawl(keyword='美女', 
                    max_num=10)

Note: Google has updated its page, so the Google crawler above does not work for the time being.
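The three crawlers above are configured the same way, so the shared arguments can be built in one place. The following is only a minimal sketch: `crawler_options` and the `images` base directory are hypothetical names, not part of icrawler, and the CPU cap simply follows the parameter notes above rather than anything checked against the library.

```python
import os


def crawler_options(base_dir, engine, parser_threads=2, downloader_threads=4):
    """Build keyword arguments for a built-in crawler (hypothetical helper)."""
    cpu_count = os.cpu_count() or 1
    return {
        # Keep each thread count between 1 and the number of CPUs.
        'parser_threads': max(1, min(parser_threads, cpu_count)),
        'downloader_threads': max(1, min(downloader_threads, cpu_count)),
        # One sub-directory per search engine.
        'storage': {'root_dir': os.path.join(base_dir, engine)},
    }


if __name__ == '__main__':
    options = crawler_options('images', 'bing')
    print(options)
    # With icrawler installed, the dict could then be unpacked like this:
    # from icrawler.builtin import BingImageCrawler
    # BingImageCrawler(**options).crawl(keyword='beauty', max_num=10)
```

Keeping the options in a plain dict makes it easy to switch between search engines by changing only the class and the `engine` name.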

GreedyImageCrawler

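If you want to crawl images from a website that is not one of those listed above, use the greedy image crawler class and give it the target URL.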

from icrawler.builtin import GreedyImageCrawler

storage= {'root_dir': '/Users/suosuo/Desktop/icrawler學習/greedy'}
greedy_crawler = GreedyImageCrawler(storage=storage)
greedy_crawler.crawl(domains='http://desk.zol.com.cn/bizhi/7176_88816_2.html', 
                     max_num=6)

UrlListCrawler

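If you already have the download URLs of the images, you can use UrlListCrawler directly. It can download with multiple threads, which fetches the target data quickly.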

from icrawler.builtin import UrlListCrawler

storage={'root_dir': '/Users/suosuo/Desktop/icrawler學習/urllist'}
urllist_crawler = UrlListCrawler(downloader_threads=4, 
                                 storage=storage)

# Pass in the txt file that contains the URLs.
urllist_crawler.crawl('url_list.txt')
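The example above expects `url_list.txt` to exist already. A minimal sketch of producing it from a Python list follows, assuming the file holds one URL per line; `write_url_list` is a hypothetical helper and the example.com addresses are placeholders.

```python
import os
import tempfile


def write_url_list(urls, path):
    """Write URLs to a text file, one per line, skipping blanks and duplicates."""
    seen = set()
    kept = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            kept.append(url)
    with open(path, 'w', encoding='utf-8') as f:
        f.writelines(url + '\n' for url in kept)
    return kept


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp_dir:
        list_path = os.path.join(tmp_dir, 'url_list.txt')
        kept = write_url_list(
            ['https://example.com/a.jpg', 'https://example.com/a.jpg', 'https://example.com/b.png'],
            list_path,
        )
        print('{} URLs written to {}'.format(len(kept), list_path))
```

Removing duplicates before the crawl avoids downloading the same image twice.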

Details: https://www.ctolib.com/topics-125069.html

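Scraping Google

As mentioned above, Google images cannot be fetched with icrawler, so here is another method.

Link: https://pan.baidu.com/s/1gunLzHq4B-d-oPorzHiU3g
Extraction code: e5na

First download the package from my network drive to your machine, then look at its directory structure. Note that scraping Google relies on selenium; you need to install that package and set up its environment yourself.

Run it and take a look at the interface: it comes with a GUI.

If anything is still unclear, see the project on GitHub: https://github.com/sczhengyabin/Image-Downloader. I have tested it and it works.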

Scraping Yahoo

import requests
import re
import os



class Yahoo_spider():
    def __init__(self):
        self.headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.90 Safari/537.36",
        }
        self.keyword = "詐騙信息"
        self.path = ''
        self.count = 0
        self.max_page = 700

    def start_one_parse(self):
        """The first page is served as static HTML."""
        print('Fetching page 1')
        url = 'https://images.search.yahoo.com/search/images;_ylt=Awr9FqqfDqVeFy4AT7SJzbkF?p='+ self.keyword +'&fr2=p%3As%2Cv%3Ai'
        response = requests.get(url = url,headers = self.headers)
        pic_list  = re.findall(r'data-src=\'(.*?)\'', response.text, re.S)
        self.save_pic(pic_list)


    def start_two_parse(self):
        """From the second page on, results come from a dynamic JSON endpoint."""
        for page in range(61,self.max_page,61):
            print('Fetching page {}'.format(page // 61 + 1))
            url = "https://images.search.yahoo.com/search/images?fr2=p%3As%2Cv%3Ai&o=js&p="+self.keyword+"&tmpl=&nost=1&b="+str(page)+"&iid=Y.2&ig=0afd1ebeedac47e896000000003f9432&rand=1587876278637"
            response = requests.get(url=url, headers=self.headers).json()['html']
            pic_list = re.findall(r'data-src=\'(.*?)\'', response, re.S)
            self.save_pic(pic_list)


    def save_pic(self,pic_list):
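        """Save each image in pic_list to disk."""
        for pic in pic_list:
            try:
                self.count += 1
                url = pic.replace('&w=300&h=300', '&w=10000&h=10000')  # request a larger size
                pic_path = os.path.join(self.path, 'yahoo_' + str(self.count) + '.jpg')
                response = requests.get(url=url, headers=self.headers, timeout=10)
                with open(pic_path, 'wb') as f:
                    f.write(response.content)
                print('No. {}, image URL: {}, saved'.format(self.count, url))
            except (requests.RequestException, OSError) as exc:
                # Skip images that fail to download or save, but report why.
                print('No. {}, failed: {}'.format(self.count, exc))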


if __name__ == '__main__':
    spider = Yahoo_spider()
    spider.start_one_parse()
    spider.start_two_parse()

Yahoo is not difficult; reading through the code is enough to understand it.
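The parsing and paging logic of the spider can be tried out without any network access. The sketch below reuses the article's `data-src` pattern, size replacement, and step of 61. The helper names are hypothetical and the sample HTML is made up. `build_page_url` drops the session-specific `iid`, `ig`, and `rand` parameters, and it is untested whether Yahoo accepts a request without them.

```python
import re
from urllib.parse import urlencode

# Same pattern as the spider: single-quoted data-src attributes.
DATA_SRC = re.compile(r"data-src='(.*?)'", re.S)


def extract_image_urls(html):
    """Return every data-src value found in an HTML fragment."""
    return DATA_SRC.findall(html)


def enlarge(url):
    """Swap the 300x300 thumbnail size for a much larger request."""
    return url.replace('&w=300&h=300', '&w=10000&h=10000')


def page_offsets(max_offset, step=61):
    """Offsets passed as the b parameter for the second page onward."""
    return list(range(step, max_offset, step))


def build_page_url(keyword, offset):
    """Build a paged query URL with the keyword percent-encoded (hypothetical helper)."""
    query = urlencode({'fr2': 'p:s,v:i', 'o': 'js', 'p': keyword, 'nost': 1, 'b': offset})
    return 'https://images.search.yahoo.com/search/images?' + query


if __name__ == '__main__':
    sample = ("<img data-src='https://example.com/a.jpg&w=300&h=300'>"
              "<img data-src='https://example.com/b.jpg'>")
    for link in extract_image_urls(sample):
        print(enlarge(link))
    print(page_offsets(700))
    print(build_page_url('cat dog', 61))
```

Building the query with `urlencode` instead of string concatenation keeps keywords that contain spaces or non-ASCII characters valid in the URL.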
