Notes 4: Extracting image EXIF information with Python

This article is a repost. 2019-08-23 20:58

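I. Main idea:

(1). Find all the image tags on the target page
Fetch the HTML content of the page from its URL, then
parse it with BeautifulSoup into a tree of HTML elements
and look up every image (img) tag.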
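
Step (1) can be tried offline. The following is a minimal Python 3 sketch, not the script's own code: it swaps BeautifulSoup for the standard library's html.parser.HTMLParser so that it runs without extra packages. The class name ImgCollector and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lower-cased
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


# Hypothetical page content standing in for a downloaded HTML document
sample_html = """
<html><body>
  <img src="/static/a.jpg" alt="first">
  <p>text</p>
  <IMG SRC="http://example.com/b.png">
  <img alt="no source">
</body></html>
"""

collector = ImgCollector()
collector.feed(sample_html)
print(collector.sources)
```

Tags without a src attribute are skipped here, which roughly corresponds to the failure case the script handles by returning an empty name.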

(2). Download the images
Extract the src attribute from each tag to get the image address, then download the image.
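
A src attribute is often a relative path, which cannot be downloaded as-is. As a hedged Python 3 sketch (the helper name image_target and the example URLs are illustrative and not part of the original script), the address could be resolved against the page URL with urllib.parse.urljoin before deriving a local file name:

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def image_target(page_url, src):
    """Return (absolute_url, local_file_name) for an <img> src value."""
    absolute = urljoin(page_url, src)  # resolves relative paths against the page
    # The last segment of the URL path, without the query string
    name = posixpath.basename(urlsplit(absolute).path)
    return absolute, name


page = "http://example.com/gallery/index.html"
relative = image_target(page, "img/cat.jpg")
rooted = image_target(page, "/static/dog.png?v=2")
external = image_target(page, "http://cdn.example.com/x.gif")

print(relative)
print(rooted)
print(external)
```

The absolute URL is what would be passed to the download call, and the file name is what the image would be saved as.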

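(3). Extract the EXIF metadata
Use the appropriate library to extract the EXIF information from the image, iterate over it, and store it in a dictionary.
Two checks are needed: whether EXIF information exists at all (some images yield none), and whether GPSInfo exists (it is sometimes
lost during compression, or was never there). If either check fails, delete the image.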

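(4). Delete the image
Use os.remove. As long as the path to the file is known, the file can be deleted.
In fact, the os module can automate many tasks on both Windows and Linux.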
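
A minimal Python 3 sketch of the deletion step, assuming it is preferable to check that the file exists before calling os.remove. The helper name remove_if_exists is hypothetical, and a temporary file stands in for a downloaded image.

```python
import os
import tempfile


def remove_if_exists(path):
    """Delete path and return True, or return False if it is not a file."""
    if os.path.isfile(path):
        os.remove(path)
        return True
    return False


# Create a throwaway file so the example does not touch real data
fd, tmp_path = tempfile.mkstemp(suffix=".jpg")
os.close(fd)

first = remove_if_exists(tmp_path)   # the file exists and gets deleted
second = remove_if_exists(tmp_path)  # already gone, nothing happens
print(first, second)
```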

3. Summary of modules and methods used

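urlparse module
This module defines a standard interface for breaking Uniform Resource Locator (URL) strings into components (addressing scheme, network location, path, etc.), for combining the components back into a URL string, and for converting a "relative URL" into an absolute URL given a "base URL". In Python 3 the same functions live in urllib.parse.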

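urlsplit function, similar to urlparse
urlparse parses a URL into six components and returns a 6-tuple, matching the general structure of a URL: scheme://netloc/path;parameters?query#fragment. urlsplit does the same but does not split the parameters from the path, so it returns a 5-tuple: (scheme, netloc, path, query, fragment). Each tuple item is a string, possibly empty. The components are not broken up into smaller parts (for example, the network location is a single string), and % escapes are not expanded. The delimiters shown above are not part of the result, except for a leading slash in the path component, which is retained if present.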
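
The difference between the two functions can be checked directly. This is a small Python 3 sketch using urllib.parse (the Python 3 home of the urlparse module); the URL is an invented example.

```python
from urllib.parse import urlparse, urlsplit

url = "http://example.com/images/photo.jpg;type=a?size=large#top"

parsed = urlparse(url)  # 6 components, parameters split off the path
split = urlsplit(url)   # 5 components, parameters stay in the path

print(len(parsed), parsed.path, parsed.params)
print(len(split), split.path)
print(split[1], split[2])  # the indexes the script uses: netloc and path
```

Index 1 (the network location) and index 2 (the path) mean the same thing in both results, which is why the script can index the urlsplit result with [1] and [2].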

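os.path.basename(path)
Return the base name of pathname path. This is the second element of the pair returned by os.path.split(path). Note that the result differs from the Unix basename program: where basename for '/foo/bar/' returns 'bar', the basename() function returns an empty string ('').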
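
The trailing-slash behaviour is easy to verify. The sketch below uses posixpath, the POSIX flavour of os.path, so that the results are the same on any operating system; the rstrip workaround at the end is only one possible way to get the Unix-style answer.

```python
import posixpath  # the POSIX implementation behind os.path on Linux and macOS

plain = posixpath.basename("/foo/bar")
trailing = posixpath.basename("/foo/bar/")
pair = posixpath.split("/foo/bar/")

print(repr(plain))
print(repr(trailing))
print(pair)

# Stripping the trailing slash first gives the Unix-style answer
unix_style = posixpath.basename("/foo/bar/".rstrip("/"))
print(unix_style)
```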

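Beautiful Soup
A Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the document. Beautiful Soup can save you hours or even days of work.
Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html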

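PIL library (official documentation linked below). It is usually used for image processing; this program calls the _getexif() method of the opened image to extract the EXIF data, but that only works for jpg and jpeg images,
and the extension is matched case-sensitively, so a suffix such as .JPG is not recognized.
http://effbot.org/imagingbook/
ExifTags.TAGS (used as TagName = TAGS.get(tag, tag))
is a dictionary. As such, you can get the value for a given key with TAGS.get(key). If the key does not exist, you can have it return a default value by passing a second argument: TAGS.get(key, val).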
Source: http://www.tutorialspoint.com/python/dictionary_get.htm
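
To illustrate the TAGS.get(tag, tag) pattern without installing PIL, the Python 3 sketch below uses a small stand-in table. SAMPLE_TAGS contains only three of the standard EXIF tag ids, and raw_exif is invented data; the real PIL.ExifTags.TAGS dictionary is much larger.

```python
# Stand-in for PIL.ExifTags.TAGS: a tiny subset of the numeric-id -> name table
SAMPLE_TAGS = {
    0x010F: "Make",
    0x0110: "Model",
    0x8825: "GPSInfo",
}

# Hypothetical raw EXIF data keyed by numeric tag id
raw_exif = {0x010F: "ExampleCam", 0x8825: {1: "N"}, 0xFFFF: "mystery"}

exif_data = {}
for tag, value in raw_exif.items():
    # Known ids become names; unknown ids fall back to the id itself
    tag_name = SAMPLE_TAGS.get(tag, tag)
    exif_data[tag_name] = value

has_gps = bool(exif_data.get("GPSInfo"))
print(exif_data)
print("has GPS:", has_gps)
```

Passing the tag itself as the default means an unknown id is kept under its numeric key instead of being dropped or raising KeyError.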

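pip install exifread. This program uses it to handle the EXIF information of png images, through the
exifread.process_file(imageFile) method.
The call tags = exifread.process_file(fd) reads the image's EXIF information; the returned data looks like this: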
{'Image ImageLength': (0x0101) Short=3024 @ 42,
......
'Image GPSInfo': (0x8825) Long=792 @ 114,
'Thumbnail JPEGInterchangeFormat': (0x0201) Long=928 @ 808,
......
}
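
The keys in this dictionary follow an "<IFD> <TagName>" pattern, so GPS data can be picked out by key prefix. The Python 3 sketch below works on a hand-written stand-in dictionary rather than real exifread output: the values are plain strings invented for illustration, and the helper name gps_entries is hypothetical.

```python
# Stand-in for the dictionary returned by exifread.process_file();
# the keys follow the "<IFD> <TagName>" pattern, the values are made up
tags = {
    "Image ImageLength": "3024",
    "Image GPSInfo": "792",
    "GPS GPSLatitudeRef": "N",
    "GPS GPSLatitude": "[31, 14, 2]",
    "Thumbnail JPEGInterchangeFormat": "928",
}


def gps_entries(tags):
    """Return only the entries that belong to the GPS IFD."""
    return {key: value for key, value in tags.items() if key.startswith("GPS ")}


found = gps_entries(tags)
has_pointer = "Image GPSInfo" in tags
print(sorted(found))
print("GPS pointer present:", has_pointer)
```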
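4. Errors and solutions
'str' object has no attribute 'read'
The argument passed was a string, while the function expects a file opened in binary mode.
Only the file name (the name used to open the file) was passed, not the file object itself.
'PngImageFile' object has no attribute '_getexif'
This error occurs because _getexif cannot extract from .png files.
Solution:
Read the file with the exifread module instead. In this program exifread is used only for .png files, although the module also reads JPEG and TIFF files.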
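
The first error above ('str' object has no attribute 'read') can be reproduced without exifread. In the Python 3 sketch below, first_bytes is a hypothetical stand-in for any function that, like exifread.process_file, calls read() on its argument; the temporary file only imitates the first bytes of a JPEG.

```python
import os
import tempfile


def first_bytes(fileobj, count=4):
    """Hypothetical reader that needs a file object, not a file name."""
    return fileobj.read(count)


# Throwaway file that starts with the JPEG start-of-image marker (FF D8)
fd, path = tempfile.mkstemp(suffix=".jpg")
with os.fdopen(fd, "wb") as out:
    out.write(b"\xff\xd8\xff\xe1rest-of-file")

message = ""
try:
    first_bytes(path)  # wrong: passes the file NAME, which is a str
except AttributeError as err:
    message = str(err)
    print(message)

with open(path, "rb") as handle:  # right: pass the opened binary file
    header = first_bytes(handle)
print(header)

os.remove(path)
```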

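Other bugs remain unresolved, but they do not affect basic use. I have to say,
the original author's code for this part is really poor and simply does not work on today's websites.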

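5. Summary and reflections
(1). In the end the program still did not output any EXIF information; some images may be encrypted, or
some may simply not store EXIF information.
(2). Images come in formats such as gif, JPG, and svg, and the program does not filter these strictly.
(3). Some websites have anti-scraping measures, so their images cannot be crawled, for example www.qq.com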
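
For point (2), one possible filter is to compare file extensions case-insensitively before attempting any EXIF extraction. This is a Python 3 sketch under assumptions: the set EXIF_CANDIDATES and the helper name may_contain_exif are illustrative choices, not part of the original script.

```python
import os

# Assumed set of extensions worth inspecting; adjust as needed
EXIF_CANDIDATES = {".jpg", ".jpeg", ".png"}


def may_contain_exif(file_name):
    """True if the extension, compared case-insensitively, is in the set."""
    ext = os.path.splitext(file_name)[1].lower()
    return ext in EXIF_CANDIDATES


names = ["a.JPG", "b.jpeg", "c.gif", "logo.svg", "d.PNG", "noext"]
kept = [name for name in names if may_contain_exif(name)]
print(kept)
```

Lower-casing the extension lets suffixes such as .JPG pass the check, while gif and svg files are skipped before any download or parsing effort is spent on them.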

II. Code

#!/usr/bin/python
# coding: utf-8

import os
import exifread
import urllib2
import optparse
from urlparse import urlsplit
from os.path import basename
from bs4 import BeautifulSoup
from PIL import Image
from PIL.ExifTags import TAGS



def findImages(url):  # find all image tags on the page
    print '[+] Finding images of '+str(urlsplit(url)[1])
    resp = urllib2.urlopen(url).read()
    soup = BeautifulSoup(resp,"lxml")
    imgTags = soup.findAll('img')
    return imgTags


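def downloadImage(imgTag):  # download the image referenced by the tag
    try:
        print '[+] Downloading image...'
        imgSrc = imgTag['src']
        imgContent = urllib2.urlopen(imgSrc).read()
        imgName = basename(urlsplit(imgSrc)[2])
        # 'with' closes the file even if the write fails
        with open(imgName,'wb') as f:
            f.write(imgContent)
        return imgName
    except Exception:
        # missing src, relative or unreachable URL, etc.: signal failure with ''
        return ''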

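def delFile(imgName):   # delete a downloaded file from the current directory
    # downloadImage() saves into the current working directory, so remove the
    # file by the same relative name instead of a hard-coded absolute path
    if imgName and os.path.exists(imgName):
        os.remove(imgName)
        print "[+] Del File"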

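def exifImage(imageName):  # extract the exif information; delete the image if there is none
    if not imageName:  # the download failed, nothing to inspect
        return
    Info = None  # stays None for unsupported extensions
    try:
        ext = imageName.split('.')[-1]
        if ext == 'png':
            with open(imageName,'rb') as imageFile:
                Info = exifread.process_file(imageFile)
        elif ext == 'jpg' or ext == 'jpeg':
            Info = Image.open(imageName)._getexif()
        exifData = {}
        if Info:
            # iterate over (tag, value) pairs; looping over the dict itself
            # yields only the keys and fails when unpacking them
            for (tag,value) in Info.items():
                TagName = TAGS.get(tag,tag)
                exifData[TagName] = value
            # PIL names the tag 'GPSInfo'; exifread names it 'Image GPSInfo'
            exifGPS = exifData.get('GPSInfo') or exifData.get('Image GPSInfo')
            if exifGPS:
                print '[+] GPS: '+str(exifGPS)
            else:
                print '[-] No GPS information'
                delFile(imageName)
        else:
            print '[-] Can\'t detect exif'
            delFile(imageName)
    except Exception as e:
        print e
        delFile(imageName)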


def main():
    parser = optparse.OptionParser('-u <target url>')
    parser.add_option('-u',dest='url',type='string',help='specify the target url')
    (options,args) = parser.parse_args()
    url = options.url

    if url is None:
        print parser.usage
        exit(0)

    imgTags = findImages(url)
    for imgTag in imgTags:
        imgFile = downloadImage(imgTag)
        exifImage(imgFile)


if __name__ == '__main__':
    main()
