Python magic module: file type identification


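Preface: I ran into the magic module at work, where I needed to determine file types. A quick search turned up the built-in mimetypes module and the third-party filetype package. Compared with mimetypes, which guesses from the file name, the more reliable approach is to use the magic package.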
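To make the comparison concrete, here is a small sketch of what the built-in mimetypes module does. It looks only at the file name, so the file does not even have to exist. The helper name guess_by_extension is made up for this illustration.

```python
import mimetypes


def guess_by_extension(filename):
    """Return the MIME type that mimetypes infers from the name alone (or None)."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type


# The file is never opened, so a PNG renamed to "report.pdf"
# would still be reported as a PDF.
print(guess_by_extension("report.pdf"))    # application/pdf
print(guess_by_extension("no_extension"))  # None
```

This is why a content-based check is the safer choice when the file name cannot be trusted.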

magic

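magic is a wrapper around the libmagic file identification library. libmagic is a library that identifies file types from their headers, which is what makes file type detection possible. In Django, it can also be used to verify that the detected MIME type matches UploadedFile.content_type.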
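The idea of identifying a file from its header can be sketched with nothing but the standard library. This is only a toy with three well-known signatures, not how libmagic is implemented (its database is far larger and its rules are richer). The names SIGNATURES and sniff are made up for this illustration.

```python
# A few well-known leading byte sequences ("magic numbers").
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\x1f\x8b", "application/gzip"),
]


def sniff(head):
    """Return a MIME type for the leading bytes, or None if nothing matches."""
    for prefix, mime_type in SIGNATURES:
        if head.startswith(prefix):
            return mime_type
    return None


print(sniff(b"%PDF-1.2\n..."))  # application/pdf
print(sniff(b"plain text"))     # None
```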

libmagic

  • Usage:
    • import magic
      
      detected = magic.detect_from_filename('magic.py')
      print('Detected MIME type: {}'.format(detected.mime_type))
      print('Detected encoding: {}'.format(detected.encoding))
      print('Detected file type name: {}'.format(detected.name))

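When determining a file's MIME type, the tool of choice is simply called file, and its back end is called libmagic. You need this tool if you want to use any of the libmagic bindings with Python, and it already comes with its own Python bindings, called file-magic. The file-magic bindings are included with file. Both file-magic and python-magic provide a module named magic, and if both are installed, the Python module magic will refer to the former.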
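Since both bindings import as magic, it can help to check which one you actually got. The sketch below is a heuristic based only on the function names used in this post (from_file and from_buffer for python-magic, detect_from_filename for file-magic). The helper which_binding is hypothetical, and the stand-in objects exist only so the example runs without libmagic installed.

```python
import types


def which_binding(module):
    """Guess which libmagic binding a module object named `magic` comes from."""
    if hasattr(module, "from_file") and hasattr(module, "from_buffer"):
        return "python-magic"
    if hasattr(module, "detect_from_filename"):
        return "file-magic"
    return "unknown"


# Stand-in objects so the sketch runs without libmagic installed.
fake_python_magic = types.SimpleNamespace(from_file=None, from_buffer=None)
fake_file_magic = types.SimpleNamespace(detect_from_filename=None)
print(which_binding(fake_python_magic))  # python-magic
print(which_binding(fake_file_magic))    # file-magic
```

In real use you would call which_binding(magic) right after import magic.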

python-magic

  • module name: magic
  • pypi: python-magic
  • source: https://github.com/ahupp/python-magic
  • install:
    • pip install python-magic  # on Windows this depends on python-magic-bin: pip install python-magic-bin
  • usage:
    • >>> import magic
      >>> magic.from_file("testdata/test.pdf")
      'PDF document, version 1.2'
      # recommend using at least the first 2048 bytes, as less can produce incorrect identification
      >>> magic.from_buffer(open("testdata/test.pdf", "rb").read(2048))
      'PDF document, version 1.2'
      >>> magic.from_file("testdata/test.pdf", mime=True)
      'application/pdf'
      >>> f = magic.Magic(uncompress=True)
      >>> f.from_file('testdata/test.gz')
      'ASCII text (gzip compressed data, was "test", last modified: Sat Jun 28 21:32:52 2008, from Unix)'
      >>> f = magic.Magic(mime=True, uncompress=True)
      >>> f.from_file('testdata/test.gz')
      'text/plain'
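The mime=True form is what makes the Django check mentioned earlier possible. Below is a minimal sketch, assuming the upload object offers read(), seek() and content_type as Django's UploadedFile does. The function content_type_matches, its detect_mime parameter and the FakeUpload class are hypothetical names for this illustration, and the demo uses a stub detector so it runs without libmagic.

```python
import io


def content_type_matches(uploaded, detect_mime=None, head_size=2048):
    """Compare the client-declared content_type with the sniffed MIME type."""
    if detect_mime is None:
        import magic  # python-magic; imported lazily so the sketch loads without it

        def detect_mime(data):
            return magic.from_buffer(data, mime=True)

    head = uploaded.read(head_size)
    uploaded.seek(0)  # rewind so later code can still read the whole upload
    return detect_mime(head) == uploaded.content_type


class FakeUpload(io.BytesIO):
    """Hypothetical stand-in for Django's UploadedFile, for demonstration only."""
    content_type = "application/pdf"


upload = FakeUpload(b"%PDF-1.2 ...")
# A stub detector keeps the demo independent of libmagic.
print(content_type_matches(upload, detect_mime=lambda data: "application/pdf"))  # True
```

With python-magic installed, leaving detect_mime at its default makes the function call magic.from_buffer(..., mime=True) on the first 2048 bytes, following the advice in the usage listing.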

filemagic

This library overlaps somewhat with file-magic, the binding that is included with libmagic.

python-magic example

import magic

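# 1: identify the type from the first 2048 bytes (binary mode, file closed afterwards)
with open("file_types/Bs.tar.gz", "rb") as fh:
    file_type = magic.from_buffer(fh.read(2048))
# 2: or identify from the path and return the MIME type instead of a description
mime_type = magic.from_file("file_types/Bs.tar.gz", mime=True)

# 3: use a Magic instance that looks inside the compressed file
f = magic.Magic(uncompress=True)
ff = f.from_file("file_types/Bs.tar.gz")

print(file_type, mime_type, ff)
# the description from #1 looks like:
# gzip compressed data, last modified: Tue Dec 10 08:46:57 2019, from Unix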

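I prefer the Magic class. Magic is a wrapper around the libmagic C library. It is more powerful and more direct: it lets you specify the magic database, and it can detect the encoding (mime_encoding).

However, others have warned that it is not recommended for general use. In particular, it is not safe to share across multiple threads and will fail if you try. I haven't dug into this yet, but we can start by looking at how magic calls the Magic class: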

def _get_magic_type(mime):
    i = _instances.get(mime)
    if i is None:
        i = _instances[mime] = Magic(mime=mime)
    return i

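As you can see, when no cached instance exists for the given mime value, the module-level functions in magic still create a Magic instance, so safety and feasibility remain an open question. As a beginner, I'll come back and dig into this once I've gained some more experience.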
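One common way to avoid sharing an object between threads is to give each thread its own instance. The sketch below shows that pattern with threading.local. It is a general technique, not something python-magic prescribes, and get_thread_magic and its factory parameter are hypothetical names. A plain object stands in for the Magic instance so the example runs without libmagic.

```python
import threading

_thread_state = threading.local()


def get_thread_magic(factory):
    """Return this thread's own instance, creating it on first use."""
    instance = getattr(_thread_state, "magic", None)
    if instance is None:
        instance = _thread_state.magic = factory()
    return instance


# `object` stands in for something like `lambda: magic.Magic(mime=True)`.
main_instance = get_thread_magic(object)
seen = []
worker = threading.Thread(target=lambda: seen.append(get_thread_magic(object)))
worker.start()
worker.join()
print(get_thread_magic(object) is main_instance)  # True: reused within a thread
print(seen[0] is main_instance)                   # False: each thread gets its own
```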

try:
    # file-magic style API: open a handle with no flags, then load the default database
    ms = magic.open(magic.MAGIC_NONE)
    ms.load()
except Exception:
    ms = None

  Some of magic's constants:

MAGIC_NONE = 0x000000 # No flags
MAGIC_DEBUG = 0x000001 # Turn on debugging
MAGIC_SYMLINK = 0x000002 # Follow symlinks
MAGIC_COMPRESS = 0x000004 # Check inside compressed files
MAGIC_DEVICES = 0x000008 # Look at the contents of devices
MAGIC_MIME = 0x000010 # Return a mime string
MAGIC_MIME_ENCODING = 0x000400 # Return the MIME encoding
MAGIC_CONTINUE = 0x000020 # Return all matches
MAGIC_CHECK = 0x000040 # Print warnings to stderr
MAGIC_PRESERVE_ATIME = 0x000080 # Restore access time on exit
MAGIC_RAW = 0x000100 # Don't translate unprintable chars
MAGIC_ERROR = 0x000200 # Handle ENOENT etc as real errors

MAGIC_NO_CHECK_COMPRESS = 0x001000 # Don't check for compressed files
MAGIC_NO_CHECK_TAR = 0x002000 # Don't check for tar files
MAGIC_NO_CHECK_SOFT = 0x004000 # Don't check magic entries
MAGIC_NO_CHECK_APPTYPE = 0x008000 # Don't check application type
MAGIC_NO_CHECK_ELF = 0x010000 # Don't check for elf details
MAGIC_NO_CHECK_ASCII = 0x020000 # Don't check for ascii files
MAGIC_NO_CHECK_TROFF = 0x040000 # Don't check ascii/troff
MAGIC_NO_CHECK_FORTRAN = 0x080000 # Don't check ascii/fortran
MAGIC_NO_CHECK_TOKENS = 0x100000 # Don't check ascii/tokens
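Each constant occupies its own bit, so several can be combined with bitwise OR into the single flags value that a call such as magic.open(...) takes. The sketch below redefines a few values from the listing so it runs on its own. FLAG_NAMES and describe_flags are helper names made up for this illustration.

```python
# Values copied from the listing above.
MAGIC_NONE = 0x000000
MAGIC_SYMLINK = 0x000002
MAGIC_COMPRESS = 0x000004
MAGIC_MIME = 0x000010

FLAG_NAMES = {
    MAGIC_SYMLINK: "MAGIC_SYMLINK",
    MAGIC_COMPRESS: "MAGIC_COMPRESS",
    MAGIC_MIME: "MAGIC_MIME",
}


def describe_flags(flags):
    """List the names of the known flags that are set in a bitmask."""
    return [name for bit, name in sorted(FLAG_NAMES.items()) if flags & bit]


flags = MAGIC_MIME | MAGIC_COMPRESS
print(hex(flags))             # 0x14
print(describe_flags(flags))  # ['MAGIC_COMPRESS', 'MAGIC_MIME']
```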

 

 

Related articles:

File identification in cuckoo: https://www.cnblogs.com/viwilla/p/5051896.html

The code that determines the file format:

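# HAVE_MAGIC is not shown in the excerpt; it is assumed to come from an import guard like this.
try:
    import magic
    HAVE_MAGIC = True
except ImportError:
    HAVE_MAGIC = False


def _get_filetype(self, data):
    """Gets filetype, uses libmagic if available.
    @param data: data to be analyzed.
    @return: file type or None.
    """
    if not HAVE_MAGIC:
        return None

    ms = None  # defined up front so the finally block can test it safely
    try:
        # first try the file-magic style API
        ms = magic.open(magic.MAGIC_NONE)
        ms.load()
        file_type = ms.buffer(data)
    except Exception:
        # fall back to the python-magic style API
        try:
            file_type = magic.from_buffer(data)
        except Exception:
            return None
    finally:
        if ms is not None:
            try:
                ms.close()
            except Exception:
                pass

    return file_type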

A nice example I came across: https://www.cnblogs.com/17bdw/p/10042549.html

Reference: https://stackoverflow.com/questions/43580/how-to-find-the-mime-type-of-a-file-in-python

