Batch-processing Word files with python-docx: pictures

This article is a repost. Published 2018-10-30 13:36 in the category 技術交流之python.

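Pictures are a special kind of content in Word. This article covers how to use python-docx to batch-extract the pictures in a Word file, and how to insert pictures into a Word file.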

1. Extracting pictures from Word and saving them in a specified format

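python-docx does not seem to offer a direct way to get pictures, and there is little material online. The only useful reference I found was this one:
How to get an image (InlineShape) from a paragraph in python-docx
To be honest, I did not fully understand that article, and its method only retrieves inline pictures. What is an inline picture? I am not sure. All I know is that the pictures inserted directly in Word in my test file were not of this kind, so the method could not retrieve them, whereas a picture I inserted with add_picture() could be retrieved. Prompted by that article, I read through the python-docx source. I did not understand all of it, but I did pick up one useful piece of information: python-docx turns the Word file into Proxy Type objects for processing. What does a Proxy Type look like? In essence it is XML, and different kinds of content become different Proxy Types. Taking Document as an example, you can view the converted content with document._element.xml: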
[Screenshot: output of document._element.xml]
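That is what the Word content looks like after conversion to Proxy Type (I collapsed most of it). I have not studied XML much, but you can see that the tags have the form <w:x>. The whole document sits inside <w:document></w:document>, and each paragraph starts with <w:p> and ends with </w:p>. A picture in docx also lives in a paragraph, so we can find the paragraphs that contain pictures by traversing the whole XML. For that to work, a picture paragraph must have some distinguishing feature, otherwise we cannot tell it apart. Below is the Proxy Type content for one picture: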

<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" w:rsidR="00B20677" w:rsidRDefault="00D47C7B" w:rsidP="00ED22C2">
  <w:pPr>
    <w:rPr>
      <w:rFonts w:ascii="宋體" w:eastAsia="宋體" w:hAnsi="宋體"/>
      <w:lang w:eastAsia="zh-CN"/>
    </w:rPr>
  </w:pPr>
  <w:r>
    <w:rPr>
      <w:rFonts w:ascii="宋體" w:eastAsia="宋體" w:hAnsi="宋體"/>
      <w:lang w:eastAsia="zh-CN"/>
    </w:rPr>
    <w:pict>
      <v:shapetype id="_x0000_t75" coordsize="21600,21600" o:spt="75" o:preferrelative="t" path="m@4@5l@4@11@9@11@9@5xe" filled="f" stroked="f">
        <v:stroke joinstyle="miter"/>
        <v:formulas>
          <v:f eqn="if lineDrawn pixelLineWidth 0"/>
          <v:f eqn="sum @0 1 0"/>
          <v:f eqn="sum 0 0 @1"/>
          <v:f eqn="prod @2 1 2"/>
          <v:f eqn="prod @3 21600 pixelWidth"/>
          <v:f eqn="prod @3 21600 pixelHeight"/>
          <v:f eqn="sum @0 0 1"/>
          <v:f eqn="prod @6 1 2"/>
          <v:f eqn="prod @7 21600 pixelWidth"/>
          <v:f eqn="sum @8 21600 0"/>
          <v:f eqn="prod @7 21600 pixelHeight"/>
          <v:f eqn="sum @10 21600 0"/>
        </v:formulas>
        <v:path o:extrusionok="f" gradientshapeok="t" o:connecttype="rect"/>
        <o:lock v:ext="edit" aspectratio="t"/>
      </v:shapetype>
      <v:shape id="_x0000_i1025" type="#_x0000_t75" style="width:6in;height:214.5pt">
        <v:imagedata r:id="rId8" o:title="syh"/>
      </v:shape>
    </w:pict>
  </w:r>
</w:p>

As you can see, the picture information is contained in the <w:pict></w:pict> tag, so we can use this tag to locate picture paragraphs.
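
As a minimal sketch of that check, the snippet below tests whether a paragraph's XML string contains a <w:pict> inside a run. It uses only the standard library; the helper name has_pict, the NS mapping, and the two hand-written sample paragraphs are illustrative assumptions, not part of python-docx.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the <w:p> element shown above.
NS = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}


def has_pict(paragraph_xml):
    """Return True if any run (<w:r>) in the paragraph holds a <w:pict>."""
    root = ET.fromstring(paragraph_xml)
    return root.find("w:r/w:pict", NS) is not None


# Two tiny hand-written paragraphs for illustration (not from a real file).
with_picture = '<w:p xmlns:w="%s"><w:r><w:pict/></w:r></w:p>' % NS["w"]
text_only = '<w:p xmlns:w="%s"><w:r><w:t>hello</w:t></w:r></w:p>' % NS["w"]

print(has_pict(with_picture))  # True
print(has_pict(text_only))     # False
```

Passing a prefix-to-URI mapping to find() avoids having to extract the "{uri}" part of each tag by hand.
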
document has a part attribute, and part has a related_parts attribute, defined as follows:

@property
def related_parts(self):
    """ Dictionary mapping related parts by rId, so child objects can resolve explicit relationships present in the part XML, e.g. sldIdLst to a specific |Slide| instance. """
    return self.rels.related_parts

Now look at the definition of rels.related_parts:

@property
def related_parts(self):
    """ dict mapping rIds to target parts for all the internal relationships in the collection. """
    return self._target_parts_by_rId

self.rels.related_parts is a dictionary that maps an rId to the corresponding content, and this very attribute appears in the picture's Proxy Type content (the imagedata tag):

<v:shape id="_x0000_i1025" type="#_x0000_t75" style="width:6in;height:214.5pt">
     <v:imagedata r:id="rId8" o:title="syh"/>
</v:shape>

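As you can see, the rId for this picture is rId8. Running

doc.part.related_parts['rId8']

raised no error, and after I saved the result as an image file came the pleasant surprise: it was the content of that picture.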
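
To see which rIds point at pictures without guessing them one at a time, one option is to filter the dictionary by content type. The sketch below assumes each part exposes content_type and blob attributes; the helper name image_parts and the stand-in objects are hypothetical and exist only so the snippet runs without a .docx file.

```python
from types import SimpleNamespace


def image_parts(related_parts):
    """Return {rId: part} for the parts whose content type is an image."""
    return {
        rid: part
        for rid, part in related_parts.items()
        if part.content_type.startswith("image/")
    }


# Stand-ins for illustration only; with python-docx you would pass
# doc.part.related_parts instead of this hand-made dict.
fake_related_parts = {
    "rId1": SimpleNamespace(content_type="application/xml", blob=b"<x/>"),
    "rId8": SimpleNamespace(content_type="image/jpeg", blob=b"\xff\xd8"),
}

for rid, part in image_parts(fake_related_parts).items():
    print(rid, part.content_type, len(part.blob))
```

Note that this lists every image part related to the document, not only the ones referenced from a particular paragraph.
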
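To summarize the approach above, getting the pictures takes 3 steps:

Get the Proxy Type object of each paragraph, which is XML;
Traverse that XML; if a pict tag exists, the paragraph holds a picture, so keep traversing to get the rId;
Use related_parts to get the picture content.

The steps are described in detail below: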

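1.1 Get the `Proxy Type xml` data for each paragraph

# doc is a docx Document object that has already been opened
proxy = []
for p in doc.paragraphs:
    # p._element.xml is the paragraph's XML as a string
    proxy.append(p._element.xml)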

1.2 Traverse the `xml`, find the paragraphs containing pictures, and get the `rId`

import re
import xml.etree.ElementTree as ET

rIds = []  # collected across all paragraphs, not reset per paragraph
for p in proxy:
    # each paragraph is parsed into its own tree
    root = ET.fromstring(p)
    # namespace of <w:p>, in the "{uri}" form that ElementTree uses
    w_ns = re.match(r'\{\S+\}', root.tag).group(0)
    # get the <w:r> children; every <w:pict> is a child of a <w:r>
    pictrs = root.findall("%sr" % w_ns)
    picts = []
    for pictr in pictrs:
        # get all <w:pict> tags in this run
        pict = pictr.findall("%spict" % w_ns)
        # if <w:pict> exists
        if len(pict) > 0:
            picts.append(pict[0])
    for pict in picts:
        # the children of <w:pict> belong to the VML namespace
        v_ns = re.match(r'\{\S+\}', pict[0].tag).group(0)
        # get the <v:shape> tag
        shape = pict.findall("%sshape" % v_ns)[0]
        # the <v:imagedata> tag
        imagedata = shape.findall("%simagedata" % v_ns)
        rIds.append(imagedata[0].attrib['{http://schemas.openxmlformats.org/officeDocument/2006/relationships}id'])

P.S. This code is only understandable when read side by side with the XML.
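
A shorter variant of the same traversal is sketched below. It searches by fully qualified tag name instead of extracting namespaces with a regular expression, and it also looks for <a:blip r:embed="...">, which is how DrawingML pictures (the kind add_picture() writes) reference their image. The function name picture_rids and the sample paragraph are hypothetical, and this is a sketch under those assumptions rather than a tested replacement for the code above.

```python
import xml.etree.ElementTree as ET

R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
V_NS = "urn:schemas-microsoft-com:vml"
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"


def picture_rids(paragraph_xml):
    """Collect relationship ids of pictures found in one paragraph's XML."""
    root = ET.fromstring(paragraph_xml)
    rids = []
    # VML pictures: <v:imagedata r:id="...">, as in the XML shown earlier.
    for node in root.iter("{%s}imagedata" % V_NS):
        rid = node.get("{%s}id" % R_NS)
        if rid:
            rids.append(rid)
    # DrawingML pictures: <a:blip r:embed="...">.
    for node in root.iter("{%s}blip" % A_NS):
        rid = node.get("{%s}embed" % R_NS)
        if rid:
            rids.append(rid)
    return rids


# Hand-written sample paragraph, trimmed to the parts that matter here.
sample = (
    '<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"'
    ' xmlns:v="%s" xmlns:r="%s">'
    '<w:r><w:pict><v:shape><v:imagedata r:id="rId8"/></v:shape></w:pict></w:r>'
    "</w:p>" % (V_NS, R_NS)
)
print(picture_rids(sample))  # ['rId8']
```

Because iter() walks the whole subtree, a paragraph with no picture simply yields an empty list instead of raising an IndexError.
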

1.3 Get the image data

imgs = []
for rid in rIds:
    # look up the part behind each rId; its raw bytes are in .blob
    imgs.append(doc.part.related_parts[rid])

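1.4 Save the pictures locally

# number the files from 1: img1.jpg, img2.jpg, ...
for i, img in enumerate(imgs, start=1):
    # the .jpg extension is assumed; the bytes are written unchanged
    with open("img%d.jpg" % i, 'wb') as f:
        f.write(img.blob)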
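
Writing every file with a .jpg extension mislabels pictures that are really PNG or GIF. One possible refinement, sketched below, picks the extension from each part's content type. The names EXTENSIONS and save_images, the ".bin" fallback, and the stand-in parts are assumptions made for illustration. The bytes are still written unchanged, so actually converting between formats would need an image library such as Pillow.

```python
import os
import tempfile
from types import SimpleNamespace

# Small lookup table; extend it if your documents hold other formats.
EXTENSIONS = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/gif": ".gif",
    "image/bmp": ".bmp",
    "image/tiff": ".tiff",
}


def save_images(parts, out_dir):
    """Write each part's bytes to out_dir and return the paths written."""
    paths = []
    for i, part in enumerate(parts, start=1):
        # Fall back to .bin when the content type is not in the table.
        ext = EXTENSIONS.get(part.content_type, ".bin")
        path = os.path.join(out_dir, "img%d%s" % (i, ext))
        with open(path, "wb") as f:
            f.write(part.blob)
        paths.append(path)
    return paths


# Stand-in objects for illustration; real ones would come from
# doc.part.related_parts[rid] as in step 1.3.
fake_parts = [
    SimpleNamespace(content_type="image/png", blob=b"\x89PNG"),
    SimpleNamespace(content_type="image/jpeg", blob=b"\xff\xd8"),
]
out_dir = tempfile.mkdtemp()
saved = save_images(fake_parts, out_dir)
print([os.path.basename(p) for p in saved])  # ['img1.png', 'img2.jpg']
```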

2. Inserting a picture into Word

Inserting a picture is much simpler:

from docx.shared import Cm
doc.add_picture('img_path',width=Cm(16),height=Cm(12))
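
For batch insertion, a loop over the files in a folder may be enough. The sketch below is one way to do it; list_image_files, add_pictures, and the RecordingDoc stand-in are hypothetical names introduced here so the example runs without python-docx installed. Only the width is passed, because add_picture() scales the other dimension to preserve the aspect ratio when just one is given.

```python
import os
import tempfile


def list_image_files(folder, extensions=(".jpg", ".jpeg", ".png")):
    """Return image paths in folder, sorted by file name."""
    return [
        os.path.join(folder, name)
        for name in sorted(os.listdir(folder))
        if name.lower().endswith(extensions)
    ]


def add_pictures(doc, paths, width):
    """Call doc.add_picture() once per path, fixing only the width."""
    for path in paths:
        # With height omitted, python-docx keeps the aspect ratio.
        doc.add_picture(path, width=width)
    return len(paths)


class RecordingDoc:
    """Stand-in that records calls, so the sketch runs without a .docx."""

    def __init__(self):
        self.calls = []

    def add_picture(self, path, width=None, height=None):
        self.calls.append((os.path.basename(path), width))


# Create a scratch folder with two image names and one unrelated file.
folder = tempfile.mkdtemp()
for name in ("b.png", "a.jpg", "notes.txt"):
    open(os.path.join(folder, name), "wb").close()

doc = RecordingDoc()
add_pictures(doc, list_image_files(folder), width=16)
print(doc.calls)  # [('a.jpg', 16), ('b.png', 16)]
```

With a real document, doc would be a docx Document, the width would be something like Cm(16), and the files would need to be actual images.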

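Postscript: reading pictures out of Word is a bit complicated. This code certainly will not work for every Word file, and it may have many problems, since none of this is mentioned in the official API. I only hope to get the discussion started; if you have a better method, I would be glad to hear it.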
