Python XML Parsing with ElementTree

Reposted article, 2018-09-29 14:22. Category: Python

References:

http://www.runoob.com/python/python-xml.html

https://docs.python.org/2/library/xml.etree.elementtree.html

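The Runoob tutorial (first link above) gives a brief conceptual overview of the basic XML programming interfaces DOM and SAX and of the lightweight ElementTree, with a few examples. DOM is a language-independent XML parsing mechanism that loads the whole document into memory as a tree and operates on that tree. The tutorial says little about ElementTree itself; the official documentation (second link above) describes its methods in detail.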

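ElementTree is one of the most convenient ways to parse XML in Python and can be thought of as a lightweight DOM; it is the focus of this article. ElementTree is very handy for parsing, while DOM is heavier but more complete. For example, ElementTree handles XML comments poorly (see https://bugs.python.org/issue8277), and in that case DOM is the better choice.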
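
The comment limitation described above reflects the Python 2 era API. As a hedged aside, on Python 3.8 or later the standard TreeBuilder accepts an insert_comments flag, so comments inside the root element can be kept. The sketch below assumes Python 3.8+; the sample string and variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

xml_text = "<config><!-- keep me --><item>1</item></config>"

# Default parser: comments are silently dropped while parsing.
plain_root = ET.fromstring(xml_text)
plain_out = ET.tostring(plain_root, encoding="unicode")

# Python 3.8+ (assumption): ask the TreeBuilder to insert comment nodes into the tree.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
kept_root = ET.fromstring(xml_text, parser=parser)
kept_out = ET.tostring(kept_root, encoding="unicode")

print(plain_out)  # the comment is gone
print(kept_out)   # the comment survives the round trip
```

Comments that appear outside the root element are still not part of the tree, so DOM remains the safer option when every comment must be preserved.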

Module import:

from xml.etree import ElementTree as ET

Concepts:

<country name="Liechtenstein">
    <rank>1</rank>
    <year>2008</year>
    <gdppc>141100</gdppc>
    <neighbor name="Austria" direction="E"/>
    <neighbor name="Switzerland" direction="W"/>
</country>
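A structure such as <country>xxx</country> is called an element. country is the element's tag; the content between the start tag and the end tag is the element's text or data (for example, 1 in <rank>1</rank>); name inside the start tag is an attribute of the element, stored in its attrib; and the XML tree as a whole is called an ElementTree.
An element is an instance of the class xml.etree.ElementTree.Element, declared as:
class xml.etree.ElementTree.Element(tag, attrib={}, **extra)
All attributes and methods of this class are documented at: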
https://docs.python.org/2/library/xml.etree.elementtree.html#element-objects
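
To make the constructor signature above concrete, here is a minimal sketch that builds part of the sample <country> element by hand. It assumes Python 3; the variable names are illustrative only. The attrib dict and the extra keyword arguments both end up in the element's attrib.

```python
import xml.etree.ElementTree as ET

# attrib is passed as a dict; extra keyword arguments also become attributes.
country = ET.Element("country", {"name": "Liechtenstein"})

# SubElement creates a child and attaches it to the parent in one step.
rank = ET.SubElement(country, "rank")
rank.text = "1"
ET.SubElement(country, "neighbor", name="Austria", direction="E")

print(country.tag)     # country
print(country.attrib)  # {'name': 'Liechtenstein'}
print(rank.text)       # 1
print(ET.tostring(country, encoding="unicode"))
```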

Method reference:

Reading XML data:

-- Read from an XML file
import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
-- Read from an XML string
root = ET.fromstring(country_data_as_string)
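-- Access the four main attributes of an element object: tag, text, attrib, and tail
root.tag     # tag of the root element
root.text    # text of the root element (the text before its first child)
root.attrib  # attributes of the root element, as a dict
root.tail    # text between the end tag of the root element and the next tag
-- Children are accessed by index, like nested lists; this is the text of the second child of the root's first child
root[0][1].text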
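
The four attributes are easier to tell apart on a concrete input. The sketch below assumes Python 3 and uses a shortened version of the sample data; the trailing "tail text" after <year> is added here purely to show what tail holds.

```python
import xml.etree.ElementTree as ET

country_data_as_string = """<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>tail text
    </country>
</data>"""

root = ET.fromstring(country_data_as_string)
country = root[0]   # first child of <data>
year = root[0][1]   # second child of that first child

print(root.tag)         # data
print(country.attrib)   # {'name': 'Liechtenstein'}
print(year.text)        # 2008
print(repr(year.tail))  # the text after </year>, including surrounding whitespace
print(root.tail)        # None: nothing follows the root element
```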

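Element object methods:

Element.iter(tag) -- walks the whole subtree rooted at the current element (children, grandchildren, and so on) and yields every element whose tag matches. If tag is omitted, every element in the subtree is yielded, including the current element itself. Versions before 2.7 and 3.2 do not have this method; use getiterator() there (note that getiterator() was later deprecated and removed in Python 3.9).
Element.findall(tag) -- searches only the direct children of the current element and returns a list of the elements whose tag matches
Element.find(tag) -- searches only the direct children of the current element and returns the first one whose tag matches, or None if there is none
Element.get(key) -- returns the value of the attribute named key on the current element, or None if the attribute is absent
... see the official documentation for the other methods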
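
The difference between the recursive iter() and the direct-children-only find()/findall() is the part that most often causes confusion, so here is a short sketch. It assumes Python 3 and a trimmed copy of the country data; the tag city and the attribute population do not exist in the data and are used only to show the "not found" behaviour.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</data>""")

# iter() searches the whole subtree, so it finds the nested <neighbor> elements.
all_neighbors = [n.get("name") for n in root.iter("neighbor")]

# findall() only looks at direct children: <neighbor> is not a child of <data>.
direct_neighbors = root.findall("neighbor")

# find() returns the first matching direct child, or None if nothing matches.
first_country = root.find("country")
missing = root.find("city")

# get() accepts an optional default for attributes that are absent.
population = first_country.get("population", "unknown")

print(all_neighbors)     # ['Austria', 'Switzerland', 'Malaysia']
print(direct_neighbors)  # []
print(missing)           # None
print(population)        # unknown
```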

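Modifying XML:

ElementTree.write(file, encoding="us-ascii", xml_declaration=None, default_namespace=None, method="xml")  -- writes the tree, including any earlier modifications, to a file
Element.set(key, value) -- sets an attribute on the element
Element.append(subelement) -- adds one child element; extend(subelements), new in 2.7 and 3.2, takes a sequence of elements
Element.remove(subelement) -- removes the given child element; the argument is the element object itself (matched by identity), not a tag name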
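
Because remove() expects an element object rather than a tag name, the usual pattern is to look the element up first and then remove it from its parent. A minimal sketch of remove(), append(), and extend(), assuming Python 3 and a made-up two-country document:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<data>"
    "<country name='Liechtenstein'><rank>1</rank></country>"
    "<country name='Panama'><rank>68</rank></country>"
    "</data>"
)

# findall() returns a new list, so removing children while looping over it is safe.
for country in root.findall("country"):
    if int(country.find("rank").text) > 50:
        root.remove(country)  # pass the element itself, not its tag

# append() adds one child; extend() adds every element of a sequence.
first = root.find("country")
first.append(ET.Element("year"))
first.extend([ET.Element("gdppc"), ET.Element("neighbor", name="Austria")])
first.find("year").text = "2008"

result = ET.tostring(root, encoding="unicode")
print(result)
```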
Example:
>>> for rank in root.iter('rank'):
...     new_rank = int(rank.text) + 1
...     rank.text = str(new_rank)
...     rank.set('updated', 'yes')
...
>>> tree.write('output.xml')

Handling XML files that use namespaces:

-- Given the following XML string:
<?xml version="1.0"?>
<actors xmlns:fictional="http://characters.example.com"
        xmlns="http://people.example.com">
    <actor>
        <name>John Cleese</name>
        <fictional:character>Lancelot</fictional:character>
        <fictional:character>Archie Leach</fictional:character>
    </actor>
    <actor>
        <name>Eric Idle</name>
        <fictional:character>Sir Robin</fictional:character>
        <fictional:character>Gunther</fictional:character>
        <fictional:character>Commander Clement</fictional:character>
    </actor>
</actors>

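The document declares two namespaces: one bound to the prefix fictional and one default namespace. Tags and attributes written as fictional:xxx are automatically expanded to {uri}xxx, and because a default namespace (xmlns) is declared, every unprefixed tag is expanded to {uri}xxx as well (unprefixed attributes are not).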
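
The expansion is easy to verify by printing the parsed tags. The sketch below assumes Python 3 and a shortened copy of the sample; the id attribute is not in the original XML and is added only to show that unprefixed attributes stay unexpanded.

```python
import xml.etree.ElementTree as ET

xml_text = """<actors xmlns:fictional="http://characters.example.com"
        xmlns="http://people.example.com">
    <actor id="a1">
        <name>John Cleese</name>
        <fictional:character>Lancelot</fictional:character>
    </actor>
</actors>"""

root = ET.fromstring(xml_text)
actor = root[0]

print(root.tag)      # {http://people.example.com}actors
print(actor[1].tag)  # {http://characters.example.com}character
print(actor.attrib)  # {'id': 'a1'}

# A plain tag name no longer matches, which is why the lookups need the URI.
print(root.findall("actor"))  # []
```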

There are two ways to search XML like this:

Method 1: add the {uri} prefix by hand when matching
root = ET.fromstring(xml_text)
for actor in root.findall('{http://people.example.com}actor'):
    name = actor.find('{http://people.example.com}name')
    print(name.text)
    for char in actor.findall('{http://characters.example.com}character'):
        print(' |-->', char.text)
Method 2: define your own namespace aliases (this only saves typing when the namespace URIs are long; it brings no performance gain)
ns = {'real_person': 'http://people.example.com','role': 'http://characters.example.com'}
for actor in root.findall('real_person:actor', ns):
    name = actor.find('real_person:name', ns)
    print(name.text)
    for char in actor.findall('role:character', ns):
        print(' |-->', char.text)
-- Both approaches print:
John Cleese
 |--> Lancelot
 |--> Archie Leach
Eric Idle
 |--> Sir Robin
 |--> Gunther
 |--> Commander Clement
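
A possible third option, offered as a hedged sketch rather than part of the original two methods: on Python 3.8 or later, find() and findall() accept a {*} wildcard that matches a tag in any namespace. The cast dictionary is a name introduced here for illustration.

```python
import xml.etree.ElementTree as ET

xml_text = """<?xml version="1.0"?>
<actors xmlns:fictional="http://characters.example.com"
        xmlns="http://people.example.com">
    <actor>
        <name>John Cleese</name>
        <fictional:character>Lancelot</fictional:character>
        <fictional:character>Archie Leach</fictional:character>
    </actor>
    <actor>
        <name>Eric Idle</name>
        <fictional:character>Sir Robin</fictional:character>
        <fictional:character>Gunther</fictional:character>
        <fictional:character>Commander Clement</fictional:character>
    </actor>
</actors>"""

root = ET.fromstring(xml_text)

# Python 3.8+ (assumption): "{*}tag" matches the tag regardless of its namespace.
cast = {}
for actor in root.findall("{*}actor"):
    name = actor.find("{*}name").text
    cast[name] = [char.text for char in actor.findall("{*}character")]

for name, characters in cast.items():
    print(name)
    for character in characters:
        print(" |-->", character)
```

The wildcard is convenient for quick scripts, but it also matches same-named tags from unrelated namespaces, so the explicit forms remain the stricter choice.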

Sample code that compares two proxool.xml files:

# -*- coding:utf-8 -*-
# Compares two configuration files. Element.iter() needs Python 2.7 / 3.2 or later;
# on older versions use getiterator() instead (it was removed in Python 3.9).
import sys
from xml.etree import ElementTree as ET
from xml.dom import minidom
# The old and new XML files are command-line arguments 1 and 2
old_file = sys.argv[1]
new_file = sys.argv[2]
# Add the tags that exist only in the new XML file to the old XML file
def modify_xml(old_file, new_file):
    if not new_file:
        sys.exit(0)
    tree_old = ET.parse(old_file)  # parse the whole ElementTree
    tree_new = ET.parse(new_file)
    global root  # module-level root, parsed once so that prettify_xml can reuse it
    root = tree_old.getroot()
    # Locate the proxool parent node (assumes <proxool> is a direct child of the root)
    root_old = root.find("proxool")
    root_new = tree_new.getroot().find("proxool")
    old_dict = {}  # tag -> text for the old XML file
    new_dict = {}
    for e in root_old.iter():  # every element under proxool, including proxool itself
        # e.text may be None, so test it before calling replace(); a bare "if e.text"
        # only filters None and '', so strip() is needed to skip whitespace-only text
        if e.text and e.tag != 'proxool' and e.text.strip() != '':
            old_dict[e.tag] = e.text.replace("\n", "").replace("\t", "")
    for e in root_new.iter():
        if e.text and e.tag != 'proxool' and e.text.strip() != '':
            new_dict[e.tag] = e.text.replace("\n", "").replace("\t", "")
    # Both files are now tag/text dictionaries; comparing them yields the new tags
    for tag, text in new_dict.items():
        if tag not in old_dict:  # tag is missing from the old XML, so add it
            new_tag = ET.Element(tag)  # build an element
            new_tag.text = text  # set its text
            root_old.append(new_tag)  # attach it as a child of root_old
    # Write the modified ElementTree to <old_file>_fixed; comments are lost this way
    tree_old.write(old_file + "_fixed", encoding="UTF-8")
# The newly written entries are not nicely formatted, so prettify them
# (the result turned out even uglier; this still needs improvement)
def prettify_xml(filename):
    strTree = ET.tostring(root)  # uses the module-level root
    new_strTree = minidom.parseString(strTree).toprettyxml()
    with open(filename, 'w') as output:
        output.write(new_strTree)
# Run both steps
modify_xml(old_file, new_file)
prettify_xml(old_file + "_fixed")

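# PS: I later found that XML parsed with ElementTree is hard to prettify and that comments cannot be handled, so I switched to minidom for XML files; see "Python XML解析之DOM" (Python XML parsing with DOM).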
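
On the prettifying problem noted above, a hedged alternative for newer interpreters: Python 3.9 added ET.indent(), which rewrites the whitespace of a tree in place so that appended elements line up with the existing ones. The sketch below assumes Python 3.9+ and uses a made-up, minimal proxool fragment rather than a real configuration file.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<proxool><alias>db</alias><driver-url>jdbc:x</driver-url></proxool>"
)

# Append a new tag the same way the comparison script does.
new_tag = ET.Element("maximum-connection-count")
new_tag.text = "20"
root.append(new_tag)

# Python 3.9+ (assumption): normalise indentation in place before serialising.
ET.indent(root, space="    ")
pretty = ET.tostring(root, encoding="unicode")
print(pretty)
```

This does not bring back comments dropped during parsing, so it addresses only the formatting half of the problem.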
