Code:
from bs4 import BeautifulSoup

# All of the examples below are tested against this document
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')
print("1; get the head tag")
print(soup.head)
print("2; get the b node under the p node")
print(soup.p.b)
# The name attribute returns the node name
print("4; the name attribute returns the node name")
print(soup.body.name)
# attrs returns the node's attributes; a single attribute can also be read
# with dictionary-style access. The value is a list or a string depending on
# the attribute (multi-valued attributes such as class come back as a list).
print("5; get all attributes of the p node")
print(soup.p.attrs)
print("6; get the class attribute of the p node")
print(soup.p.attrs['class'])
print("7; get the class attribute of the p node directly")
print(soup.p['class'])
# string returns the text contained in the node
print("8; get the text of the p node (first p only)")
print(soup.p.string)
# contents returns the node's direct children as a list
print("9; contents returns the node's direct children as a list")
print(soup.body.contents)
# children also returns the direct children, but as an iterator
print("10; children also returns the direct children, but as an iterator")
print(soup.body.children)
# descendants returns all nested nodes as a generator
print("11; descendants returns all nested nodes as a generator")
print(soup.body.descendants)
# parent returns the parent node; parents returns the ancestors as a generator
print("12; parent returns the parent node; parents returns the ancestors as a generator")
print(soup.b.parent)
print(soup.b.parents)
# next_sibling returns the next sibling node
print("13; next_sibling returns the next sibling node")
print(soup.a.next_sibling)
# previous_sibling returns the previous sibling node; note that a newline is also a node
print("14; previous_sibling returns the previous sibling node; note that a newline is also a node")
print(soup.a.previous_sibling)
# next_siblings returns all following sibling nodes (generator)
print("15; next_siblings returns all following sibling nodes")
print(soup.a.next_siblings)
# previous_siblings returns all preceding sibling nodes (generator); note that a newline is also a node
print("16; previous_siblings returns all preceding sibling nodes; note that a newline is also a node")
print(soup.a.previous_siblings)
# next_element and previous_element return the next or previous parsed object
print("17; next_element and previous_element return the next or previous parsed object")
print(soup.a.next_element)
print(soup.a.previous_element)
# next_elements and previous_elements iterate forward or backward over the parsed content
print("18; next_elements and previous_elements iterate forward or backward over the parsed content")
print(soup.a.next_elements)
print(soup.a.previous_elements)
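Two comments in the listing are worth a closer look: attribute values can be either a list or a string, and string only works when a node has a single child. The following is a minimal sketch of both behaviours, not part of the original script. It assumes the standard-library html.parser instead of lxml so that it runs without extra dependencies, and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Assumption: a shortened copy of the test document, parsed with html.parser
# (standard library) rather than lxml.
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>;
and they lived at the bottom of a well.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")
first_p, story_p = soup.find_all("p")

# class is a multi-valued attribute, so it comes back as a list.
class_value = soup.a["class"]    # ['sister']
# id is single-valued, so it comes back as a plain string.
id_value = soup.a["id"]          # 'link1'

# string follows a chain of single children: <p> -> <b> -> text.
title_text = first_p.string      # "The Dormouse's story"
# The story paragraph has several children, so string is None.
story_string = story_p.string    # None
# get_text() joins all nested text instead.
story_text = story_p.get_text()
```

When a node may contain mixed text and tags, get_text() is usually the safer choice, since string silently returns None in that case.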
Output:
/home/aaron/桌面/Python3-Test/venv/bin/python /home/aaron/桌面/Python3-Test/bs4-study.py
1; get the head tag
<head><title>The Dormouse's story</title></head>
2; get the b node under the p node
<b>The Dormouse's story</b>
4; the name attribute returns the node name
body
5; get all attributes of the p node
{'class': ['title']}
6; get the class attribute of the p node
['title']
7; get the class attribute of the p node directly
['title']
8; get the text of the p node (first p only)
The Dormouse's story
9; contents returns the node's direct children as a list
['\n', <p class="title"><b>The Dormouse's story</b></p>, '\n', <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, '\n', <p class="story">...</p>, '\n']
10; children also returns the direct children, but as an iterator
<list_iterator object at 0x7f0b1bd17750>
11; descendants returns all nested nodes as a generator
<generator object Tag.descendants at 0x7f0b19e17d50>
12; parent returns the parent node; parents returns the ancestors as a generator
<p class="title"><b>The Dormouse's story</b></p>
<generator object PageElement.parents at 0x7f0b19e17d50>
13; next_sibling returns the next sibling node
,
14; previous_sibling returns the previous sibling node; note that a newline is also a node
Once upon a time there were three little sisters; and their names were
15; next_siblings returns all following sibling nodes
<generator object PageElement.next_siblings at 0x7f0b19e17d50>
16; previous_siblings returns all preceding sibling nodes; note that a newline is also a node
<generator object PageElement.previous_siblings at 0x7f0b19e17d50>
17; next_element and previous_element return the next or previous parsed object
Elsie
Once upon a time there were three little sisters; and their names were
18; next_elements and previous_elements iterate forward or backward over the parsed content
<generator object PageElement.next_elements at 0x7f0b19e17d50>
<generator object PageElement.previous_elements at 0x7f0b19e17d50>

Process finished with exit code 0
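Several of the results above print only as generator or iterator objects, which hides what they actually contain. One way to inspect them is to consume them with a list comprehension or list(). The sketch below is an illustration and not part of the original script. It reuses the same test document but assumes html.parser (standard library) in place of lxml, and it filters on Tag to skip the whitespace text nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
# Assumption: html.parser is used so the sketch needs no extra packages.
soup = BeautifulSoup(html_doc, "html.parser")

# children: direct children only; keep tags, drop whitespace text nodes.
body_tags = [child.name for child in soup.body.children if isinstance(child, Tag)]
# ['p', 'p', 'p']

# descendants: every nested node of the first <p>, tags and text alike.
title_descendants = list(soup.p.descendants)

# parents: walk upward from <b> to the BeautifulSoup object itself.
ancestor_names = [parent.name for parent in soup.b.parents]
# ['p', 'body', 'html', '[document]']

# next_siblings: everything after the first <a> inside the same <p>.
later_ids = [s["id"] for s in soup.a.next_siblings if isinstance(s, Tag)]
# ['link2', 'link3']

# previous_siblings: only the leading text node precedes the first <a>.
earlier = list(soup.a.previous_siblings)

print(body_tags, ancestor_names, later_ids, len(earlier))
```

Each of these generators can be consumed only once, so wrap it in list() if the result is needed more than once.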