[Python Crawler] Targeted Crawling of PubMed Biomedical Abstracts with Selenium

Reposted from the original post, 2015-12-18 03:00 [Python Crawler]

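This post is mainly my own online code notebook. While building a biomedical ontology, I used Selenium to crawl content from the PubMed biomedical database.
PubMed is a free search engine for biomedical papers and their abstracts. Its data comes from MEDLINE, a biomedical database whose core subject is medicine but which also covers related fields such as nursing and other health disciplines. It also offers fairly comprehensive coverage of related biomedical areas such as biochemistry and cell biology.
PubMed is the most widely used free MEDLINE interface on the Internet. It is provided by the U.S. National Library of Medicine, is a web-based biomedical information retrieval system, and is one part of the NCBI Entrez database query system. The PubMed interface links to integrated molecular biology databases, covering DNA and protein sequences, genome map data, 3D protein structures, and Online Mendelian Inheritance in Man (OMIM), and it also links to the websites of publishers that provide full-text journal articles.
Medical directory link: http://www.meddir.cn/cate/736.htm
PubMed site: http://pubmed.cn/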

Implementation

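The code uses Selenium to crawl pages by analyzing their DOM nodes.
The site crawled is: http://www.medlive.cn/pubmed/
After searching the site for Protein, inspecting the URL shows that setting page=1 to 20 retrieves result pages 1 through 20. The link looks like this: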
http://www.medlive.cn/pubmed/pubmed_search.do?q=protein&page=1
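
As a minimal sketch of that URL pattern, the 20 search-page URLs could be generated up front. `SEARCH_URL` and `build_search_urls` are names introduced here for illustration, and the snippet assumes Python 3 and that the site still accepts the `q` and `page` parameters exactly as in the link above.

```python
from urllib.parse import urlencode

# Search endpoint taken from the example link above.
SEARCH_URL = "http://www.medlive.cn/pubmed/pubmed_search.do"


def build_search_urls(keyword, first_page=1, last_page=20):
    """Return the search-result URLs for pages first_page..last_page."""
    urls = []
    for page in range(first_page, last_page + 1):
        # urlencode escapes the keyword, so multi-word queries stay valid.
        query = urlencode({"q": keyword, "page": page})
        urls.append(SEARCH_URL + "?" + query)
    return urls


if __name__ == "__main__":
    for url in build_search_urls("protein", 1, 3):
        print(url)
```

Building the query with `urlencode` rather than string concatenation is a design choice of this sketch; for a single-word keyword such as protein both approaches give the same URL.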

# coding=utf-8
"""
Created on 2015-12-05  Ontology Spider
@author Eastmount CSDN
URL:
  http://www.meddir.cn/cate/736.htm
  http://www.medlive.cn/pubmed/
  http://paper.medlive.cn/literature/1502224
"""

import time
import os
import shutil
import codecs
from selenium import webdriver

# Written against the 2015-era Selenium API. Selenium 4 removed the PhantomJS
# driver and the find_element(s)_by_* methods used below.

# Output directory. The original listing created F:\MedSpider\ but wrote the
# files to E:\PubMedSpider\; a single path is used here so the two agree.
OUTPUT_DIR = "E:\\PubMedSpider\\"

# Firefox loads the search-result pages and PhantomJS loads the article pages,
# so the result-page elements stay valid while each article is fetched.
driver = webdriver.Firefox()
driver2 = webdriver.PhantomJS(executable_path=r"G:\phantomjs-1.9.1-windows\phantomjs.exe")

'''
 Load Ontology
 Visit each article page and download its abstract
 http://paper.medlive.cn/literature/literature_view.php?pmid=26637181
 http://paper.medlive.cn/literature/1526876
'''
def getAbstract(num, title, url):
    fileName = OUTPUT_DIR + str(num) + ".txt"
    try:
        # A plain open(fileName, "w") fails with:
        # Error: 'ascii' codec can't encode character u'\u223c'
        # The with block closes the file even if loading the page fails.
        with codecs.open(fileName, 'w', 'utf-8') as result:
            result.write("[Title]\r\n")
            result.write(title + "\r\n\r\n")
            result.write("[Abstract]\r\n")
            driver2.get(url)
            elem = driver2.find_element_by_xpath("//div[@class='txt']/p")
            result.write(elem.text + "\r\n")
    except Exception as e:
        print('Error: %s' % e)
    finally:
        print('END\n')

'''
 Loop over the search-result pages and collect the article URLs
 Pattern: http://www.medlive.cn/pubmed/pubmed_search.do?q=protein&page=1
'''
def getURL():
    count = 1     # running count of all articles found
    for page in range(1, 21):     # result pages 1..20
        url_page = "http://www.medlive.cn/pubmed/pubmed_search.do?q=protein&page=" + str(page)
        print(url_page)
        driver.get(url_page)
        elem_url = driver.find_elements_by_xpath("//div[@id='div_data']/div/div/h3/a")
        for url in elem_url:
            num = "%05d" % count
            title = url.text
            url_content = url.get_attribute("href")
            print(num)
            print(title)
            print(url_content)
            # Custom function that fetches and saves the abstract
            getAbstract(num, title, url_content)
            count = count + 1
        print("Over Page " + str(page) + "\n\n")
    print("Over getUrl()\n")
    time.sleep(5)

'''
 Main entry point: reset the output directory, then run the crawl
'''
if __name__ == '__main__':
    path = OUTPUT_DIR
    if os.path.isfile(path):         # Delete file
        os.remove(path)
    elif os.path.isdir(path):        # Delete dir
        shutil.rmtree(path, True)
    os.makedirs(path)                # Create the output directory
    getURL()
    driver.quit()                    # Close both browsers when done
    driver2.quit()
    print("Download has finished.")

Analyzing the HTML

1. Get the 20 Protein-related URLs and titles on each result page. The core code in getURL() that collects the URLs is:
  elem_url = driver.find_elements_by_xpath("//div[@id='div_data']/div/div/h3/a")
  for url in elem_url:
      url_content = url.get_attribute("href")
      getAbstract(num,title,url_content)

2. Then visit each article page and fetch its abstract.
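
To see what the result-page XPath selects without launching a browser, here is a small sketch that runs the same path against a hand-written HTML fragment using the standard library's `xml.etree.ElementTree`. The fragment (`SAMPLE_HTML`), its titles, and the helper `extract_links` are made up for illustration. The real page markup is only assumed to match the structure implied by the XPath, and real-world HTML is often not well-formed enough for this parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for one search-result page.
SAMPLE_HTML = """
<html><body>
<div id="div_data">
  <div><div><h3>
    <a href="http://paper.medlive.cn/literature/1502224">Sample title A</a>
  </h3></div></div>
  <div><div><h3>
    <a href="http://paper.medlive.cn/literature/1526876">Sample title B</a>
  </h3></div></div>
</div>
</body></html>
"""


def extract_links(html, start=1):
    """Return (num, title, href) tuples for every result link in html."""
    root = ET.fromstring(html)
    # Same path as in getURL(), written relative to the root element.
    anchors = root.findall(".//div[@id='div_data']/div/div/h3/a")
    links = []
    for count, a in enumerate(anchors, start):
        num = "%05d" % count  # zero-padded counter used as the file name
        links.append((num, a.text.strip(), a.get("href")))
    return links


if __name__ == "__main__":
    for num, title, href in extract_links(SAMPLE_HTML):
        print(num, title, href)
```

The `start` argument mirrors the running counter in getURL(): the second result page would continue from 21 rather than restart at 1.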

Errors you may run into include:
1. Error: 'ascii' codec can't encode character u'\u223c'
This is a file-encoding error on write. I usually fix it by changing open(fileName,"w") to codecs.open(fileName,'w','utf-8').
2. The second error is shown below. It may be caused by the page failing to load or by the connection being closed:
WebDriverException: Message: Error Message => 'URL ' didn't load. Error: 'TypeError: 'null' is not an object
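
Workarounds for both errors can be sketched with the standard library alone. `write_text` and `retry` are hypothetical helper names, not part of Selenium. The attempt count and delay are arbitrary, and retrying only helps if the failure really is a transient load or connection problem, which is itself only a guess.

```python
import io
import time


def write_text(file_name, text):
    """Write unicode text as UTF-8 so characters such as U+223C are safe."""
    # io.open accepts an explicit encoding on both Python 2 and Python 3.
    with io.open(file_name, "w", encoding="utf-8") as f:
        f.write(text)


def retry(action, attempts=3, delay=2.0, exceptions=(Exception,)):
    """Call action() until it succeeds or the attempts run out.

    Re-raises the last exception if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(delay)  # give the page or connection time to recover
```

In the crawler, the page load could then be wrapped as `retry(lambda: driver2.get(url))`, optionally narrowing `exceptions` to Selenium's `WebDriverException` so that unrelated bugs are not retried.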

Results

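The run produces 400 text files, 00001.txt through 00400.txt, each containing a title and an abstract. The dataset can be used for simple biomedical ontology learning, named entity recognition, ontology alignment, and similar tasks.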
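
As a starting point for that kind of downstream use, the files can be read back into (title, abstract) pairs. This is a sketch under assumptions: `parse_record` and `load_records` are hypothetical names, the files are assumed to be UTF-8 with each bracketed section header on its own line as written by getAbstract(), and the section labels are read from the file rather than hard-coded. An abstract line that consists only of a bracketed word would be mistaken for a header.

```python
import glob
import io
import os
import re

# A section header is a line consisting only of "[Name]".
HEADER = re.compile(r"^\[(\w+)\]$")


def parse_record(text):
    """Split one file's text into an ordered list of (header, body) pairs."""
    sections = []
    for line in text.splitlines():
        match = HEADER.match(line.strip())
        if match:
            sections.append([match.group(1), []])
        elif sections:
            sections[-1][1].append(line)
    return [(name, "\n".join(body).strip()) for name, body in sections]


def load_records(directory):
    """Return {file number: (title, abstract)} for every .txt file found."""
    records = {}
    for path in sorted(glob.glob(os.path.join(directory, "*.txt"))):
        with io.open(path, encoding="utf-8") as f:
            sections = parse_record(f.read())
        if len(sections) >= 2:
            # The first section holds the title, the second the abstract.
            key = os.path.splitext(os.path.basename(path))[0]
            records[key] = (sections[0][1], sections[1][1])
    return records
```

Calling `load_records` on the output directory gives a dictionary keyed by file number, which is a convenient input for tokenizers or entity taggers.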

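PS: I hope this post is of some help. The content is simple, but it should still be useful for beginners or anyone new to crawlers. This post is mostly a personal online note recording a piece of code. I do not plan to write more posts about this kind of simple Selenium page crawling, and will instead write about smarter dynamic interactions, Scrapy, and distributed crawlers in Python. If there are mistakes or shortcomings, please bear with me. Yesterday was my birthday, so happy birthday to me, and here's to my dream of becoming a teacher!
(By: Eastmount, 2015-12-06, 3:30 a.m. http://blog.csdn.net/eastmount/)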
