Crawler parsing: CSS and XPath syntax

Reposted article (original posted 2018-11-12 17:39)

1. XPath syntax

XPath sample document

<?xml version="1.0" encoding="ISO-8859-1"?>

<bookstore>

<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>

<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>

</bookstore>

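Selecting nodes

XPath uses path expressions to select nodes in an XML document. A node is selected by following a path, that is, a sequence of steps.

The most useful path expressions are:

nodename    selects the child elements of the current node that are named nodename
/           selects from the root node
//          selects matching nodes at any depth, no matter where they are in the document
.           selects the current node
..          selects the parent of the current node
@           selects attributes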

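Examples

Some path expressions and their results against the sample document:

bookstore          selects the child elements of the current node that are named bookstore
/bookstore         selects the root element bookstore
bookstore/book     selects all book elements that are children of bookstore
//book             selects all book elements, wherever they are in the document
bookstore//book    selects all book elements that are descendants of bookstore, at any depth
//@lang            selects all attributes named lang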
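The path expressions in this section can be tried out with a minimal sketch based on Python's standard-library xml.etree.ElementTree. This is only an illustration under some assumptions: ElementTree implements a subset of XPath and only accepts paths relative to an element, so the absolute forms (/bookstore/book, //title) are written relative to the root element, and the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

XML = """<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>"""

root = ET.fromstring(XML)  # root is the <bookstore> element

# "book/title": child steps, relative to the current node (bookstore)
titles_by_child_steps = [t.text for t in root.findall("book/title")]

# ".//title": title elements at any depth below the current node
titles_anywhere = [t.text for t in root.findall(".//title")]

# "..": the parent of each matched title, i.e. the book elements
parents = [p.tag for p in root.findall(".//title/..")]

# ElementTree cannot return attribute nodes (//@lang), so the attribute
# value is read from each element that carries it
langs = [t.get("lang") for t in root.findall(".//title[@lang]")]

print(titles_by_child_steps, titles_anywhere, parents, langs)
```

With a full XPath 1.0 engine such as the one behind Scrapy's selectors, the same queries can be written in their absolute form.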

Predicates

A predicate is used to find a specific node, or a node that contains a specific value.

Predicates are written inside square brackets.

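Examples

Some path expressions with predicates, and their results:

/bookstore/book[1]                    selects the first book element that is a child of bookstore
/bookstore/book[last()]               selects the last book element that is a child of bookstore
/bookstore/book[last()-1]             selects the second-to-last book element that is a child of bookstore
/bookstore/book[position()<3]         selects the first two book elements that are children of bookstore
//title[@lang]                        selects all title elements that have a lang attribute
//title[@lang='eng']                  selects all title elements whose lang attribute equals eng
/bookstore/book[price>35.00]          selects all book elements of bookstore whose price child has a value greater than 35.00
/bookstore/book[price>35.00]/title    selects the title elements of those book elements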
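A short sketch of predicates, again assuming the standard-library xml.etree.ElementTree and illustrative variable names. ElementTree supports positional predicates, last(), attribute tests, and equality on a child's text, but not numeric comparisons such as price>35, so only the supported forms are shown.

```python
import xml.etree.ElementTree as ET

XML = """<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>"""

root = ET.fromstring(XML)

# book[1]: positions are 1-based, so this is the first book
first_title = root.find("book[1]/title").text

# book[last()]: the last book child
last_title = root.find("book[last()]/title").text

# title[@lang='eng']: titles whose lang attribute equals "eng"
eng_titles = [t.text for t in root.findall(".//title[@lang='eng']")]

# book[price='39.95']: books whose price child has exactly this text
priced_3995 = [b.findtext("title") for b in root.findall("book[price='39.95']")]

print(first_title, last_title, eng_titles, priced_3995)
```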

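Selecting unknown nodes

XPath wildcards can be used to select XML elements whose names are not known in advance: * matches any element node, @* matches any attribute node, and node() matches a node of any kind.

Examples

Some path expressions with wildcards, and their results:

/bookstore/*    selects all child elements of the bookstore element
//*             selects all elements in the document
//title[@*]     selects all title elements that have at least one attribute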
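A small sketch of the wildcard idea, assuming xml.etree.ElementTree from the standard library. ElementTree understands * but not @*, so the attribute wildcard is emulated in plain Python; the names below are illustrative only.

```python
import xml.etree.ElementTree as ET

XML = """<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>"""

root = ET.fromstring(XML)

# "*": every child element of bookstore, whatever its name
children = [c.tag for c in root.findall("*")]

# ".//*": every element below bookstore, in document order
# (the XPath expression //* would also include bookstore itself)
descendants = [e.tag for e in root.findall(".//*")]

# Emulation of //*[@*]: keep the elements that carry at least one attribute
with_attributes = [e.tag for e in root.iter() if e.attrib]

print(children, descendants, with_attributes)
```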

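Selecting several paths

By using the "|" operator in a path expression, you can select several paths at once.

Examples

Some path expressions that combine paths, and their results:

//book/title | //book/price        selects all title and price elements of all book elements
//title | //price                  selects all title and price elements in the document
/bookstore/book/title | //price    selects all title elements of the book elements of bookstore, plus all price elements in the document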
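The standard-library ElementTree has no "|" operator, so the following is only an emulation of the union, not the XPath operator itself: a hypothetical helper named union runs each path separately, removes duplicates, and restores document order, which is what an XPath union would return.

```python
import xml.etree.ElementTree as ET

XML = """<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>"""


def union(root, *paths):
    """Emulate path1 | path2 | ...: merged matches, deduplicated, in document order."""
    order = {element: index for index, element in enumerate(root.iter())}
    found = {element for path in paths for element in root.findall(path)}
    return sorted(found, key=order.__getitem__)


root = ET.fromstring(XML)

# Rough equivalent of //title | //price
titles_and_prices = [e.text for e in union(root, ".//title", ".//price")]
print(titles_and_prices)
```

With an engine that implements XPath 1.0 fully, the single expression //title | //price gives the same nodes without any helper.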

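XPath axes
An axis defines a node set relative to the current node. XPath 1.0 axes include ancestor, child, descendant, following, following-sibling, parent, preceding, preceding-sibling, attribute, and self.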
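To make the idea of an axis concrete, here is a sketch that emulates the following-sibling axis by hand. It assumes xml.etree.ElementTree, which does not support axis syntax, so following_siblings is a hypothetical helper written for this example rather than an XPath feature.

```python
import xml.etree.ElementTree as ET

XML = """<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>"""


def following_siblings(root, target):
    """Emulate following-sibling::* for target, which must be an element below root."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    siblings = list(parent_of[target])
    return siblings[siblings.index(target) + 1:]


root = ET.fromstring(XML)

# Siblings that come after the first title: its price
after_first_title = [e.tag for e in following_siblings(root, root.find("book[1]/title"))]

# Siblings that come after the first book: the second book
after_first_book = [e.findtext("title") for e in following_siblings(root, root.find("book[1]"))]

print(after_first_title, after_first_book)
```

In a full XPath engine the first query would be written as /bookstore/book[1]/title/following-sibling::*.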

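Exercises

Select all titles
The following expression selects all title nodes:
/bookstore/book/title

Select the title of the first book
The following expression selects the title of the first book node under the bookstore element:
/bookstore/book[1]/title

Older versions of Internet Explorer (MSXML) treat [0] as the first node unless the selection language is set to XPath, so there the same selection is written as: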
xml.setProperty("SelectionLanguage","XPath");
xml.selectNodes("/bookstore/book[1]/title");

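Select all prices
The following expression selects the text of all price nodes:
/bookstore/book/price/text()

Select price nodes with a price above 35
The following expression selects all price nodes whose value is greater than 35:

/bookstore/book[price>35]/price

Select title nodes with a price above 35
The following expression selects the title nodes of all books whose price is greater than 35:

/bookstore/book[price>35]/title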
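The exercises can be reproduced with a short sketch, assuming the standard-library xml.etree.ElementTree. ElementTree supports neither text() nor numeric predicates such as [price>35], so the text is read through the .text attribute and the price comparison is done in Python; a full XPath engine would evaluate the expressions as written.

```python
import xml.etree.ElementTree as ET

XML = """<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>"""

root = ET.fromstring(XML)

# /bookstore/book/title
all_titles = [t.text for t in root.findall("book/title")]

# /bookstore/book[1]/title
first_title = root.find("book[1]/title").text

# /bookstore/book/price/text()
price_texts = [p.text for p in root.findall("book/price")]

# /bookstore/book[price>35]: filter in Python instead of in the path
expensive_books = [b for b in root.findall("book") if float(b.findtext("price")) > 35]
expensive_prices = [b.findtext("price") for b in expensive_books]
expensive_titles = [b.findtext("title") for b in expensive_books]

print(all_titles, first_title, price_texts, expensive_prices, expensive_titles)
```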

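2. CSS syntax

Extracting content

1) A selector written from the browser's Inspect Element view is not necessarily correct; write it against the page source instead.

The two can differ: the inspector shows the DOM after the browser has parsed the page and run its scripts, while the crawler receives the raw page source.

2) Browsers have a built-in "Copy XPath" feature; for Firefox, install the Firebug add-on.

3) Scrapy's selectors are built on lxml, whose XPath engine is implemented in C, so XPath is fast. Matching with [@class="..."] compares the complete attribute value, which makes it precise.

Crawler in practice: XPath and CSS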
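The precision of [@class="..."] comes from comparing the whole attribute string, which also means it misses elements that carry additional classes. The sketch below illustrates this with the standard-library xml.etree.ElementTree on a made-up, well-formed HTML fragment; the markup and variable names are assumptions for the example only.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the second div has an extra class
HTML = """<html><body>
<div class="t1"><span>A</span></div>
<div class="t1 wide"><span>B</span></div>
</body></html>"""

root = ET.fromstring(HTML)

# [@class='t1'] matches only when the attribute is exactly "t1"
exact_match = [d.find("span").text for d in root.findall(".//div[@class='t1']")]

# A CSS selector such as div.t1 matches on the class token instead;
# here that behaviour is emulated by splitting the attribute value
token_match = [
    d.find("span").text
    for d in root.iter("div")
    if "t1" in d.get("class", "").split()
]

print(exact_match, token_match)
```

When a page mixes several classes on one element, a CSS class selector is usually the more forgiving choice, while the exact XPath comparison is stricter.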

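import os
import re

from scrapy.selector import Selector  # assumed: Selector is Scrapy's selector class

# NOTE: drug_path (root directory of the downloaded pages) and Logger are not
# shown in this post; they are assumed to be defined elsewhere in the project.


class DrugInfo(object):
    """
    Extracted drug information:
        self.drug_name                      # drug name
        self.category                       # drug category
        self.cite                           # national standard
        self.company                        # manufacturer
        self.address                        # manufacturer address
        self.license_number                 # approval number
        self.approval_date                  # approval date
        self.form_drug                      # dosage form
        self.spec                           # specification
        self.store                          # storage method
        self.period_valid                   # shelf life
        self.attention_rank                 # popularity ranking
        self.indication                     # indications
        self.component                      # ingredients
        self.function                       # functions and indications
        self.usage_dosage                   # usage and dosage
        self.contraindication               # contraindications
        self.special_population             # use in special populations
        self.indications                    # overview of the indications
        self.is_or_not_medical_insurance    # covered by medical insurance or not
        self.is_or_not_infections           # infectious or not
        self.related_symptoms               # related symptoms
        self.related_examination            # related examinations
        self.adverse_reaction               # adverse reactions
        self.attention_matters              # precautions
        self.interaction                    # drug interactions
        self.pharmacological_action         # pharmacological action
        self.revision_date                  # package insert revision date
        self.drug_use_consult               # medication consultations
        self.drug_use_experience            # medication experience

    """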
    def __init__(self,drug):
        # drug is a directory name of the form "<name>[<id>]"
        drug_dir = os.path.join(drug_path, drug)
        self.drug_name = re.findall(r'(.*?)\[\d+\]',drug)[0]
        self.drug_id = re.findall(r'.*?\[(\d+)\].*',drug)[0]
        self.drug_dir = drug_dir
        self.drug_use_experience = ''
        self.drug_use_consult = ''
        self.file_list = os.listdir(self.drug_dir)

        self.logger = Logger()

        self.result = True

        self.dispatch()
        # '無' ("none") marks a drug without consultation or experience pages
        if not self.drug_use_consult:
            self.drug_use_consult = '無'
        if not self.drug_use_experience:
            self.drug_use_experience = '無'

    def dispatch(self):
        for file in self.file_list:
            if file.endswith('葯品概述.html'):
                self.drug_summary(self.file_path(file))
            elif file.endswith('詳細說明書.html'):
                self.drug_instruction(self.file_path(file))
            elif re.match('.*?用葯咨詢.*',file):
                self.drug_consultation(self.file_path(file))
            elif re.match('.*?用葯經驗.*',file):
                self.drug_experience(self.file_path(file))
            else:
                self.result = False
                break

    def file_path(self,file):
        return os.path.join(self.drug_dir,file)

    def read_file(self,file):
        with open(file,'r') as f:
            html = f.read()
        return html

    def drug_summary(self,file):
        """Drug overview page"""
        html = self.read_file(file)
        selector = Selector(text=html)
        # Fields missing from the page are stored as '未知' ("unknown")
        self.category = selector.xpath('//div[@class="t1"]/cite[1]/span/text()').extract_first()    # drug category
        if not self.category:
            self.category = '未知'
        self.cite = selector.xpath('//div[@class="t1"]/cite[2]/span/text()').extract_first()    # national standard
        if not self.cite:
            self.cite = '未知'
        try:
            self.company = selector.css('.t3 .company a::text').extract()[0]    # manufacturer
        except IndexError:
            self.company = '未知'
        try:
            self.address = selector.css('.t3 .address::text').extract()[0]  # manufacturer address
        except IndexError:
            self.address = '未知'
        try:
            self.license_number = selector.xpath('//ul[@class="xxs"]/li[1]/text()').extract_first().strip() # approval number
        except AttributeError:
            self.license_number = '未知'
        try:
            self.approval_date = selector.xpath('//ul[@class="xxs"]/li[2]/text()').extract_first().strip()  # approval date
        except AttributeError:
            self.approval_date = '未知'
        try:
            self.form_drug = selector.xpath('//ul[@class="showlis"]/li[1]/text()').extract_first().strip()  # dosage form
        except AttributeError:
            self.form_drug = '未知'
        try:
            self.spec = selector.xpath('//ul[@class="showlis"]/li[2]/text()').extract_first().strip()       # specification
        except AttributeError:
            self.spec = '未知'
        try:
            self.store = selector.xpath('//ul[@class="showlis"]/li[3]/text()').extract_first().strip().strip('。')     # storage method
        except AttributeError:
            self.store = '未知'
        try:
            self.period_valid = selector.xpath('//ul[@class="showlis"]/li[4]/text()').extract_first().strip('。').replace('\n','')   # shelf life
        except AttributeError:
            self.period_valid = '未知'
        self.attention_rank = selector.css('.guanzhu cite font::text').extract_first()  # popularity ranking
        if not self.attention_rank:
            self.attention_rank = '未知'
        self.indication = ','.join(selector.css('.whatsthis li::text').extract())   # indications
        if self.indication == '':
            self.indication = '未知'
        usage_dosage = selector.css('.ps p:nth-child(3)::text').extract_first()   # usage and dosage
        if usage_dosage:
            self.usage_dosage = re.sub('<.*?>','',usage_dosage).strip().replace('\n','')  # strip tags and newlines
        else:
            self.usage_dosage = '未知'
        indications = selector.css('#diseaseintro::text').extract_first()  # overview of the indications
        if indications:
            self.indications = re.sub('<.*?>','',indications).strip().replace('\n','')  # strip tags and newlines
        else:
            self.indications = '未知'
        try:
            self.is_or_not_medical_insurance = selector.css('.syz_cons p:nth-child(2)::text').extract_first().split('：')[1] # covered by medical insurance or not
        except (AttributeError, IndexError):  # node missing, or no '：' separator in the text
            self.is_or_not_medical_insurance = '未知'
        try:
            self.is_or_not_infections = selector.css('.syz_cons p:nth-child(3)::text').extract_first().split('：')[1].strip()  # infectious or not
        except (AttributeError, IndexError):
            self.is_or_not_infections = '未知'
        self.related_symptoms = ','.join(selector.css('.syz_cons p:nth-child(4) a::text').extract()[:-1])      # related symptoms (the last link is skipped)
        if len(self.related_symptoms) == 0:
            self.related_symptoms = '未知'
        self.related_examination = ','.join(selector.css('.syz_cons p:nth-child(5) a::text').extract()[:-1])    # related examinations (the last link is skipped)
        if len(self.related_examination) == 0:
            self.related_examination = '未知'

    def drug_instruction(self,file):
        """Detailed package insert page"""
        html = self.read_file(file)
        selector = Selector(text=html)
        # Note: the page structure differs between drugs, so extract with care.
        # Each section is located by its <dt> label; following::*[1] is the element right after it.
        component = selector.xpath('//dt[text()="【成份】"]/following::*[1]').extract_first()
        if not component:
            self.component = '未知'
        else:
            self.component = re.sub('<.*?>','',component).strip()       # ingredients
        contraindication= selector.xpath('//dt[text()="【禁忌】"]/following::*[1]').extract_first()
        if contraindication:
            self.contraindication = re.sub('<.*?>','',contraindication).strip().replace('\n','')  # contraindications
        else:
            self.contraindication = '未知'
        function = selector.xpath('//dt[text()="【功能主治】"]/following::*[1]').extract_first()
        if function:
            self.function = re.sub('<.*?>','',function).strip()         # functions and indications
        else:
            self.function = '未知'

        try:
            self.adverse_reaction = selector.xpath('//dt[text()="【不良反應】"]/following::*[1]/p/text()').extract_first().strip('。')  # adverse reactions
        except AttributeError:
            # No <p> child: fall back to the text of the element itself
            try:
                self.adverse_reaction = selector.xpath('//dt[text()="【不良反應】"]/following::*[1]/text()').extract_first().strip('。')  # adverse reactions
                self.adverse_reaction = re.sub('<.*?>','',self.adverse_reaction).strip().replace('\n','')  # strip tags and newlines
            except AttributeError:
                self.adverse_reaction = '未知'
        attention_matters = selector.xpath('//dt[text()="【注意事項】"]/following::*[1]').extract_first()
        if attention_matters:
            self.attention_matters = re.sub('<.*?>','',attention_matters).strip().replace('\n','')  # precautions
        else:
            self.attention_matters = '未知'
            # Log that the precautions section is empty for this drug
            self.logger.log('{}[{}]-注意事項為空'.format(self.drug_name,self.drug_id),False)
        try:
            self.interaction = selector.xpath('//dt[text()="【葯物相互作用】"]/following::*[1]/p/text()').extract_first()  # drug interactions
            self.interaction = re.sub('<.*?>','',self.interaction).strip().replace('\n','')  # raises TypeError when nothing was found
        except TypeError:
            self.interaction = '未知'
        try:
            self.pharmacological_action = selector.xpath('//dt[text()="【葯理作用】"]/following::*[1]/p/text()').extract_first()  # pharmacological action
            self.pharmacological_action = re.sub('<.*?>','',self.pharmacological_action).strip().replace('\n','')
        except TypeError:
            self.pharmacological_action = '未知'
        try:
            self.revision_date = selector.xpath('//dt[text()="【說明書修訂日期】"]/following::*[1]/text()').extract_first().strip()  # package insert revision date
        except AttributeError:
            self.revision_date = '未知'
        try:
            self.special_population = selector.xpath('//dt[text()="【特殊人群用葯】"]/following::*[1]/text()').extract_first()  # use in special populations
            self.special_population = re.sub('<.*?>','',self.special_population).strip().replace('\n','')
        except TypeError:
            self.special_population = '未知'

    def drug_consultation(self,file):
        """Medication consultation page"""
        html = self.read_file(file)
        selector = Selector(text=html)
        drug_use_consult = selector.css('.dpzx_con .zx p::text').extract()
        drug_use_consult = ''.join(drug_use_consult)
        drug_use_consult = re.sub('<.*?>','',drug_use_consult).strip().replace('\n','')  # medication consultations
        self.drug_use_consult += drug_use_consult

    def drug_experience(self,file):
        """Medication experience page"""
        html = self.read_file(file)
        selector = Selector(text=html)
        drug_use_experience = selector.css('.pls_box .pls_mid p::text').extract()
        drug_use_experience = ''.join(drug_use_experience)
        drug_use_experience = re.sub('<.*?>','',drug_use_experience).strip().replace('\n','')  # medication experience
        self.drug_use_experience += drug_use_experience.strip()
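
The extraction methods of this class repeat the same clean-up many times: strip tags, trim whitespace, drop newlines, and fall back to '未知' when nothing was found. As one possible simplification, that logic could live in a single function. The sketch below is not part of the original code; clean_text is a hypothetical helper, and unlike the original it also returns the default when the cleaned text ends up empty.

```python
import re


def clean_text(raw, default="未知"):
    """Strip tags, surrounding whitespace and newlines from raw.

    Returns default when raw is None or empty, or when nothing is left
    after cleaning.
    """
    if not raw:
        return default
    text = re.sub(r"<.*?>", "", raw).strip().replace("\n", "")
    return text or default


# Values like those returned by extract_first(): a fragment, or None when
# the XPath matched nothing
print(clean_text("<dd> line one\nline two </dd>"))
print(clean_text(None))
```

A call such as clean_text(selector.xpath(...).extract_first()) would then replace each try/except block. Scrapy's extract_first() also accepts a default argument, which covers the simple cases where no clean-up is needed.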


Advanced XPath usage
