How to extract the title of a page at a given URL with Python

This article is a repost. 2018-06-29 09:50

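Today I briefly tried Python's re and lxml modules, using the regular expressions and XPath they provide to parse a page's source and extract its title. XPath handles a small task like this very efficiently. I also used a regular expression because XPath output can come out garbled on some unusual pages. That is not XPath's fault: it happens when the page's own encoding conflicts with the conversion to UTF-8.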
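To illustrate the encoding conflict described above, here is a minimal Python 3 sketch. It assumes a page served as GB2312 and uses a fixed list of candidate encodings rather than chardet; `decode_page` and its `candidates` parameter are hypothetical names introduced only for this example.

```python
# The sample title is the one printed in this article's results.
text = "百度一下，你就知道"
raw = text.encode("gb2312")  # bytes as a GB2312-encoded page would deliver them


def decode_page(data, candidates=("utf-8", "gb2312")):
    """Hypothetical helper: try each candidate encoding in order."""
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    # Last resort: decode as UTF-8 and replace the bytes that do not fit
    return data.decode("utf-8", errors="replace"), "utf-8 (lossy)"


decoded, used = decode_page(raw)
print(used, decoded)

# Forcing the wrong codec and ignoring errors silently damages the text
garbled = raw.decode("utf-8", errors="ignore")
print(garbled == text)
```

UTF-8 is tried first because it rejects most byte sequences that are not valid UTF-8, so a GB2312 page normally falls through to the second candidate instead of being misread.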

Here is the code (Python 2):
#!/usr/bin/python
# -*- coding: utf-8 -*-
'''
Purpose: extract the title from the content of the page at a given URL
'''
import re
import chardet
import urllib
from lxml import etree

def utf8_transfer(strs):
    '''
    Convert the input to a UTF-8 encoded byte string
    '''
    try:
        if isinstance(strs, unicode):
            strs = strs.encode('utf-8')
        elif chardet.detect(strs)['encoding'] == 'GB2312':
            strs = strs.decode("gb2312", 'ignore').encode('utf-8')
        elif chardet.detect(strs)['encoding'] == 'utf-8':
            strs = strs.decode('utf-8', 'ignore').encode('utf-8')
    except Exception, e:
        print 'utf8_transfer error', strs, e
    return strs

def get_title_xpath(Html):
    '''
    Extract the page title with XPath
    '''
    Html = utf8_transfer(Html)
    Html_encoding = chardet.detect(Html)['encoding']
    page = etree.HTML(Html, parser=etree.HTMLParser(encoding=Html_encoding))
    title = page.xpath('/html/head/title/text()')
    try:
        title = title[0].strip()
    except IndexError:
        # No <title> element was found, so print an empty title, not an empty list
        title = ''
        print 'Nothing'
    print title

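def get_title(Html):
    '''
    Extract the page title with a regular expression
    '''
    Html = utf8_transfer(Html)
    # Assumed pattern: match the whole <title>...</title> element,
    # consistent with the [7:-8] slice below
    compile_rule = ur'<title>.*?</title>'
    title_list = re.findall(compile_rule, Html)
    if title_list == []:
        title = ''
    else:
        # Drop the 7-character '<title>' and the 8-character '</title>'
        title = title_list[0][7:-8]
    print title
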
if __name__ == '__main__':
    url = 'http://www.baidu.com'
    html = urllib.urlopen(url).read()
    new_html = utf8_transfer(html)
    try:
        get_title_xpath(new_html)
        get_title(new_html)
    except Exception, e:
        print e
Here is the output:
百度一下，你就知道
百度一下，你就知道
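For readers on Python 3, the two approaches can be sketched roughly as follows. This is an illustration and not the code from this article: it works on an in-memory sample string instead of fetching a URL, and it uses the standard-library `html.parser` as a stand-in for lxml's XPath. The names `title_by_regex`, `TitleParser`, `title_by_parser` and `sample` are hypothetical.

```python
import re
from html.parser import HTMLParser

# Case-insensitive, allows attributes on the tag and newlines inside the title
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def title_by_regex(html_text):
    """Return the text of the first <title> element, or '' if there is none."""
    match = TITLE_RE.search(html_text)
    return match.group(1).strip() if match else ""


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lower case
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        return "".join(self._parts).strip()


def title_by_parser(html_text):
    """Return the first <title> text using the standard-library parser."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title


sample = "<html><head><TITLE>\n  Example page </TITLE></head><body><p>hi</p></body></html>"
print(title_by_regex(sample))
print(title_by_parser(sample))
```

Capturing the title with a group avoids the fixed `[7:-8]` slice, which only works when the tag is written exactly as `<title>` with no attributes or different casing.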
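This was a small, simple exercise. I will keep learning, and comments are welcome.
That is everything on extracting the title of a page at a given URL with Python. I hope it serves as a useful reference.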
