Scraping e-book titles and links in Python with lxml and cssselect

Reposted article, 2017-03-14 00:03

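While browsing this page (http://blog.jobbole.com/29281/), I noticed that the e-books it lists looked good.

I wanted to download them, and since I am learning web scraping anyway, the code below fetches the page and pulls out the book titles and links with lxml and cssselect, as a small exercise.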

1. The download function

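import urllib2  # Python 2 standard library; this import was missing from the listing

import lxml.html


def download(url, user_agent='wswp', num_retries=2):
    """Fetch url and return the page body, or None on failure (Python 2)."""
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        # Retry only on 5xx server errors, and only while retries remain.
        if num_retries > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            return download(url, user_agent, num_retries - 1)
    return html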
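
The listing above targets Python 2 (urllib2 and the print statement). As a rough sketch only, the same retry-on-5xx idea could look like this in Python 3 with urllib.request. The `opener` parameter is a hypothetical addition, not part of the original function; it is there so the retry logic can be exercised without network access.

```python
import urllib.error
import urllib.request


def download(url, user_agent='wswp', num_retries=2,
             opener=urllib.request.urlopen):
    """Return the response body as bytes, or None if the download fails.

    `opener` is a hypothetical hook (an assumption of this sketch); it
    defaults to urllib.request.urlopen.
    """
    print('Downloading:', url)
    request = urllib.request.Request(url, headers={'User-agent': user_agent})
    try:
        return opener(request).read()
    except urllib.error.URLError as e:
        print('Download error:', e.reason)
        # HTTPError (a URLError subclass) carries a status code;
        # plain URLError does not, so getattr falls back to None.
        code = getattr(e, 'code', None)
        if num_retries > 0 and code is not None and 500 <= code < 600:
            return download(url, user_agent, num_retries - 1, opener)
        return None
```

With the default num_retries=2, a URL that keeps answering with a 5xx status is requested at most three times; any other error returns None immediately.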

2. Extracting the data (note the use of cssselect)

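if __name__ == "__main__":
    url = 'http://blog.jobbole.com/29281/'
    html = download(url)
    if html is not None:
        # Parse the page once instead of re-parsing it on every iteration.
        tree = lxml.html.fromstring(html)
        # 'ol > li > a' matches links that are direct children of ordered-list items.
        # The original loop counted from index 1, so the first match is skipped here too;
        # drop the [1:] slice to include it.
        for a in tree.cssselect('ol > li > a')[1:]:
            book = a.text_content()
            href = a.get('href')
            print book, href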

That completes the data extraction.
