Scraping e-book titles and links in Python with lxml and cssselect

Reposted article, dated 2017-03-14 00:03

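While browsing this site (http://blog.jobbole.com/29281/), I noticed that the e-books listed there looked good.

I wanted to download them, and since I am learning web scraping anyway, the code below uses lxml and cssselect to fetch the book titles and links as a small exercise.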

1. The download function

# Python 2 code (uses urllib2 and the print statement)
import urllib2
import lxml.html

def download(url, user_agent='wswp', num_retries=2):
    """Return the page body as a string, or None if the download fails."""
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        # Retry only on 5xx server errors, which may be temporary
        if num_retries > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            return download(url, user_agent, num_retries - 1)
    return html
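
For readers on Python 3, where urllib2 no longer exists, the same idea can be sketched roughly as follows. This is a minimal sketch rather than the original code: it assumes urllib.request as the replacement, and the opener parameter is a hypothetical addition (not in the original) so that the retry logic can be tried without a network connection.

```python
import urllib.error
import urllib.request


def download(url, user_agent='wswp', num_retries=2,
             opener=urllib.request.urlopen):
    """Fetch url and return the response body as bytes, or None on failure.

    Retries up to num_retries times, but only on 5xx server errors.
    `opener` is a hypothetical hook (an assumption, not in the original
    post); by default it is the real urllib.request.urlopen.
    """
    print('Downloading:', url)
    request = urllib.request.Request(url, headers={'User-agent': user_agent})
    try:
        return opener(request).read()
    except urllib.error.URLError as e:
        print('Download error:', e.reason)
        # Only HTTPError carries a status code; a plain URLError does not
        code = getattr(e, 'code', None)
        if num_retries > 0 and code is not None and 500 <= code < 600:
            return download(url, user_agent, num_retries - 1, opener)
        return None
```

Calling download('http://blog.jobbole.com/29281/') would use the real urlopen. Note that in Python 3 the result is bytes, not str; lxml.html.fromstring accepts either.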

2. Scraping the data (note the cssselect call)

if __name__ == "__main__":
    url = 'http://blog.jobbole.com/29281/'
    html = download(url)
    if html is not None:
        # Parse the page once, then walk every matching link
        tree = lxml.html.fromstring(html)
        # 'ol > li > a' matches <a> elements that are direct children of
        # <li> items in an ordered list
        for a in tree.cssselect('ol > li > a'):
            book = a.text_content()
            href = a.get('href')
            print book, href

That completes the scrape.
