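Now let's build a real crawler example:
it scrapes the titles and URLs of the recommended articles listed on the front page of my cnblogs personal home page.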
scrape_home_articles.py
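from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# Fetch the blog home page and parse it with Python's built-in HTML parser.
html = urlopen("http://www.cnblogs.com/davidgu")
bsObj = BeautifulSoup(html, "html.parser")

# Article URLs start with this prefix; the dots are escaped so they match literally.
article_url = re.compile(r"^http://www\.cnblogs\.com/davidgu/p")

# Search only inside the main container and skip links that carry a class attribute.
for link in bsObj.find("div", {"id": "main_container"}).find_all("a", href=article_url):
    if 'href' in link.attrs and not ('class' in link.attrs):
        print(link.string)
        print(link.attrs['href'])
        print("--------------------------------------------------------------")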
Output:
[置頂]解決adb server端口被占用的問題
http://www.cnblogs.com/davidgu/p/4515236.html
--------------------------------------------------------------
[置頂]解決Eclipse下不自動拷貝apk到模擬器問題( The connection to adb is down, and a sever
http://www.cnblogs.com/davidgu/p/4390661.html
--------------------------------------------------------------
常用的正則表達式一覽
http://www.cnblogs.com/davidgu/p/4831357.html
--------------------------------------------------------------
C++ 11 - STL - 函數對象(Function Object) (上)
http://www.cnblogs.com/davidgu/p/4829097.html
--------------------------------------------------------------
...
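
The link filter used by the script above (href matches the article URL prefix, and the tag has no class attribute) can be tried offline, without hitting the site. The following is only a minimal sketch under stated assumptions: it swaps BeautifulSoup for the standard library's html.parser so it runs with no extra packages, the SAMPLE_HTML markup and the ArticleLinkParser class are made up for illustration, and it does not restrict the search to the main_container div the way the real script does.

```python
import re
from html.parser import HTMLParser

# Same URL prefix as the script above, with the dots escaped.
ARTICLE_URL = re.compile(r"^http://www\.cnblogs\.com/davidgu/p")

# Hypothetical markup invented for this sketch; the real page looks different.
SAMPLE_HTML = """
<div id="main_container">
  <a href="http://www.cnblogs.com/davidgu/p/1.html">First post</a>
  <a class="more" href="http://www.cnblogs.com/davidgu/p/2.html">Read more</a>
  <a href="http://www.cnblogs.com/davidgu/archive/2015.html">Archive</a>
</div>
"""


class ArticleLinkParser(HTMLParser):
    """Collect (title, url) pairs for <a> tags that pass the filter."""

    def __init__(self):
        super().__init__()
        self.articles = []
        self._href = None  # href of the matching <a> currently being read
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        # Keep only links with a matching href and no class attribute.
        if tag == "a" and href and "class" not in attrs and ARTICLE_URL.match(href):
            self._href = href
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.articles.append(("".join(self._text).strip(), self._href))
            self._href = None


parser = ArticleLinkParser()
parser.feed(SAMPLE_HTML)
parser.close()

for title, url in parser.articles:
    print(title)
    print(url)
```

With this sample, only the first link passes: the second is dropped because it has a class attribute, and the third because its URL does not start with the article prefix.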