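1. Install the Python environment
Download the installer that matches your operating system from the official site, https://www.python.org/, run it, and configure the environment variables.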
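As a quick check that the installation and the environment variables work, a minimal sketch like the following can be saved to a file and run from a terminal. The function name describe_interpreter is made up for illustration; the script only uses the standard sys module and should behave the same under Python 2 and Python 3.

```python
import sys


def describe_interpreter():
    """Return the running interpreter's version string and executable path."""
    version = "%d.%d.%d" % sys.version_info[:3]
    return version, sys.executable


if __name__ == "__main__":
    version, path = describe_interpreter()
    # If this prints, the interpreter can be found on the PATH.
    print("Python %s at %s" % (version, path))
```

If the command is not found at all, the environment variables most likely still need to be configured.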
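2. Install the Python plugin for IntelliJ IDEA
I use IDEA. Search for the Python plugin from within the IDE and install it (search Baidu if you need step-by-step instructions).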
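3. Install the BeautifulSoup library
BeautifulSoup is a Python library rather than an IDE plugin; it can be installed with, for example, pip install beautifulsoup4. Documentation:
https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#attributes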
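To confirm that BeautifulSoup is installed and to see the attribute access that the linked documentation section describes, here is a minimal sketch. The HTML fragment is invented for illustration (the class names are borrowed from the crawler in step 4, but the real page markup may differ), and parse_sample is a hypothetical helper name.

```python
import bs4

# Hypothetical fragment shaped like one entry the crawler looks for.
SAMPLE_HTML = """
<div class="ing-item">
  <img src="https://example.com/avatar.png">
  <div class="feed_body">
    <a class="ing-author">alice</a>
    <span class="ing_body">hello world</span>
  </div>
</div>
"""


def parse_sample(html):
    """Extract the avatar URL, author and text from the sample fragment."""
    # Naming the parser explicitly avoids bs4's "no parser specified" warning.
    soup = bs4.BeautifulSoup(html, "html.parser")
    item = soup.find("div", class_="ing-item")
    return {
        "avatar": item.find("img")["src"],  # tag attributes use dict-style access
        "author": item.find("a", class_="ing-author").text,
        "text": item.find("span", class_="ing_body").text,
    }


if __name__ == "__main__":
    print(parse_sample(SAMPLE_HTML))
```

If the import fails, the library is not installed for the interpreter that the IDE is using.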
4. The crawler: scraping the short posts (閃存) on Cnblogs (博客園)
#!/usr/bin/python
# -*- coding: UTF-8 -*-
# Python 2 script (uses urllib2 and the print statement)
import urllib2
import time
import bs4


class CnBlogsSpider:
    '''Spider for ing.cnblogs.com'''
    # ${pageNo} is replaced with the page number; a timestamp is appended after "_="
    url = "https://ing.cnblogs.com/ajax/ing/GetIngList?IngListType=All&PageIndex=${pageNo}&PageSize=30&Tag=&_="

    # Fetch the HTML of the current page
    def getHtml(self):
        request = urllib2.Request(self.pageUrl)
        response = urllib2.urlopen(request)
        self.html = response.read()

    # Parse the HTML and print one line per post
    def analyze(self):
        self.getHtml()
        # Name the parser explicitly to avoid bs4's "no parser specified" warning
        bSoup = bs4.BeautifulSoup(self.html, "html.parser")
        divs = bSoup.find_all("div", class_='ing-item')
        for div in divs:
            img = div.find("img")['src']
            item = div.find("div", class_='feed_body')
            userName = item.find("a", class_='ing-author').text
            text = item.find("span", class_='ing_body').text
            pubtime = item.find("a", class_='ing_time').text
            # True when the post carries the star icon
            star = item.find("img", class_='ing-icon') is not None
            print '( avatar: ', img, 'nickname: ', userName, ', post: ', text, ', time: ', pubtime, ', starred: ', star, ')'

    # Crawl pages 1..page
    def run(self, page):
        pageNo = 1
        while (pageNo <= page):
            self.pageUrl = self.url.replace('${pageNo}', str(pageNo)) + str(int(time.time()))
            print '-------------\r\nData for page ', pageNo, ':', self.pageUrl
            self.analyze()
            pageNo = pageNo + 1


CnBlogsSpider().run(3)
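The script above targets Python 2. As a rough sketch of how the URL-building and download steps might look on Python 3, where urllib2's functionality lives in urllib.request, consider the following. The names URL_TEMPLATE, build_page_url and fetch_html are hypothetical, and treating the trailing timestamp as a cache-busting value is an assumption; the original post does not say what it is for.

```python
import time
import urllib.request

# Same endpoint template as the spider above; ${pageNo} is a placeholder.
URL_TEMPLATE = (
    "https://ing.cnblogs.com/ajax/ing/GetIngList"
    "?IngListType=All&PageIndex=${pageNo}&PageSize=30&Tag=&_="
)


def build_page_url(page_no, now=None):
    """Fill in the page number and append the current Unix time in seconds."""
    if now is None:
        now = time.time()
    return URL_TEMPLATE.replace("${pageNo}", str(page_no)) + str(int(now))


def fetch_html(page_url, timeout=10):
    """Download one page and return its raw bytes (requires network access)."""
    with urllib.request.urlopen(page_url, timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    # Print the URLs for pages 1 to 3 without contacting the server.
    for page_no in range(1, 4):
        print(build_page_url(page_no))
```

Keeping URL construction separate from the download makes it possible to check the generated addresses without touching the network; the bytes returned by fetch_html can then be handed to BeautifulSoup as in the spider.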
5. Results