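Estimated reading time: 15 minutes
Environment: Win7 + Selenium 2.53.6 + Python 2.7 + Firefox 45.2 (for setup details see http://www.cnblogs.com/yoyoketang/p/selenium.html)
Firefox 45.2 official download: http://ftp.mozilla.org/pub/firefox/releases/45.2.0esr/win64/en-US/
Pain point: A friend of my father's recently published more than 20 new articles on Jianshu and asked me to add a table of contents. Looking up each link by hand and then adding its title is tedious: 30-odd articles take over half an hour, and the links may change.
Solution: Jianshu supports Markdown, so it is enough to crawl the author's article list and generate the table of contents as a Markdown document.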
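For reference, each entry in the generated table of contents is an ordinary Markdown inline link. The snippet below is a minimal sketch of that target format; the helper name `markdown_link` is illustrative and not part of the original scripts, and the sample title and path reuse the example anchor quoted in the Selenium listing further down.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def markdown_link(title, url):
    """Return one Markdown inline link: [title](url)."""
    # Strip surrounding whitespace so the link text renders cleanly.
    return "[{0}]({1})".format(title.strip(), url)


print(markdown_link(" titleXXX", "http://www.jianshu.com/p/6f543f43aaec"))
```

Using an absolute URL, scheme included, keeps the link valid wherever the Markdown is pasted.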
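Original approach 1: crawl the article list with urllib2
Steps:
1. Use urllib2 to open the page with a simulated request header
2. Match the href links with a regular expression, then build the full links with a list comprehension
3. Extract the titles with a regular expression
4. Generate the table of contents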
```python
# coding=utf-8
import re
import urllib2


def getHtml(url):
    """Fetch a page while sending a browser-like User-Agent header."""
    header = {"User-Agent": 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.101 Safari/537.36'}
    request = urllib2.Request(url, headers=header)  # init user request with url and headers
    response = urllib2.urlopen(request)  # open url
    text = response.read()
    return text


def getTitleLink(html):
    """Print one Markdown link per article found in html."""
    # Article IDs, e.g. /p/6f543f43aaec
    pattern1 = re.compile(r'<a class="title" target="_blank" href="/p/(\w{0,12})"', re.S)
    links = re.findall(pattern1, html)
    # Include the scheme so Markdown treats the links as absolute.
    urls = ["http://www.jianshu.com/p/" + str(link) for link in links]

    # Article titles
    pattern2 = re.compile(r'<a class="title" target="_blank" href="/p/.*?">(.*?)</a>', re.S)
    titles = re.findall(pattern2, html)
    for title, url in zip(titles, urls):
        # Skip posts that are themselves a table of contents ("目錄").
        if r'目錄' not in title:
            print "[" + title + "](" + url + ")"
    # return urls


# sample test menu
url = 'http://www.jianshu.com/u/73632348f37a'
html = getHtml(url)
getTitleLink(html)
```
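One fragile point in the listing above is that links and titles come from two separate `findall` passes and are then paired with `zip`, so a single mismatch would shift every later pair. A possible alternative, sketched here, captures both in one pattern. The HTML snippet is a made-up sample modelled on the anchor format the regexes above expect, and `extract_entries` is a hypothetical name.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import re

# Made-up sample HTML, modelled on the anchor format matched above.
SAMPLE_HTML = '''
<a class="title" target="_blank" href="/p/6f543f43aaec">First post</a>
<a class="title" target="_blank" href="/p/0123456789ab">Second post</a>
'''

# One pattern with two groups keeps each link paired with its own title.
ENTRY_RE = re.compile(
    r'<a class="title" target="_blank" href="(/p/\w+)">(.*?)</a>', re.S)


def extract_entries(html, base="http://www.jianshu.com"):
    """Return a list of (title, absolute_url) tuples found in html."""
    return [(title.strip(), base + path)
            for path, title in ENTRY_RE.findall(html)]


for title, url in extract_entries(SAMPLE_HTML):
    print("[{0}]({1})".format(title, url))
```

This only changes how the pairs are built; it still sees nothing beyond the HTML it is given.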
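Testing showed that the table of contents is generated correctly when the author has only five or six articles.
With 20 or more articles, however, a problem appears:
This method only captures the article links loaded on the initial page. Titles that load dynamically as the scroll bar is dragged down cannot be fetched directly; the usual advice online is to solve this with Selenium.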
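Approach 2: open the page with Selenium and call JavaScript to simulate dragging the scroll bar until the whole page is loaded
Steps:
1. Open the page with Selenium
2. Call JavaScript in a loop to pull the scroll bar down until the whole page is loaded
3. Find the title tags with find_elements_by_xpath
4. Parse the title tags, write them to the table of contents, and print them
Note: step 3 returns objects of type WebElement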
```python
# coding=utf-8

# refer to http://www.cnblogs.com/haigege/p/5492177.html
# Step1: scroll and generate Markdown format Menu

from selenium import webdriver
import time


def scroll_top(driver):
    """Scroll back to the top of the page."""
    if driver.name == "chrome":
        js = "var q=document.body.scrollTop=0"
    else:
        js = "var q=document.documentElement.scrollTop=0"
    return driver.execute_script(js)


def scroll_foot(driver):
    """Scroll to the bottom of the page."""
    if driver.name == "chrome":
        js = "var q=document.body.scrollTop=100000"
    else:
        js = "var q=document.documentElement.scrollTop=100000"
    return driver.execute_script(js)


def write_text(filename, info):
    """
    :param filename: file to append to (created if it does not exist)
    :param info: unicode text to write
    :return: none
    """
    with open(filename, 'a+') as fp:
        fp.write(info.encode('utf-8'))
        fp.write('\n'.encode('utf-8'))
        fp.write('\n'.encode('utf-8'))


def scroll_multi(driver, times=5, loopsleep=2):
    # 40 titles about 3 times
    for i in range(times):
        time.sleep(loopsleep)
        print "Scroll foot %s time..." % i
        scroll_foot(driver)
        time.sleep(loopsleep)


# Note: titles is a list of WebElement objects
def write_menu(filename, titles):
    # Truncate any output left over from a previous run.
    with open(filename, 'w') as fp:
        pass
    for title in titles:
        # title.text is unicode, so compare against a unicode literal.
        if u'目錄' not in title.text:
            href = title.get_attribute("href")
            print "[" + title.text + "](" + href + ")"
            # Chain the replacements on t instead of restarting from title.text.
            t = title.text
            t = t.replace(u":", u":")  # no-op as published; a full-width colon was probably intended on one side
            t = t.replace(u"|", u"丨")
            write_text(filename, "[" + t + "](" + href + ")")


def main(url, filename):
    # eg. <a class="title" href="/p/6f543f43aaec" target="_blank"> titleXXX</a>
    driver = webdriver.Firefox()
    driver.implicitly_wait(10)
    # driver.maximize_window()
    driver.get(url)
    scroll_multi(driver)
    # The published XPath also had a branch a[target="_blank"]; without "@"
    # it matched nothing, so only the class test is kept here.
    titles = driver.find_elements_by_xpath('.//a[@class="title"]')
    write_menu(filename, titles)
    driver.quit()


if __name__ == '__main__':
    # sample link
    url = 'http://www.jianshu.com/u/73632348f37a'
    filename = r'info.txt'
    main(url, filename)
```
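The fixed `times=5` scroll count has to be tuned by hand (the code comment estimates about 3 scrolls for 40 titles). A possible refinement, sketched under the assumption that the page grows taller each time more articles load, is to keep scrolling until `document.body.scrollHeight` stops changing. `scroll_until_stable`, `pause`, and `max_rounds` are hypothetical names; `execute_script` is the same WebDriver method used in the listing.

```python
import time


def scroll_until_stable(driver, pause=2, max_rounds=30):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed. max_rounds is a safety cap
    (an assumed value) so an endless feed cannot loop forever.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give the page time to load the next batch
        height = driver.execute_script("return document.body.scrollHeight")
        if height == last_height:
            return rounds  # nothing new was loaded, so stop
        last_height = height
    return max_rounds
```

With a real driver this call could stand in for the fixed-count scroll loop in `main`. A slow network may need a longer `pause`, otherwise the height check can stop too early.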
Notes:
1. Reference: http://www.cnblogs.com/haigege/p/5492177.html
2. Environment download: Firefox 45: https://ftp.mozilla.org/pub/firefox/releases/45.0esr/win64/en-US/
3. If an encoding error is raised, add the following at the top of the script:
```python
import sys
reload(sys)
sys.setdefaultencoding('utf8')
```
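`sys.setdefaultencoding` changes behaviour for the whole process and is generally discouraged. A hedged alternative sketch is to keep titles as Unicode text and let `io.open` do the encoding at the file boundary, which works on both Python 2.7 and Python 3. The helper name `append_line` and the temporary file path are illustrative, not part of the original script.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def append_line(filename, text):
    """Append text followed by a blank line to filename, encoded as UTF-8."""
    # io.open accepts an explicit encoding on Python 2.7 and Python 3 alike,
    # so no manual .encode()/.decode() calls are needed.
    with io.open(filename, "a", encoding="utf-8") as fp:
        fp.write(text + u"\n\n")


# Write to a throwaway directory so the demo leaves the working directory alone.
demo_path = os.path.join(tempfile.mkdtemp(), "info.txt")
append_line(demo_path, u"[目錄](http://www.jianshu.com/p/6f543f43aaec)")
```

The caller must pass Unicode strings (`u"..."` literals on Python 2.7); Selenium's `title.text` already is one.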