[Selenium2+python2.7][Scrap] Scraping a Jianshu author's article list with a crawler and with Selenium scrolling, then generating a Markdown table of contents


Estimated reading time: 15 minutes

Environment: win7 + Selenium 2.53.6 + Python 2.7 + Firefox 45.2 (for setup details see http://www.cnblogs.com/yoyoketang/p/selenium.html)

FF45.2 official download: http://ftp.mozilla.org/pub/firefox/releases/45.2.0esr/win64/en-US/

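Pain point: a friend of my dad's recently published 20-odd articles on Jianshu and asked me to add a table of contents. Looking up each link by hand and then adding the title is tedious: 30-odd articles take more than half an hour, and the links may change.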

Solution: Jianshu supports Markdown, so it is enough to scrape the author's article list and generate a Markdown document from it.
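To make the target output concrete, here is a minimal sketch of the Markdown side only. The helper names `markdown_link` and `markdown_menu` are my own for illustration and do not appear in the scripts below. Each entry becomes `[title](url)`, and entries are separated by a blank line, as the Selenium script later in the post does when it writes its file.

```python
def markdown_link(title, url):
    """Return one Markdown link: [title](url).

    Square brackets in the title are backslash-escaped so they cannot
    end the link text early.
    """
    safe_title = title.strip().replace("[", "\\[").replace("]", "\\]")
    return "[" + safe_title + "](" + url + ")"


def markdown_menu(entries):
    """Join (title, url) pairs into a menu, one link per paragraph."""
    return "\n\n".join(markdown_link(t, u) for t, u in entries)


# Usage with a made-up title; the article id comes from the sample
# anchor tag quoted in the Selenium script.
print(markdown_menu([("Sample title", "http://www.jianshu.com/p/6f543f43aaec")]))
```

Escaping the brackets is an extra precaution that the scripts in this post do not take; drop it if your titles never contain them.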

 

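Original approach 1: scrape the article list with urllib2

Steps:

1. Use urllib2 to open the page with a request that carries a browser-like header
2. Match the href links with a regular expression, then build the full URLs with a list comprehension
3. Extract the titles with a regular expression
4. Generate the table of contents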

# coding=utf-8
import re
import urllib2


def getHtml(url):
    # Send a browser-like User-Agent so the request looks like a normal page view.
    header = {"User-Agent": 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.101 Safari/537.36'}
    request = urllib2.Request(url, headers=header)  # build the request from url and headers
    response = urllib2.urlopen(request)             # open the url
    return response.read()


def getTitleLink(html):
    # Article ids, e.g. /p/6f543f43aaec
    pattern1 = re.compile(r'<a class="title" target="_blank" href="/p/(\w{0,12})"', re.S)
    links = re.findall(pattern1, html)
    # Include the scheme so the generated Markdown links are absolute.
    urls = ["http://www.jianshu.com/p/" + link for link in links]

    # Article titles, in the same order as the ids above.
    pattern2 = re.compile(r'<a class="title" target="_blank" href="/p/.*?">(.*?)</a>', re.S)
    titles = re.findall(pattern2, html)
    for title, url in zip(titles, urls):
        # Skip the table-of-contents article itself.
        if '目錄' not in title:
            print "[" + title + "](" + url + ")"


# Sample author page
url = 'http://www.jianshu.com/u/73632348f37a'
html = getHtml(url)
getTitleLink(html)
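The regex step can be tried offline, without hitting the site. The sketch below is a variation on the code above, not the same method: it captures the article id and the title with a single pattern, so the two values always come from the same tag. `SAMPLE_HTML` is a made-up fragment whose attribute order follows the pattern used above; the real page markup may differ, and the `extract_links` name is hypothetical.

```python
import re

# One pattern with two groups: (article id, title).
LINK_RE = re.compile(
    r'<a class="title" target="_blank" href="/p/(\w+)">(.*?)</a>', re.S)


def extract_links(html, base="http://www.jianshu.com/p/"):
    """Return a list of (title, url) tuples found in html."""
    return [(title.strip(), base + article_id)
            for article_id, title in LINK_RE.findall(html)]


# Hypothetical fragment, for illustration only.
SAMPLE_HTML = '''
<a class="title" target="_blank" href="/p/6f543f43aaec">First post</a>
<a class="title" target="_blank" href="/p/0123456789ab"> Second post </a>
'''

for title, url in extract_links(SAMPLE_HTML):
    print("[%s](%s)" % (title, url))
```

Regex matching on HTML is fragile: if the site reorders the attributes, the pattern silently matches nothing, which is one more reason the Selenium approach below queries the DOM instead.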

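Testing shows that when the author has only five or six articles, the list is generated correctly.

With 20 or more articles, however, a problem appears:

This approach only picks up the article links present in the initially loaded page. Titles that load dynamically as you drag the scrollbar by hand never reach urllib2. The usual advice online is to solve this with Selenium.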

 

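Approach 2: open the page with Selenium and call JavaScript to pull the scrollbar down until the whole page has loaded

Steps:

1. Open the page with Selenium
2. Call JavaScript in a loop to pull the scrollbar down until the whole page has loaded
3. Find the title tags with find_elements_by_xpath
4. Parse the title tags, write them into the table of contents, and print them

Note: step 3 returns objects of type WebElement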
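Since step 3 returns WebElement objects rather than strings, the title comes from the element's `.text` and the link from `.get_attribute("href")`. The sketch below shows only that unpacking. `FakeElement` is a hypothetical stand-in I made up so the snippet runs without a browser; it mimics just those two members. `elements_to_pairs` is likewise an illustrative name.

```python
class FakeElement(object):
    """Hypothetical stand-in exposing the two WebElement members used here."""

    def __init__(self, text, href):
        self.text = text
        self._attrs = {"href": href}

    def get_attribute(self, name):
        # Like a real WebElement, return None for a missing attribute.
        return self._attrs.get(name)


def elements_to_pairs(elements):
    """Turn WebElement-like objects into (title, href) tuples."""
    pairs = []
    for el in elements:
        href = el.get_attribute("href")
        if href:  # skip anchors that carry no link
            pairs.append((el.text.strip(), href))
    return pairs


print(elements_to_pairs([FakeElement(" titleXXX", "http://www.jianshu.com/p/6f543f43aaec")]))
```

With a real driver you would pass the list returned by `find_elements_by_xpath` straight into `elements_to_pairs`.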

# coding=utf-8
# Reference: http://www.cnblogs.com/haigege/p/5492177.html
# Scroll the page, then generate a Markdown-format menu.

import time

from selenium import webdriver


# Scroll to the top of the page
def scroll_top(driver):
    if driver.name == "chrome":
        js = "var q=document.body.scrollTop=0"
    else:
        js = "var q=document.documentElement.scrollTop=0"
    return driver.execute_script(js)


# Scroll to the bottom of the page
def scroll_foot(driver):
    if driver.name == "chrome":
        js = "var q=document.body.scrollTop=100000"
    else:
        js = "var q=document.documentElement.scrollTop=100000"
    return driver.execute_script(js)


def write_text(filename, info):
    """
    :param filename: file to append to
    :param info: text (unicode) to write
    :return: None
    """
    # Create/open the file in append mode, write the line plus a blank line.
    with open(filename, 'a+') as fp:
        fp.write(info.encode('utf-8'))
        fp.write('\n\n')


def scroll_multi(driver, times=5, loopsleep=2):
    # About 3 scrolls are enough for 40 titles.
    for i in range(times):
        time.sleep(loopsleep)
        print "Scroll foot %s time..." % i
        scroll_foot(driver)
    time.sleep(loopsleep)


# Note: titles is a list of WebElement objects
def write_menu(filename, titles):
    # Truncate the output file before appending to it.
    with open(filename, 'w') as fp:
        pass
    for title in titles:
        # title.text is unicode, so compare against a unicode literal.
        if u'目錄' not in title.text:
            link = title.get_attribute("href")
            print "[" + title.text + "](" + link + ")"
            # Strip ":" and "|" from the link text.
            t = title.text.replace(":", "").replace("|", "")
            write_text(filename, "[" + t + "](" + link + ")")


def main(url, filename):
    # e.g. <a class="title" href="/p/6f543f43aaec" target="_blank"> titleXXX</a>
    driver = webdriver.Firefox()
    driver.implicitly_wait(10)
    # driver.maximize_window()
    try:
        driver.get(url)
        scroll_multi(driver)
        # Article titles are the <a class="title"> elements.
        titles = driver.find_elements_by_xpath('.//a[@class="title"]')
        write_menu(filename, titles)
    finally:
        driver.quit()  # always close the browser


if __name__ == '__main__':
    # sample link
    url = 'http://www.jianshu.com/u/73632348f37a'
    filename = r'info.txt'
    main(url, filename)
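A fixed number of scrolls (5 by default above) is a guess: too few misses articles, too many wastes time. One possible alternative, sketched here as my own variation and not as part of the original script, is to keep scrolling until the page height stops growing. The function name `scroll_until_stable` and its parameters are hypothetical; it only relies on `driver.execute_script`, and it assumes the page grows taller each time a new batch of titles loads.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=30, sleep=time.sleep):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls that loaded new content.
    """
    get_height = "return document.documentElement.scrollHeight"
    last_height = driver.execute_script(get_height)
    loaded = 0
    for _ in range(max_rounds):
        driver.execute_script(
            "window.scrollTo(0, document.documentElement.scrollHeight);")
        sleep(pause)  # give the page time to fetch the next batch
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # nothing new appeared, assume the list is complete
        last_height = new_height
        loaded += 1
    return loaded
```

In the script above this would be called in place of the fixed-count scroll loop, right after `driver.get(url)`. On a slow connection the height can look stable before the next batch arrives, so a longer `pause` may be needed; `max_rounds` is a safety cap against endless feeds.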

 

Notes:

1. Reference: http://www.cnblogs.com/haigege/p/5492177.html

2. Environment download: Firefox45: https://ftp.mozilla.org/pub/firefox/releases/45.0esr/win64/en-US/

3. If you get an encoding error, add the following at the top of the script:

import sys
reload(sys)
sys.setdefaultencoding('utf8')
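If you would rather not change the interpreter-wide default encoding, another option is to open the output file with an explicit encoding through `io.open`, which is in the standard library on Python 2.7. This is a sketch of an alternative to the workaround above, not something the original script does; `append_line` is a hypothetical name.

```python
import io


def append_line(filename, text):
    """Append text plus a blank line to filename, encoded as UTF-8."""
    if isinstance(text, bytes):
        # Byte strings (plain str on Python 2) are assumed to be UTF-8.
        text = text.decode("utf-8")
    with io.open(filename, "a", encoding="utf-8") as fp:
        fp.write(text + u"\n\n")
```

Because the file object does the encoding, the caller can pass the unicode `title.text` values from Selenium directly, with no manual `.encode('utf-8')`.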

