Web crawler: scraping dynamic web pages with python3.6 + selenium + BeautifulSoup, suitable when the scraping frequency is low

This article is a repost. 2018-12-28 12:01. Category: web crawler

A note up front: this article mainly covers how to scrape data and images that are loaded by JS after the page itself has loaded.

The approach uses selenium (a Python package) + chrome (the Google browser) + chromedriver (the driver for the Google browser).

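It is recommended to download the latest versions of both chrome and chromedriver, and the two versions must be compatible with each other (reference: https://blog.csdn.net/yoyocat915/article/details/80580066)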

Headless mode (no browser window needs to be opened) is also supported.

Straight to the code. site_url is the address to crawl; CHROME_DRIVER_PATH is the location where chromedriver is stored.

from bs4 import BeautifulSoup
from selenium import webdriver

# Placeholder value: set this to where chromedriver is stored on your machine.
CHROME_DRIVER_PATH = '/path/to/chromedriver'


def get_dynamic_html(site_url):
    print('Loading dynamic page', site_url)
    chrome_options = webdriver.ChromeOptions()
    # Disable the sandbox
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    # Headless mode: no browser window is opened
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--ignore-ssl-errors')
    # options= replaces the deprecated chrome_options= keyword (Selenium 3.8+).
    # Newer Selenium 4 releases no longer accept executable_path; there, pass
    # service=Service(CHROME_DRIVER_PATH) from selenium.webdriver.chrome.service.
    driver = webdriver.Chrome(executable_path=CHROME_DRIVER_PATH, options=chrome_options)
    driver.set_page_load_timeout(100)
    try:
        try:
            driver.get(site_url)
        except Exception as e:
            # On timeout, carry on with whatever has loaded so far.
            # driver.execute_script('window.stop()')  # optionally stop loading
            print(e, 'dynamic web load timeout')
        data = driver.page_source
    finally:
        # Always close the browser, even if reading the page source fails.
        driver.quit()
    return BeautifulSoup(data, 'html.parser')

The function returns a soup object. You can search it for nodes with methods such as select, find and find_all to locate the nodes or data you want.

Likewise, if you want to save the page to disk as text:

try:
    # The with block closes the file automatically.
    with open('xxx.html', 'w', encoding='utf-8') as f:
        f.write(get_dynamic_html('https://xxx.com').prettify())
except Exception as e:
    print(e)

Now for a closer look at searching with BeautifulSoup.

First, how to locate a tag.

1. Using find (this blogger covers it in detail: https://www.jb51.net/article/109782.htm)

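find() returns the first match, e.g. soup.find(name='ul', attrs={'class': 'hh'}) returns the first ul with class='hh'
find_all() returns all matches
find_parent() searches upward through the parent tags and returns the first match
find_parents() searches upward through the parent tags and returns all matches
find_next_sibling() returns the next tag at the same level
find_next_siblings() returns all following tags at the same level
find_previous_sibling() returns the previous tag at the same level
find_previous() returns the first matching tag earlier in the document
find_all_previous() returns all matching tags earlier in the document
find_next() returns the first matching tag later in the document
find_all_next() returns all matching tags later in the document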
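
As a rough illustration of the find family, here is a minimal sketch run against a small piece of static HTML instead of a live page. The markup, the SAMPLE_HTML name and the variable names are made up for this example and are not from the original article.

```python
from bs4 import BeautifulSoup

# Made-up sample markup, used only for this illustration.
SAMPLE_HTML = """
<div id="box">
  <ul class="hh"><li>a</li><li>b</li></ul>
  <p id="mid">middle</p>
  <ul class="hh"><li>c</li></ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

first_ul = soup.find(name='ul', attrs={'class': 'hh'})   # first match only
all_uls = soup.find_all('ul', attrs={'class': 'hh'})      # every match, as a list
parent_id = first_ul.find_parent('div')['id']             # nearest enclosing <div>
next_tag = first_ul.find_next_sibling('p')                # next <p> at the same level
prev_tag = all_uls[1].find_previous_sibling('p')          # previous <p> at the same level
first_li_after_p = soup.find(id='mid').find_next('li')    # first <li> later in the document

print(len(all_uls), parent_id, next_tag.name, prev_tag['id'], first_li_after_p.get_text())
```

Note that 'class' is quoted inside attrs, because class is a reserved word in Python.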

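2. Using select

Select by tag name, class name and id, much like jQuery selectors. For example, soup.select('p.link#link1') locates <p class='link' id='link1'></p>. The parts are written without spaces; with spaces, as in 'p .link #link1', the selector would instead look for an element with id link1 nested inside a .link element inside a <p>.

Select by attribute, such as href, title or link, e.g. soup.select('p a[href="http://example.com/elsie"]')

This matches the innermost element, <a href='http://example.com/elsie'></a>, and only when it sits inside a <p></p>.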
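
The selectors described here can be tried out on static markup. The following is a small sketch only; the sample HTML and the variable names are assumptions made for the example. It also shows select_one, which returns a single element instead of a list.

```python
from bs4 import BeautifulSoup

# Made-up sample markup for illustration only.
SAMPLE_HTML = """
<p class="link" id="link1">intro</p>
<p class="story"><a href="http://example.com/elsie" title="Elsie">Elsie</a>
<a href="http://example.com/lacie">Lacie</a></p>
<div><a href="http://example.com/elsie">outside p</a></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# No spaces: tag name, class and id must all match the same element.
by_compound = soup.select('p.link#link1')
# A space is the descendant combinator: an <a> with this href somewhere inside a <p>.
by_attr = soup.select('p a[href="http://example.com/elsie"]')
# select_one returns the first match (or None) instead of a list.
first_titled = soup.select_one('a[title]')

print([t.get_text() for t in by_compound])
print([t.get_text() for t in by_attr])
print(first_titled['title'])
```

The <a> inside the <div> has the same href but is not returned by the second selector, because it has no <p> above it.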
         
Next, operations on nodes:

Delete a node: tag.decompose()

Insert a child node at a given position: tag.insert(0, child_tag)
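
A minimal sketch of both operations on a tiny document. The markup and the names box and child_tag are chosen for this example only, and creating the child with soup.new_tag is one possible way to obtain a tag to insert.

```python
from bs4 import BeautifulSoup

# Made-up one-line document for illustration only.
soup = BeautifulSoup(
    '<div id="box"><span class="ad">ad</span><p>text</p></div>', 'html.parser')
box = soup.find('div', id='box')

# decompose() removes the tag and everything inside it from the tree.
box.find('span', attrs={'class': 'ad'}).decompose()

# new_tag() creates a detached tag; insert(0, ...) makes it the first child.
child_tag = soup.new_tag('h2')
child_tag.string = 'title'
box.insert(0, child_tag)

print(str(box))
```

After these two steps the <span> is gone and the new <h2> sits in front of the <p>.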
         
In short, BeautifulSoup is a good way to filter elements. The next article covers filtering crawled content with regular expression matching.
