Scraping Sina News with selenium + BeautifulSoup + phantomjs

Reposted article, 2016-01-20 14:04. Tags: selenium / phantomjs / BeautifulSoup

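1. Download phantomjs and add the folder that contains phantomjs.exe to the PATH environment variable. Alternatively, copy phantomjs.exe into a folder that is already on PATH. I use Anaconda, so I put phantomjs.exe into the Anaconda3 folder, which was already on PATH.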
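A quick way to confirm that this step worked is to ask Python where it would find the executable. The sketch below uses the standard-library shutil.which. The helper name locate_executable is made up for illustration, and it assumes the binary is called phantomjs (on Windows, shutil.which also tries the extensions listed in PATHEXT, such as .exe).

```python
import shutil


def locate_executable(name="phantomjs"):
    """Return the full path of `name` if it is on PATH, else None.

    `locate_executable` is a hypothetical helper, not part of selenium.
    """
    return shutil.which(name)


if __name__ == "__main__":
    path = locate_executable()
    if path is None:
        print("phantomjs was not found on PATH; add its folder to PATH first")
    else:
        print("phantomjs found at", path)
```

If the result is None, webdriver.PhantomJS() will most likely fail to start unless the executable path is passed to it explicitly.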

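2. Install selenium with pip: pip install selenium. Anaconda already ships with BeautifulSoup, so nothing needs to be installed for it, and phantomjs is the executable from the previous step rather than a pip package.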
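To check that both Python packages can be imported before running the scraper, something like the following sketch may help. It only uses the standard library; missing_modules is a hypothetical helper name, and note that BeautifulSoup is imported under the module name bs4.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    # selenium comes from `pip install selenium`; BeautifulSoup is the `bs4` module.
    missing = missing_modules(["selenium", "bs4"])
    if missing:
        print("missing modules:", ", ".join(missing))
    else:
        print("selenium and bs4 are both importable")
```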

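3. Scrape the data. The goal is to collect the text of every article under the company news (公司新聞) section of Sina News. That section is a list of news articles (shown in a screenshot in the original post), so the first task is to collect the link of each article, and the second is to open each link and extract the text.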

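The list is loaded by JavaScript, so fetching the page and parsing it with BeautifulSoup alone does not find the links; selenium + phantomjs is used instead. The approach is: simulate a click on the 公司新聞 (company news) button to enter that section, collect the links of all articles on the page, then simulate a click on the next-page link and repeat for each following page.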
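The "collect the links on each page, then move to the next page" part of this approach needs two small pieces of logic: keeping only article URLs and skipping links already seen on an earlier page. The sketch below shows that logic on its own with the standard library. collect_news_links and the seen set are illustrative names, and the sample URLs in the usage part are invented. Only the URL prefix comes from the article.

```python
import re

# URL prefix of company-news articles, with the dots escaped.
NEWS_LINK = re.compile(r"http://finance\.sina\.com\.cn/chanjing/gsnews/.")


def collect_news_links(hrefs, seen):
    """Return the hrefs that look like company-news articles and are new.

    `seen` is a set shared across pages, so a link that appears on
    several pages is only returned once.
    """
    fresh = []
    for href in hrefs:
        if href and NEWS_LINK.match(href) and href not in seen:
            seen.add(href)
            fresh.append(href)
    return fresh


if __name__ == "__main__":
    seen = set()
    # Invented example hrefs standing in for two scraped pages.
    page_1 = ["http://finance.sina.com.cn/chanjing/gsnews/example-a.shtml",
              "http://finance.sina.com.cn/chanjing/"]
    page_2 = ["http://finance.sina.com.cn/chanjing/gsnews/example-a.shtml",
              "http://finance.sina.com.cn/chanjing/gsnews/example-b.shtml"]
    for page in (page_1, page_2):
        for link in collect_news_links(page, seen):
            print(link)
```

Sharing one set across pages is a design choice for this sketch; a list that is reset on every page would only remove duplicates within a single page.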

Below is a rough implementation:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException  # caught in get_links
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import time
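def get_links(driver):
    '''
    Collect article links page by page and write them to the txt file.
    Writes to the module-level file object f.
    '''
    t1 = time.time()
    try:
        # "下一頁" = next page. The click happens before parsing, so the page
        # showing when get_links is first called is not scraped.
        driver.find_element(By.LINK_TEXT, "下一頁").click()
    except NoSuchElementException:
        time.sleep(1)
        # sometimes there is no next-page link; try "下5頁" (next 5 pages)
        driver.find_element(By.LINK_TEXT, "下5頁").click()
    time.sleep(1)
    # I don't know how to read the hrefs directly with selenium, so the
    # driver's page_source is handed to BeautifulSoup for parsing.
    bs = BeautifulSoup(driver.page_source, 'html.parser')
    links = []
    # use a regular expression to find all the links we need
    for i in bs.find_all('a', href=re.compile("http://finance.sina.com.cn/chanjing/gsnews/.")):
        link = i.get('href')
        if link not in links:  # drop duplicate links
            links.append(link)
            f.write(link+'\n')
    t2 = time.time()
    # current page number
    page_num = bs.find('span', {'class': 'pagebox_num_nonce'}).text
    page_num = int(page_num)
    if page_num>4:
        return
    print('Finished page %d in %d seconds'%(page_num,t2-t1))
    get_links(driver)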
    
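def get_text(links,out_file):
    '''
    Extract the article text. links is the list of article URLs;
    out_file is an open, writable file that receives one article per line.
    '''
    n=0
    for link in links:
        html = urlopen(link)
        bsObj = BeautifulSoup(html, 'html.parser')
        temp = ''
        try:
            for p in bsObj.find("div",{'id':re.compile('artibody')}).find_all('p'):
                temp = temp+p.text.strip()  # join all paragraphs together
            print(temp[:31])
            out_file.write(temp+'\n')
            n+=1
            print('Finished article %d'%n)
            print('\n')
        # AttributeError: no artibody div was found; UnicodeEncodeError: text
        # could not be encoded. Skipping the article is admittedly heavy-handed.
        except (AttributeError,UnicodeEncodeError):
            continue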
            
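if True:  # I save the links to a file first, so the job runs in two parts: collect links, then collect text
    f = open(r'E:\hei.txt','w')
    # PhantomJS support was removed in Selenium 4, so this needs Selenium 3.x.
    # If phantomjs.exe is not on PATH, its path can be passed to PhantomJS() as an argument.
    driver = webdriver.PhantomJS()
    driver.get("http://finance.sina.com.cn/chanjing/")
    driver.find_element(By.LINK_TEXT, "公司新聞").click()  # "公司新聞" = company news
    time.sleep(2)
    get_links(driver)
    f.close()
    driver.quit()  # also shuts down the phantomjs process

if True:  # scrape the article text
    with open(r'E:\hei.txt') as f:  # the link file written by the block above
        links = [link.strip() for link in f]
    with open(r'E:\heiii.txt','w',encoding='utf-8') as xl:
        get_text(links,xl)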
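
The comment in get_links says the author did not know how to read href values through selenium itself. One possible route, sketched below, is to call get_attribute("href") on the elements returned by driver.find_elements(By.TAG_NAME, "a"). The helper hrefs_from_elements is hypothetical and accepts any objects that have a get_attribute method, so it can be tried without a browser. It has not been tested against the Sina page.

```python
import re


def hrefs_from_elements(elements, pattern):
    """Collect matching href values from WebElement-like objects.

    Order is preserved and duplicates are dropped.
    """
    compiled = re.compile(pattern)
    links = []
    for element in elements:
        href = element.get_attribute("href")  # None when the attribute is absent
        if href and compiled.search(href) and href not in links:
            links.append(href)
    return links


# With a live driver this would be called roughly as follows (assumption,
# requires `from selenium.webdriver.common.by import By`):
#   anchors = driver.find_elements(By.TAG_NAME, "a")
#   links = hrefs_from_elements(anchors, "http://finance.sina.com.cn/chanjing/gsnews/.")
```

Whether this is preferable to passing page_source to BeautifulSoup depends on the page; each get_attribute call is a round trip to the browser, so parsing the page source once may well be faster on long lists.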
