Selenium in Practice, Script Collection (2): A Simple Zhihu Crawler

Reposted article, dated 2015-04-08 12:59. Tag: webdriver

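Background

Many people have no real-world selenium environment at work, so those teaching themselves feel they have energy but nowhere to apply it: they want to learn but do not know how to practice. Learning anything new works the same way, through repeated practice. Here 乙醇 offers some useful and reasonably challenging exercises to help you get comfortable with selenium webdriver quickly. You only develop a feel for it by using it often.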

Exercise

Go to http://www.zhihu.com/explore and use selenium to fetch the titles and content of today's hottest and this month's hottest articles.

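Knowledge points used

Crawling basics. Use webdriver to scrape content from a page. The core API used is getAttribute (get_attribute in the Python bindings);
How to navigate to a new page
Observation: sometimes switching tabs does not require a click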
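
The last point can be illustrated without a browser. The sketch below assumes, as the reference code does, that each tab on the explore page is addressed by a URL fragment (daily-hot, monthly-hot); the helper name tab_url is hypothetical, and whether the site still honors these fragments is not guaranteed.

```python
from urllib.parse import urlsplit, urlunsplit

EXPLORE_URL = "https://www.zhihu.com/explore"


def tab_url(base_url, tab_id):
    """Return base_url with its fragment set to tab_id (hypothetical helper)."""
    parts = urlsplit(base_url)
    return urlunsplit(parts._replace(fragment=tab_id))


daily_url = tab_url(EXPLORE_URL, "daily-hot")
monthly_url = tab_url(EXPLORE_URL, "monthly-hot")
print(daily_url)
print(monthly_url)
# With a real driver, driver.get(monthly_url) would take the place of
# locating the tab element and clicking it.
```

Building the URL this way keeps the tab identifiers in one place, so the daily and monthly lists can share the same navigation code.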

Reference code

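# encoding: utf-8

"""
Fetch the daily and monthly hot lists from zhihu.com.
"""

from selenium import webdriver
from selenium.webdriver.common.by import By

# Python 3 strings are Unicode by default, so the Python 2
# reload(sys) / sys.setdefaultencoding("utf-8") workaround is not needed.


class Zhihu:
    def __init__(self):
        self.daily_url = 'https://www.zhihu.com/explore#daily-hot'
        self.monthly_url = 'https://www.zhihu.com/explore#monthly-hot'

    def __enter__(self):
        # Start the browser when entering the with-block.
        self.dr = webdriver.Firefox()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Always close the browser, even if an exception was raised.
        self.dr.quit()

    def get_daily_hots(self):
        """Return one article dict per question on the daily hot list."""
        result = []
        hots_urls = self.get_daily_hots_urls()
        for url in hots_urls:
            result.append(self.get_answer(url))
        return result

    def get_answer(self, url):
        """Load a question page in the current window and read the first matching answer."""
        self.dr.get(url)
        # wrap_div = self.dr.find_element(By.CSS_SELECTOR, '.zm-item-answer.zm-item-expanded')
        article = {}
        article['question'] = self.dr.find_element(By.CSS_SELECTOR, '#zh-question-title').text
        article['author'] = self.dr.find_element(By.CSS_SELECTOR, '.author-link').text
        # innerHTML keeps the answer's markup instead of only its visible text.
        article['answer'] = self.dr.find_element(
            By.CSS_SELECTOR, '.zm-editable-content.clearfix').get_attribute('innerHTML')

        return article

    def get_monthly_hots(self):
        # Left unimplemented on purpose; see the Challenge section.
        pass

    def get_daily_hots_urls(self):
        """Collect the question links from the daily hot tab."""
        self.dr.get(self.daily_url)
        wrap_div = self.dr.find_element(By.CLASS_NAME, 'tab-panel')
        title_url_elements = wrap_div.find_elements(By.CLASS_NAME, 'question_link')
        # The page listed five questions when this script was written.
        assert len(title_url_elements) == 5
        urls = []
        for title in title_url_elements:
            urls.append(title.get_attribute('href'))
        return urls


if __name__ == '__main__':
    with Zhihu() as zhihu:
        articles = zhihu.get_daily_hots()
        for article in articles:
            print(article['question'], article['author'])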

Video walkthrough

(skipped)

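Common mistakes

A small trick: fetching an answer does not actually require opening a new window, as the reference code shows
The monthly hot list does not require clicking the tab; just visit it directly by URL
Avoid hard-to-maintain XPath locators; something like /div[2]/span[1]/a[0], which is tightly coupled to the DOM structure, is hard to maintain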
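
To see why positional paths are fragile, here is a minimal sketch that uses Python's standard xml.etree.ElementTree as a stand-in for the browser DOM (it is not selenium, and it supports only a small XPath subset). The two markup snippets are invented for illustration; only the question_link and tab-panel class names come from the reference code.

```python
import xml.etree.ElementTree as ET

# Invented markup: a simplified page before and after a banner is added.
ORIGINAL = """
<body>
  <div class="header"></div>
  <div class="tab-panel">
    <span><a class="question_link" href="/question/1">Q1</a></span>
  </div>
</body>
"""

CHANGED = """
<body>
  <div class="banner"></div>
  <div class="header"></div>
  <div class="tab-panel">
    <span><a class="question_link" href="/question/1">Q1</a></span>
  </div>
</body>
"""

POSITIONAL = "./div[2]/span[1]/a[1]"        # depends on element order
BY_CLASS = ".//a[@class='question_link']"   # depends only on the class


def find_href(markup, path):
    """Return the href of the first element matching path, or None."""
    root = ET.fromstring(markup.strip())
    link = root.find(path)
    return None if link is None else link.get("href")


for name, markup in (("original", ORIGINAL), ("changed", CHANGED)):
    print(name, find_href(markup, POSITIONAL), find_href(markup, BY_CLASS))
```

After one extra div is inserted, the positional path points at the header and finds nothing, while the class-based path still reaches the link. The same reasoning is why the reference code locates elements by class name and CSS selector.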

Challenge

Try completing the get_monthly_hots() method yourself, paying attention to code reuse
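
As a hint about reuse rather than a full solution, one possible shape is a single method that takes the tab URL as a parameter, so the daily and monthly lists differ only in the URL passed in. Everything named here (HotList, get_hots_urls, StubDriver, StubLink) is hypothetical, and the sketch assumes the monthly tab uses the same question_link class as the daily one, which the original does not confirm. The stub driver stands in for a real webdriver so the example runs without a browser.

```python
class HotList:
    """Collects question links from one tab of the explore page."""

    def __init__(self, driver):
        self.driver = driver

    def get_hots_urls(self, tab_url):
        # Shared by both lists; only the URL differs.
        self.driver.get(tab_url)
        # "class name" is the string value of selenium's By.CLASS_NAME.
        links = self.driver.find_elements("class name", "question_link")
        return [link.get_attribute("href") for link in links]

    def get_daily_hots_urls(self):
        return self.get_hots_urls("https://www.zhihu.com/explore#daily-hot")

    def get_monthly_hots_urls(self):
        return self.get_hots_urls("https://www.zhihu.com/explore#monthly-hot")


class StubLink:
    """Minimal stand-in for a selenium WebElement."""

    def __init__(self, href):
        self.href = href

    def get_attribute(self, name):
        return self.href if name == "href" else None


class StubDriver:
    """Minimal stand-in for a webdriver, backed by a dict of url -> hrefs."""

    def __init__(self, pages):
        self.pages = pages
        self.current = None

    def get(self, url):
        self.current = url

    def find_elements(self, by, value):
        return [StubLink(href) for href in self.pages.get(self.current, [])]


# Invented sample data for the stub.
pages = {
    "https://www.zhihu.com/explore#daily-hot": ["/question/1", "/question/2"],
    "https://www.zhihu.com/explore#monthly-hot": ["/question/9"],
}
hot_list = HotList(StubDriver(pages))
print(hot_list.get_daily_hots_urls())
print(hot_list.get_monthly_hots_urls())
```

Passing the driver in from outside also makes the class easy to exercise with a stub, as done here, before pointing it at a real browser.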
