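I. Introduction
The three building blocks of a web page:
- HTML provides the content;
- CSS provides the styling;
- JavaScript provides the behavior.
From a data point of view, what a page displays comes from three sources:
- the HTML document itself
- AJAX endpoints
- content generated by JavaScript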
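If you request a page with requests, you only get the HTML the server returns. Data that is loaded dynamically, such as items that appear as you scroll down, is not in it. Selenium started out as an automated-testing tool; crawlers use it mainly because requests cannot execute JavaScript. Selenium drives a real browser and reproduces what a user does (navigating, typing, clicking, scrolling), so you can read the page as it looks after rendering. It supports many browsers. For Python it ships as a third-party library whose API lets a script control the browser and automate it.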
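What Selenium is used for in crawlers:
- simulating a login
- conveniently collecting dynamically loaded data
Drawbacks:
- crawling is slow
- the environment is tedious to set up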
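II. Environment Setup
- Install Selenium: pip install selenium
- Download the browser driver: http://chromedriver.storage.googleapis.com/index.html
- Look up which driver version matches which browser version: https://www.cnblogs.com/Summer-skr--blog/p/11715259.html
Check your Chrome version first (for example on the chrome://version page), then pick the driver build that matches it.
Once the download finishes, you have the driver executable.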
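III. Basic Usage
1. Creating a browser
Selenium supports many browsers, including Chrome, Firefox, and Edge, as well as mobile browsers on Android and BlackBerry. It also supports the headless browser PhantomJS.
from selenium import webdriver

# Create one browser object; the commented lines are the alternatives.
browser = webdriver.Chrome()
# browser = webdriver.Firefox()
# browser = webdriver.Edge()
# browser = webdriver.PhantomJS()
# browser = webdriver.Safari()

browser.close()  # close the current window
browser.quit()   # quit the driver and close every window
close() only closes the current window, whereas quit() shuts down the driver and closes all windows.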
2. Opening a page
browser.get(url)  # load the page at url
page_text = browser.page_source  # HTML source of the page currently shown
3. Locating elements
Finding a single element (one node)
element = driver.find_element_by_id(value)
element = driver.find_element_by_name(value)
element = driver.find_element_by_class_name(value)
element = driver.find_element_by_tag_name(value)
element = driver.find_element_by_link_text(value)
element = driver.find_element_by_partial_link_text(value)
element = driver.find_element_by_xpath(value)
element = driver.find_element_by_css_selector(value)
Finding several elements (multiple nodes)
elements = driver.find_elements_by_id(value)
elements = driver.find_elements_by_name(value)
elements = driver.find_elements_by_class_name(value)
elements = driver.find_elements_by_tag_name(value)
elements = driver.find_elements_by_link_text(value)
elements = driver.find_elements_by_partial_link_text(value)
elements = driver.find_elements_by_xpath(value)
elements = driver.find_elements_by_css_selector(value)
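Notes:
(1) find_element_by_xxx returns the first element that matches; find_elements_by_xxx returns a list of all elements that match.
(2) Looking up the same element by ID, by CSS selector, or by XPath gives the same result.
(3) Selenium also provides the generic method find_element(), which takes two arguments: the lookup strategy (a By constant) and the value. It is the general form of methods such as find_element_by_id(): find_element_by_id(id) is equivalent to find_element(By.ID, id), and both return the same element. Selenium 4.3 removed the find_element_by_* helpers, so on current releases only the generic form is available.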
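The equivalence described in note (3) can be written down as a small lookup table. This is only a sketch: LOCATOR_STRATEGIES and find are hypothetical names, not part of Selenium. The strings are the values that Selenium's By class defines (By.ID is "id", By.CSS_SELECTOR is "css selector", and so on), written out so the snippet needs no import.

```python
# Hypothetical helper (not part of Selenium): map the suffix of each
# find_element_by_* method to the strategy string defined on the By class.
LOCATOR_STRATEGIES = {
    "id": "id",                                # By.ID
    "name": "name",                            # By.NAME
    "class_name": "class name",                # By.CLASS_NAME
    "tag_name": "tag name",                    # By.TAG_NAME
    "link_text": "link text",                  # By.LINK_TEXT
    "partial_link_text": "partial link text",  # By.PARTIAL_LINK_TEXT
    "xpath": "xpath",                          # By.XPATH
    "css_selector": "css selector",            # By.CSS_SELECTOR
}


def find(driver, how, value, many=False):
    """Generic lookup: find(driver, "id", "kw") mirrors find_element_by_id("kw")."""
    strategy = LOCATOR_STRATEGIES[how]
    if many:
        return driver.find_elements(strategy, value)  # list of all matches
    return driver.find_element(strategy, value)       # first match only
```

With a real driver, find(driver, "xpath", "//input[@name='username']") would behave like the find_element_by_xpath call used in the examples that follow.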
# Locate by id
<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="continue" type="submit" value="Login" />
  </form>
 </body>
</html>
login_form = driver.find_element_by_id('loginForm')

# Locate by name
<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="continue" type="submit" value="Login" />
   <input name="continue" type="button" value="Clear" />
  </form>
 </body>
</html>
username = driver.find_element_by_name('username')
password = driver.find_element_by_name('password')

# Locate by link text
<html>
 <body>
  <p>Are you sure you want to do this?</p>
  <a href="continue.html">Continue</a>
  <a href="cancel.html">Cancel</a>
 </body>
</html>
continue_link = driver.find_element_by_link_text('Continue')
continue_link = driver.find_element_by_partial_link_text('Conti')

# Locate by tag name
<html>
 <body>
  <h1>Welcome</h1>
  <p>Site content goes here.</p>
 </body>
</html>
heading1 = driver.find_element_by_tag_name('h1')

# Locate by class name
<html>
 <body>
  <p class="content">Site content goes here.</p>
 </body>
</html>
content = driver.find_element_by_class_name('content')

# Locate by CSS selector
<html>
 <body>
  <p class="content">Site content goes here.</p>
 </body>
</html>
content = driver.find_element_by_css_selector('p.content')

# XPath is the recommended way to locate elements (these lines use the login form shown above)
# the form that has an input child named "username" (the form itself, not the input)
login_form = driver.find_element_by_xpath("//form[input/@name='username']")
# the first input child of the form whose id is loginForm
username = driver.find_element_by_xpath("//form[@id='loginForm']/input[1]")
# the first input anywhere on the page whose name is "username"
username = driver.find_element_by_xpath("//input[@name='username']")
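4. Working with nodes
ele.text: the visible text of the node, including the text of all its descendants
driver.find_element_by_id('gin').text
ele.send_keys(""): type text into an input, such as a search box
driver.find_element_by_id('kw').send_keys("Python")
ele.click(): click the element
driver.find_element_by_id('su').click()
ele.get_attribute(""): read an attribute value
a = driver.find_element_by_tag_name('a')  # any located element works here
# text content of the element
att01 = a.get_attribute('textContent')
# HTML inside the element
att02 = a.get_attribute('innerHTML')
# HTML of the element itself together with its content
att03 = a.get_attribute('outerHTML')
# tag name of the element
tag01 = a.tag_name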
5. Action chains
from selenium.webdriver import ActionChains

source = browser.find_element_by_css_selector('')  # selector of the element to drag
target = browser.find_element_by_css_selector('')  # selector of the drop target
actions = ActionChains(browser)
# drag_and_drop() already releases the mouse button on the target
actions.drag_and_drop(source, target).perform()
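6. Switching between pages
Use this on sites where clicking a link opens a new page: the browser object still points at the previous page until you switch.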
window_handles = driver.window_handles
driver.switch_to.window(window_handles[-1])
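The two lines above can be wrapped into a helper that also remembers where you came from. This is a sketch, and switch_to_newest_window is a hypothetical name. It relies on the same assumption as the original lines, namely that the newest window is the last entry in window_handles.

```python
def switch_to_newest_window(driver):
    """Point the driver at the most recently opened window or tab.

    Returns the handle that was active before the switch, so the caller
    can return to it later with driver.switch_to.window(previous).
    """
    previous = driver.current_window_handle
    handles = driver.window_handles        # handles of all open windows
    driver.switch_to.window(handles[-1])   # assumption: newest window is last
    return previous
```

A typical use would be to click the link, call previous = switch_to_newest_window(driver), scrape the new page, then call driver.close() followed by driver.switch_to.window(previous).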
7. Saving a screenshot
driver.save_screenshot('screen.png')
8. Executing JavaScript
browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
9. Going back and forward
browser.back()
browser.forward()
10. Waiting
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://somedomain/")
try:
    # wait up to 10 seconds for the element to appear in the DOM
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    driver.quit()
Conditions available in expected_conditions:

title_is
title_contains
presence_of_element_located
visibility_of_element_located
visibility_of
presence_of_all_elements_located
text_to_be_present_in_element
text_to_be_present_in_element_value
frame_to_be_available_and_switch_to_it
invisibility_of_element_located
element_to_be_clickable
staleness_of
element_to_be_selected
element_located_to_be_selected
element_selection_state_to_be
element_located_selection_state_to_be
alert_is_present
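To make clear what an explicit wait does with one of these conditions, here is a plain-Python sketch of the polling idea. It is not Selenium's implementation: wait_until is a hypothetical function, and it raises the built-in TimeoutError, whereas WebDriverWait raises Selenium's own TimeoutException.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)  # pause before the next attempt
```

The design point is that the wait returns as soon as the condition holds, so it usually costs less time than a fixed sleep() of the same maximum length.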
11. Handling cookies
Reading, adding, and deleting cookies
browser.get_cookies()  # list of dicts, one per cookie
browser.add_cookie({'name': 'name', 'domain': 'www.zhihu.com', 'value': 'germey'})
browser.delete_all_cookies()
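A common follow-up in crawlers is to log in with Selenium and then reuse the session cookies elsewhere, for example with requests. The sketch below converts the list of dicts returned by get_cookies() into two convenient shapes. cookies_to_dict and cookies_to_header are hypothetical helper names, and only the 'name' and 'value' keys of each cookie are used.

```python
def cookies_to_dict(selenium_cookies):
    """Collapse the list returned by get_cookies() into {name: value}."""
    return {c["name"]: c["value"] for c in selenium_cookies}


def cookies_to_header(selenium_cookies):
    """Render the same cookies as the value of a Cookie request header."""
    return "; ".join("%s=%s" % (c["name"], c["value"]) for c in selenium_cookies)
```

The dict form can be passed to requests as requests.get(url, cookies=cookies_to_dict(browser.get_cookies())).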
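12. Searching for attribute values
- Search after locating the element
url = driver.find_element_by_name('t2').get_attribute('href')
- Search in the page source
You can search the source for any string, whether it is text or an attribute value.
res = driver.page_source.find('string')
The return value is -1 if the string is not found; any other value is the index of the first match.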
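Because page_source is an ordinary Python string, the check above is plain str.find(). A minimal sketch, using a made-up snippet of HTML in place of a real page (contains_text and sample are illustrative names):

```python
def contains_text(page_source, needle):
    """True if needle occurs anywhere in the page source."""
    return page_source.find(needle) != -1  # equivalent to: needle in page_source


# Made-up stand-in for driver.page_source
sample = '<a name="t2" href="/next">Next page</a>'
found = contains_text(sample, "href")         # True
not_found = contains_text(sample, "missing")  # False, since find() returned -1
```

When only a yes/no answer is needed, writing needle in page_source reads better than comparing against -1.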
13. Headless Chrome
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
path = 'chromedriver.exe'  # location of your chromedriver executable
browser = webdriver.Chrome(executable_path=path, options=chrome_options)
14. Avoiding detection
Some sites check whether a request comes from Selenium. A page can read window.navigator.webdriver: if the value is undefined (newer browsers may report false instead), the browser is not being driven by Selenium; if it is true, it is.
One way to avoid this detection:
from selenium import webdriver
from selenium.webdriver import ChromeOptions

option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
bro = webdriver.Chrome(executable_path='chromedriver.exe', options=option)
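15. Switching into a child frame
This is used to switch into and between iframe child frames.
iframe = driver.find_element_by_xxx('')  # locate the iframe element
driver.switch_to.frame(iframe)  # pass the located element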
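16. Loading pages without images
With the settings below the browser does not request images, which speeds up page loads.
from selenium import webdriver

chrome_opt = webdriver.ChromeOptions()
prefs = {"profile.managed_default_content_settings.images": 2}
chrome_opt.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(options=chrome_opt)  # the options take effect only when passed to the driver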
IV. Mouse and Keyboard Actions (ActionChains)
1. Basic usage of ActionChains
How ActionChains executes: calling a method on an ActionChains object does not run it immediately. Each action is appended to a queue in call order, and when you call perform() the queued actions run one after another.
There are two ways to write this:
Chained style
menu = driver.find_element_by_css_selector(".nav")
hidden_submenu = driver.find_element_by_css_selector(".nav #submenu1")
ActionChains(driver).move_to_element(menu).click(hidden_submenu).perform()
Step-by-step style
menu = driver.find_element_by_css_selector(".nav")
hidden_submenu = driver.find_element_by_css_selector(".nav #submenu1")
actions = ActionChains(driver)
actions.move_to_element(menu)
actions.click(hidden_submenu)
actions.perform()
The two styles are equivalent: either way, ActionChains runs all the actions in order.
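The queue-then-perform behavior can be illustrated with a toy class. This is not how Selenium implements ActionChains: MiniChain is a hypothetical stand-in that records strings instead of driving a mouse. It clears its queue after perform() purely to keep the example simple, which should not be read as a statement about Selenium's own behavior.

```python
class MiniChain:
    """Toy stand-in for ActionChains: methods only queue work."""

    def __init__(self):
        self._queue = []  # pending actions, in call order
        self.log = []     # what has actually been executed

    def move_to(self, name):
        self._queue.append(lambda: self.log.append("move_to " + name))
        return self       # returning self is what makes chaining possible

    def click(self, name):
        self._queue.append(lambda: self.log.append("click " + name))
        return self

    def perform(self):
        for action in self._queue:  # run everything in the order it was queued
            action()
        self._queue.clear()         # simplification made by this toy only


chain = MiniChain()
chain.move_to("menu").click("submenu")  # nothing has run yet
pending_log = list(chain.log)           # still empty at this point
chain.perform()                         # now both actions run, in order
```

Because every method returns self, the chained and the step-by-step styles build exactly the same queue.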
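2. ActionChains methods
- click(on_element=None): click the left mouse button
- click_and_hold(on_element=None): press the left mouse button and keep it held
- context_click(on_element=None): click the right mouse button
- double_click(on_element=None): double-click the left mouse button
- send_keys(*keys_to_send): send keys to the element that currently has focus
- send_keys_to_element(element, *keys_to_send): send keys to the given element
- key_down(value, element=None): press a key and hold it
- key_up(value, element=None): release a key
- drag_and_drop(source, target): drag source onto target, then release
- drag_and_drop_by_offset(source, xoffset, yoffset): drag source by the given offset, then release
- move_by_offset(xoffset, yoffset): move the mouse from its current position by the given offset
- move_to_element(to_element): move the mouse onto an element
- move_to_element_with_offset(to_element, xoffset, yoffset): move to a point at the given offset from an element (measured from its top-left corner)
- perform(): execute all actions in the chain
- release(on_element=None): release the left mouse button on an element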
3. Code examples
(1) Clicking
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from time import sleep

driver = webdriver.Firefox()
driver.implicitly_wait(10)
driver.maximize_window()
driver.get('http://sahitest.com/demo/clicks.htm')

click_btn = driver.find_element_by_xpath('//input[@value="click me"]')  # single-click button
doubleclick_btn = driver.find_element_by_xpath('//input[@value="dbl click me"]')  # double-click button
rightclick_btn = driver.find_element_by_xpath('//input[@value="right click me"]')  # right-click button

# chained style
ActionChains(driver).click(click_btn).double_click(doubleclick_btn).context_click(rightclick_btn).perform()
print(driver.find_element_by_name('t2').get_attribute('value'))
sleep(2)
driver.quit()
element.get_attribute() reads an attribute of an element.
(2) Moving the mouse
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from time import sleep

driver = webdriver.Firefox()
driver.implicitly_wait(10)
driver.maximize_window()
driver.get('http://sahitest.com/demo/mouseover.htm')

write = driver.find_element_by_xpath('//input[@value="Write on hover"]')  # hovering here writes "Mouse moved" into the input below
blank = driver.find_element_by_xpath('//input[@value="Blank on hover"]')  # hovering here clears the input below
result = driver.find_element_by_name('t1')

action = ActionChains(driver)
action.move_to_element(write).perform()  # move onto write; the input shows "Mouse moved"
print(result.get_attribute('value'))

# action.move_to_element(blank).perform()
action.move_by_offset(10, 50).perform()  # move by (10, 50) from the current position; same effect as the line above: lands on blank and clears the input
print(result.get_attribute('value'))

action.move_to_element_with_offset(blank, 10, -40).perform()  # move to (10, -40) relative to blank, which lands on write
print(result.get_attribute('value'))

sleep(2)
driver.quit()
(3) Drag and drop
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from time import sleep

driver = webdriver.Firefox()
driver.implicitly_wait(10)
driver.maximize_window()
driver.get('http://sahitest.com/demo/dragDropMooTools.htm')

dragger = driver.find_element_by_id('dragger')  # element to drag
item1 = driver.find_element_by_xpath('//div[text()="Item 1"]')  # target 1
item2 = driver.find_element_by_xpath('//div[text()="Item 2"]')  # target 2
item3 = driver.find_element_by_xpath('//div[text()="Item 3"]')  # target 3
item4 = driver.find_element_by_xpath('//div[text()="Item 4"]')  # target 4

action = ActionChains(driver)
action.drag_and_drop(dragger, item1).perform()  # 1. drag dragger onto item1
sleep(2)
action.click_and_hold(dragger).release(item2).perform()  # 2. same effect as method 1
sleep(2)
action.click_and_hold(dragger).move_to_element(item3).release().perform()  # 3. same effect as methods 1 and 2
sleep(2)
# action.drag_and_drop_by_offset(dragger, 400, 150).perform()  # 4. drag by a coordinate offset
action.click_and_hold(dragger).move_by_offset(400, 150).release().perform()  # 5. same as method 4: drag by a coordinate offset
sleep(2)
driver.quit()
Positioning by coordinates is rarely needed; method 1 above is usually enough. If you read the source, you will find that method 2 is how drag_and_drop() from method 1 is implemented. Note: add waits when dragging, because a drag sometimes fails when it runs too fast.
(4) Key presses
There are several ways to simulate key presses: with win32api, with SendKeys, or with the send_keys() method of Selenium's WebElement objects. The ActionChains class also provides a few methods for this.
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from time import sleep

driver = webdriver.Firefox()
driver.implicitly_wait(10)
driver.maximize_window()
driver.get('http://sahitest.com/demo/keypress.htm')

key_up_radio = driver.find_element_by_id('r1')  # radio: detect key up
key_down_radio = driver.find_element_by_id('r2')  # radio: detect key down
key_press_radio = driver.find_element_by_id('r3')  # radio: detect key press (down and up)
enter = driver.find_elements_by_xpath('//form[@name="f1"]/input')[1]  # input box
result = driver.find_elements_by_xpath('//form[@name="f1"]/input')[0]  # detection result

# detect key_down
key_down_radio.click()
ActionChains(driver).key_down(Keys.CONTROL, enter).key_up(Keys.CONTROL).perform()
print(result.get_attribute('value'))

# detect key_up
key_up_radio.click()
enter.click()
ActionChains(driver).key_down(Keys.SHIFT).key_up(Keys.SHIFT).perform()
print(result.get_attribute('value'))

# detect key_press
key_press_radio.click()
enter.click()
ActionChains(driver).send_keys('a').perform()
print(result.get_attribute('value'))
driver.quit()
Example 2:
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from time import sleep

driver = webdriver.Firefox()
driver.implicitly_wait(10)
driver.maximize_window()
driver.get('http://sahitest.com/demo/label.htm')

input1 = driver.find_elements_by_tag_name('input')[3]
input2 = driver.find_elements_by_tag_name('input')[4]

action = ActionChains(driver)
input1.click()
action.send_keys('Test Keys').perform()
action.key_down(Keys.CONTROL).send_keys('a').key_up(Keys.CONTROL).perform()  # ctrl+a
action.key_down(Keys.CONTROL).send_keys('c').key_up(Keys.CONTROL).perform()  # ctrl+c
action.key_down(Keys.CONTROL, input2).send_keys('v').key_up(Keys.CONTROL).perform()  # ctrl+v

print(input1.get_attribute('value'))
print(input2.get_attribute('value'))
driver.quit()
V. Examples
Example 1: open Baidu and search for "爬蟲" (crawler)
from selenium import webdriver
from time import sleep

bro = webdriver.Chrome()
bro.get(url='https://www.baidu.com/')
sleep(2)
text_input = bro.find_element_by_id('kw')
text_input.send_keys('爬蟲')
sleep(2)
bro.find_element_by_id('su').click()
sleep(3)
print(bro.page_source)
bro.quit()
Example 2: fetch more movie details from Douban Movies (headless Chrome)
from selenium import webdriver
from time import sleep
from selenium.webdriver.chrome.options import Options

# Step 1: these three lines are always the same
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')

url = 'https://movie.douban.com/typerank?type_name=%E6%83%8A%E6%82%9A&type=19&interval_id=100:90&action='

# Step 2: pass the chrome_options object to the driver
bro = webdriver.Chrome(options=chrome_options)
bro.get(url)
sleep(3)
# scroll to the bottom three times so that more entries load
bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
sleep(3)
bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
sleep(3)
bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
sleep(2)

page_text = bro.page_source
with open('./douban.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)
print(page_text)
sleep(1)
bro.quit()
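The example scrolls a fixed three times. A possible variation, sketched here, keeps scrolling until the page height stops growing. scroll_until_stable is a hypothetical helper, and it assumes the page loads more items only when you reach the bottom and that a stable height means nothing more will load.

```python
import time


def scroll_until_stable(driver, pause=3.0, max_rounds=10):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls that actually loaded more content.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    loaded = 0
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give the page time to fetch and render new items
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break          # nothing new appeared, so stop
        last_height = new_height
        loaded += 1
    return loaded
```

In the example it would replace the three execute_script calls with scroll_until_stable(bro). The max_rounds cap keeps an endless feed from looping forever.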
Example 3: log in to Qzone
Web pages often nest other pages in frames. WebDriver can only identify elements in one document at a time, so elements inside a nested frame cannot be located directly. You first have to move the driver's context into the frame with switch_to.frame(): switch to the iframe, then locate elements inside it. Otherwise the elements you want cannot be found.
from selenium import webdriver
from lxml import etree
import time

driver = webdriver.Chrome(executable_path=r'C:\Users\Administrator\chromedriver.exe')
driver.get('https://qzone.qq.com/')

# switch into the login iframe; only then can the login form be operated
driver.switch_to.frame('login_frame')

# click "log in with account and password"
driver.find_element_by_id('switcher_plogin').click()
# driver.find_element_by_id('u').clear()
driver.find_element_by_id('u').send_keys('QQ')  # your QQ number
# driver.find_element_by_id('p').clear()
driver.find_element_by_id('p').send_keys('password')  # your password
# click the login button
driver.find_element_by_id('login_button').click()
time.sleep(2)

driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
time.sleep(2)
driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
time.sleep(2)
driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
time.sleep(2)

page_text = driver.page_source  # page source; note that page_source takes no parentheses
tree = etree.HTML(page_text)

# parse the feed
li_list = tree.xpath('//ul[@id="feed_friend_list"]/li')
for li in li_list:
    text_list = li.xpath('.//div[@class="f-info"]//text()|.//div[@class="f-info qz_info_cut"]//text()')
    text = ''.join(text_list)
    print(text + '\n\n\n')
driver.quit()
Inspecting the page shows that the small login box is nested inside the larger page: the HTML source embeds a second HTML document, wrapped in an iframe tag. So whenever the element you want lives inside an iframe, you must use switch_to to move the driver's frame of reference into that iframe. Here the iframe has the id login_frame, which is what the code uses to select it.
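Once the driver is inside an iframe, elements of the outer page can no longer be located until you switch back. One way to keep the two switches paired is a small context manager, sketched here. inside_frame is a hypothetical helper; switch_to.frame() and switch_to.default_content() are the driver calls it relies on.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, frame_reference):
    """Switch into an iframe for the duration of a with-block.

    frame_reference may be whatever driver.switch_to.frame() accepts:
    a frame name or id, an index, or a located element.
    """
    driver.switch_to.frame(frame_reference)
    try:
        yield driver
    finally:
        driver.switch_to.default_content()  # back to the top-level page
```

In the Qzone example, the login steps would sit inside "with inside_frame(driver, 'login_frame'):", and the scrolling and parsing would follow after the block.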
Example 4: scrape a WeChat official account through the Sogou search interface (headless mode, avoiding detection, waiting, switching pages)
# add_argument adds a startup argument
# add_experimental_option adds an experimental setting
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
import time
import requests
from lxml import etree

option = webdriver.ChromeOptions()
option.add_argument('--headless')
# chromedriver startup setting that avoids Selenium detection
option.add_experimental_option('excludeSwitches', ['enable-automation'])
driver = webdriver.Chrome(options=option)

url = 'http://weixin.sogou.com/weixin?type=1&s_from=input&query=python_shequ'
driver.get(url)
print(driver.title)

timeout = 5
link = WebDriverWait(driver, timeout).until(
    lambda d: d.find_element_by_link_text('Python愛好者社區'))
link.click()
time.sleep(1)

# switch to the newly opened page
window_handles = driver.window_handles
driver.switch_to.window(window_handles[-1])
print(driver.title)

article_links = WebDriverWait(driver, timeout).until(
    # EC.presence_of_element_located((By.XPATH, '//h4[@class="weui_media_title"]'))
    lambda d: d.find_elements_by_xpath('//h4[@class="weui_media_title"]'))

article_link_list = []
for item in article_links:
    article_link = 'https://mp.weixin.qq.com' + item.get_attribute('hrefs')
    # print(article_link)
    article_link_list.append(article_link)
print(article_link_list)

first_article_link = article_link_list[0]
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60'
}
response = requests.get(first_article_link, headers=header, timeout=5)
tree = etree.HTML(response.text)
title = tree.xpath('//h2[@id="activity-name"]/text()')[0].strip()
content = tree.xpath('//div[@id="js_content"]//text()')
content = ''.join(content).strip()
print(title)
print(content)
Example 5: use Selenium to build a simulated publishing interface for a Toutiao account

import time
import redis
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.wait import WebDriverWait

r = redis.Redis('127.0.0.1', 6379)


def toutiao_save_and_preview(title, content, expand_link):
    option = webdriver.ChromeOptions()
    option.add_argument('--headless')
    driver = webdriver.Chrome(options=option)

    # render the article body and copy it
    driver.get('file:///Users/Documents/toutiao.html')
    driver.execute_script("contentIn('" + content + "');")
    timeout = 5
    content_copy = WebDriverWait(driver, timeout).until(lambda d: d.find_element_by_xpath('//button[@class="btn"]'))
    content_copy.click()

    # simulate a logged-in session on the publishing page
    cookie_toutiao = [
        {'name': 'ccid', 'value': 'db43e70fd9404338c49209ba04f7a11f'},
        {'name': 'tt_webid', 'value': '6612748996061414925'},
        {'name': 'UM_distinctid', 'value': '1667a53d28d449-0e229246a33996-4a506a-1fa400-1667a53d28e361'},
        {'name': 'sso_uid_tt', 'value': '4c8179804d74252717c675607c721602'},
        {'name': 'toutiao_sso_user', 'value': '8acc9b248cd201034637248021183d5a'},
        {'name': 'sso_login_status', 'value': '1'},
        {'name': 'sessionid', 'value': '8441fa3fc5ae5bc08c3becc780b5b2df'},
        {'name': '_mp_test_key_1', 'value': '6aba81df9e257bea2a99713977f1e33b'},
        {'name': 'uid_tt', 'value': '75b5b52039d4c9dd41315d061c833f0b'},
        {'name': 'ccid', 'value': '4231c5cd5a98033f2e78336b1809a18a'},
        {'name': 'tt_webid', 'value': '6631884089946523149'},
        {'name': 'UM_distinctid', 'value': '16783e1566479-0ae7bcdcaeb592-113b6653-13c680-16783e156656d4'},
        {'name': 'passport_auth_status', 'value': '99f731f2c6dc150e6dfea46799f20e90'},
        {'name': 'sso_uid_tt', 'value': 'f4bcd2cf972384b428449b0479475ce6'},
        {'name': 'toutiao_sso_user', 'value': '60df7bb620b4b6d1d17a1de83daec9c1'},
        {'name': 'sso_login_status', 'value': '1'},
        {'name': 'sessionid', 'value': '786fe64e9186d51b8427290a557b3c7b'},
        {'name': 'uid_tt', 'value': '91a7a72a85861ae686fb66177bc16bca'},
        {'name': '__tea_sdk__ssid', 'value': '60b289e6-e2a4-4494-a3e8-7936f9731426'},
        {'name': 'uuid', 'value': 'w:3ec91cefd76b438583154fea77baa54b'},
        {'name': 'tt_im_token', 'value': '1544105894108419437114683515671344747598423336731147829901779697'},
    ]
    # cookies can only be added once the browser is on the matching domain
    driver.get('https://mp.toutiao.com/profile_v3/index')
    for cookie in cookie_toutiao:
        driver.add_cookie(cookie)
    driver.get('https://mp.toutiao.com/profile_v3/graphic/publish')
    print(driver.title)
    # driver.maximize_window()

    # write the title
    print('write title')
    write_title = WebDriverWait(driver, timeout).until(lambda d: d.find_element_by_xpath('//*[@id="title"]'))
    write_title.click()
    write_title.send_keys(title)

    # paste the body
    print('write body')
    write_content = WebDriverWait(driver, timeout).until(lambda d: d.find_element_by_xpath(
        '//*[@id="graphic"]/div/div/div[2]/div[1]/div[2]/div[3]/div[2] | //div[contains(@class,"ql-editor")]'))
    write_content.click()
    write_content.clear()
    write_content.send_keys(Keys.SHIFT + Keys.INSERT)
    # time.sleep(1)

    # check whether the image upload has finished
    try:
        if 'img' in content:
            WebDriverWait(driver, timeout).until(
                lambda d: d.find_element_by_xpath('//div[@class="pgc-img-wrapper"]'))
            print('images uploaded success')
        else:
            print('no image included')
    except Exception:
        print('images uploaded fail')

    # scroll to the bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(1)

    # add the extended link
    expand_check = WebDriverWait(driver, timeout).until(lambda d: d.find_element_by_xpath(
        '//div[@class="pgc-external-link"]//input[@type="checkbox"]'))
    expand_check.click()
    expand_link_box = WebDriverWait(driver, timeout).until(lambda d: d.find_element_by_xpath(
        '//div[@class="link-input"]//input[@type="text"]'))
    expand_link_box.send_keys(expand_link)
    time.sleep(1)

    # choose the automatic cover image
    front_img = WebDriverWait(driver, timeout).until(lambda d: d.find_element_by_xpath(
        '//div[@class="article-cover"]/div/div[@class="tui2-radio-group"]/label[3]/div/input'))
    front_img.click()
    time.sleep(1)

    # save as draft
    save_draft = WebDriverWait(driver, timeout).until(lambda d: d.find_element_by_xpath(
        '//div[@class="publish-footer"]/button[4]/span'))
    save_draft.click()
    time.sleep(1)

    # get the preview link and article id from the content-management page
    print('get preview_link and article_id')
    # driver.refresh()
    preview_link = WebDriverWait(driver, timeout).until(lambda d: d.find_element_by_xpath(
        '//div[@id="article-list"]//div[@class="master-title"][1]/a')).get_attribute('href')
    article_id = preview_link.split('=')[-1]
    print(preview_link, article_id)
    time.sleep(1)

    content_management = WebDriverWait(driver, timeout).until(lambda d: d.find_element_by_link_text('內容管理'))
    content_management.click()
    time.sleep(1)
    driver.refresh()
    preview_link = WebDriverWait(driver, timeout).until(lambda d: d.find_element_by_xpath(
        '//*[@id="article-list"]/div[2]/div/div/div[1]/div/a')).get_attribute('href')
    article_id = preview_link.split('=')[-1]

    index_page = WebDriverWait(driver, timeout).until(lambda d: d.find_element_by_xpath('//a[@class="shead_logo"]'))
    index_page.click()
    driver.get('https://mp.toutiao.com/profile_v3/index')
    print(r.scard('cookie_pool_toutiao'))
    return preview_link, article_id


if __name__ == "__main__":
    print('start')
    start_time = time.time()
    title = 'Children'
    content = '<p>cute</p><p><img class="wscnph" src="http://img.mp.itc.cn/upload/20170105/1a7095f0c7eb4316954dda4a8b93b88c_th.jpg" /></p>'
    expand_link = 'https://www.cnblogs.com/Summer-skr--blog/'
    img = ''
    preview_link, article_id = toutiao_save_and_preview(title, content, expand_link)
    print(preview_link)
    print(article_id)
    finish_time = time.time()
    print(finish_time - start_time)
Selenium documentation:
https://www.seleniumhq.org/docs/
https://selenium-python.readthedocs.io