[Python Web Scraping] Scraping page text, scraping images, and scraping with Selenium


Not much prose below; the code mostly speaks for itself:

First, install the required libraries:

# install BeautifulSoup
pip install beautifulsoup4

# install requests
pip install requests

# install selenium (used in example ⑤)
pip install selenium

 

Next, the code!

① Scraping h4 text from a site that redirects

import requests
from bs4 import BeautifulSoup


# scrape h4 headings; requests follows the site's redirect automatically
url = "http://www.itest.info/courses"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

for course in soup.find_all('h4'):
    print(course.text)
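Since the live site may change or be unreachable, the same `find_all` pattern can be checked offline against an inline HTML snippet (the snippet below is a hypothetical stand-in, not the real page):

```python
from bs4 import BeautifulSoup

# Offline sketch of the same find_all pattern, using a made-up
# HTML fragment instead of the live site.
html = """
<div class="course">
  <h4>Selenium automation</h4>
  <h4>Interface testing</h4>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
titles = [h4.text for h4 in soup.find_all('h4')]
print(titles)  # → ['Selenium automation', 'Interface testing']
```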

② Scraping titles from v2ex

import requests
from bs4 import BeautifulSoup

# scrape hot-topic and tab titles from v2ex
url = "https://www.v2ex.com"
v2ex = BeautifulSoup(requests.get(url).text, 'html.parser')

for span in v2ex.find_all('span', class_='item_hot_topic_title'):
    print(span.find('a').text, span.find('a')['href'])

for title in v2ex.find_all('a', class_='topic-link'):
    print(title.text, url + title['href'])
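The `url + title["href"]` concatenation only works when `href` is root-relative. The standard library's `urllib.parse.urljoin` handles both relative and absolute hrefs, so it is a safer way to build the full link:

```python
from urllib.parse import urljoin

# urljoin correctly resolves root-relative paths and leaves
# already-absolute URLs untouched.
base = "https://www.v2ex.com"
print(urljoin(base, "/t/123456"))            # → https://www.v2ex.com/t/123456
print(urljoin(base, "https://example.com"))  # → https://example.com
```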

③ Scraping images from jandan.net

import requests
from bs4 import BeautifulSoup



headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
}

def download_file(url):
    '''Download an image to a local directory.'''
    print('Downloading %s' % url)
    local_filename = url.split('/')[-1]
    # directory in which to save the image
    img_path = "/Users/zhangc/Desktop/GitTest/project_Buger_2/Python爬蟲/img/" + local_filename
    print(local_filename)
    r = requests.get(url, stream=True, headers=headers)
    with open(img_path, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
    return img_path

url = 'http://jandan.net/drawings'
soup = BeautifulSoup(requests.get(url, headers=headers).text, 'html.parser')

def valid_img(src):
    '''Check that the src matches the keywords we want.'''
    return src.endswith('jpg') and '.sinaimg.cn' in src

for img in soup.find_all('img', src=valid_img):
    src = img['src']
    if not src.startswith('http'):
        src = 'http:' + src
    download_file(src)
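The chunked-write loop in `download_file` can be sketched offline, with `io.BytesIO` standing in for the streamed HTTP response body and the output file (a simplified stand-in for `iter_content`, not the real requests API):

```python
import io

# Offline sketch of the chunked-write pattern used in download_file.
fake_response = io.BytesIO(b"x" * 2500)   # stand-in for the streamed body
out = io.BytesIO()                        # stand-in for the open file
while True:
    chunk = fake_response.read(1024)      # mirrors iter_content(chunk_size=1024)
    if not chunk:
        break
    out.write(chunk)
print(len(out.getvalue()))  # → 2500
```

Streaming in fixed-size chunks keeps memory use flat no matter how large the image is, which is why the original code passes `stream=True`.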

④ Scraping hot titles from Zhihu

import requests
from bs4 import BeautifulSoup

headers = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
}
url = "https://www.zhihu.com/explore"
zhihu = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")
for title in zhihu.find_all('a', class_="ExploreSpecialCard-contentTitle"):
    print(title.text)
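The `class_` filter above can also be written as a CSS selector via `select()`. A quick offline check, using a hypothetical fragment in place of the Zhihu explore page:

```python
from bs4 import BeautifulSoup

# select() with a CSS class selector is equivalent to
# find_all('a', class_="..."); the HTML here is a made-up stand-in.
html = ('<a class="ExploreSpecialCard-contentTitle">Topic A</a>'
        '<a class="ExploreSpecialCard-contentTitle">Topic B</a>')
soup = BeautifulSoup(html, 'html.parser')
titles = [a.text for a in soup.select('a.ExploreSpecialCard-contentTitle')]
print(titles)  # → ['Topic A', 'Topic B']
```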

⑤ Scraping hot titles from Zhihu with Selenium

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service


# scrape with Selenium (Selenium 4 style: pass the driver path via Service)
url = "https://www.zhihu.com/explore"
driver = webdriver.Chrome(service=Service("/Users/zhangc/Desktop/GitTest/project_Buger_2/poium測試庫/tools/chromedriver"))
driver.get(url)

info = driver.find_element(By.CSS_SELECTOR, "div.ExploreHomePage-specials")
for title in info.find_elements(By.CSS_SELECTOR, "div.ExploreHomePage-specialCard > div.ExploreSpecialCard-contentList > div.ExploreSpecialCard-contentItem > a.ExploreSpecialCard-contentTitle"):
    print(title.text, title.get_attribute('href'))
driver.quit()

