Preface


Author: 崩壞的芝麻

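Our lab needed a corpus of paper abstracts from CNKI for research. The latest version of the CNKI site is somewhat troublesome to crawl, so I used a different CNKI search interface instead.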

Take this page, for example:
http://search.cnki.net/Search.aspx?q=肉制品
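The query in the URL above contains Chinese characters, which are percent-encoded when the request is actually sent. A minimal sketch of building such a URL with the standard library follows; the helper name `build_search_url` is hypothetical, and only the `q` parameter is taken from the example above.

```python
from urllib.parse import urlencode

# Base address taken from the example URL above
SEARCH_BASE = "http://search.cnki.net/Search.aspx"


def build_search_url(keyword):
    """Return the search URL for one keyword (hypothetical helper)."""
    # urlencode percent-encodes the UTF-8 bytes of non-ASCII characters
    return SEARCH_BASE + "?" + urlencode({"q": keyword})


print(build_search_url("肉制品"))
```

requests performs the same encoding on its own when given a URL with raw Chinese characters, so this step is optional; doing it explicitly mainly makes the logged URLs unambiguous.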

The results it returns are almost identical to those on the main CNKI site.

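Starting from there, I took a quick look at the page structure, and the crawling code turned out to be easy to write (it is the most basic version and far from complete; add other features yourself as needed).
The structure of the results page is quite clear.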

The abstract information on the article page is also clearly marked up.
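Since the screenshots are not available here, the structure can be illustrated with a small sketch. The markup below is hypothetical, reconstructed only from the selectors the crawler uses (`div.wz_tab` for a search result, `span#ChDivSummary` for the abstract); the real pages may differ. The sketch uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without third-party packages.

```python
from html.parser import HTMLParser

# Hypothetical markup, inferred from the selectors used in the crawler
SAMPLE_RESULT = ('<div class="wz_tab"><h3>'
                 '<a href="http://example.com/detail?id=1">Title</a>'
                 '</h3></div>')
SAMPLE_DETAIL = '<span id="ChDivSummary">Abstract text.</span>'


class ResultLinkParser(HTMLParser):
    """Collect the first link inside each <div class="wz_tab">."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._depth = 0      # <div> nesting depth inside a wz_tab block
        self._taken = False  # True once the current block yielded a link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self._depth > 0:
                self._depth += 1
            elif "wz_tab" in (attrs.get("class") or "").split():
                self._depth = 1
                self._taken = False
        elif (tag == "a" and self._depth > 0
              and not self._taken and "href" in attrs):
            self.links.append(attrs["href"])
            self._taken = True

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1


class AbstractParser(HTMLParser):
    """Collect the text of <span id="ChDivSummary"> (no nested spans)."""

    def __init__(self):
        super().__init__()
        self.text = ""
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("id") == "ChDivSummary":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.text += data


link_parser = ResultLinkParser()
link_parser.feed(SAMPLE_RESULT)
abstract_parser = AbstractParser()
abstract_parser.feed(SAMPLE_DETAIL)
print(link_parser.links, abstract_parser.text)
```

With BeautifulSoup the same two lookups shrink to one `find_all` and one `find` call, which is why the crawler uses it.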

I connect to the database with pymysql, and the performance is acceptable.
The code follows:

# -*- coding: utf-8 -*-
import time
import re
import random
import requests
from bs4 import BeautifulSoup
import pymysql

connection = pymysql.connect(host='',
                             user='',
                             password='',
                             db='',
                             port=3306,
                             charset='utf8')  # note: 'utf8', not 'utf-8'

# Get a cursor
cursor = connection.cursor()

#url = 'http://epub.cnki.net/grid2008/brief/detailj.aspx?filename=RLGY201806014&dbname=CJFDLAST2018'

# These headers must be included; otherwise the site redirects the request to another page
headers = {
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding':'gzip, deflate, sdch',
    'Accept-Language':'zh-CN,zh;q=0.8',
    'Connection':'keep-alive',
    'Host':'www.cnki.net',
    'Referer':'http://search.cnki.net/search.aspx?q=%E4%BD%9C%E8%80%85%E5%8D%95%E4%BD%8D%3a%E6%AD%A6%E6%B1%89%E5%A4%A7%E5%AD%A6&rank=relevant&cluster=zyk&val=CDFDTOTAL',
    'Upgrade-Insecure-Requests':'1',
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}

headers1 = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
    }

def get_url_list(start_url):
    depth = 20  # number of result pages to fetch
    url_list = []
    for i in range(depth):
        try:
            # The p parameter is an offset that advances by 15 per page
            url = start_url + "&p=" + str(i * 15)
            search = requests.get(url.replace('\n', ''), headers=headers1)
            soup = BeautifulSoup(search.text, 'html.parser')
            for art in soup.find_all('div', class_='wz_tab'):
                href = art.find('a')['href']
                print(href)
                # Skip links that were already collected
                if href not in url_list:
                    url_list.append(href)
            print("Page " + str(i) + " crawled successfully!")
            time.sleep(random.randint(1, 3))  # random pause between requests
        except Exception:
            print("Failed to crawl page " + str(i) + "!")
    return url_list

def get_data(url_list, wordType):
    # Visit every article URL collected by get_url_list and store its metadata
    for i, url in enumerate(url_list, start=1):
        if not url:
            continue
        url = url.replace('\n', '')
        print("URL #" + str(i))
        print(url)
        try:
            html = requests.get(url, headers=headers)
            soup = BeautifulSoup(html.text, 'html.parser')
        except Exception:
            print("Failed to fetch the page")
            continue
        try:
            # Title
            title = soup.find('title').get_text().split('-')[0]
            # Authors
            author = ''
            for a in soup.find('div', class_='summary pad10').find('p').find_all('a', class_='KnowledgeNetLink'):
                author += (a.get_text() + ' ')
            # Abstract
            abstract = soup.find('span', id='ChDivSummary').get_text()
        except Exception:
            # Skip the record instead of inserting missing or stale fields
            print("Failed to parse some fields, skipping")
            continue
        # Keywords; some articles have none
        key = ''
        try:
            for k in soup.find('span', id='ChDivKeyWord').find_all('a', class_='KnowledgeNetLink'):
                key += (k.get_text() + ' ')
        except Exception:
            pass
        print("【Title】：" + title)
        print("【author】：" + author)
        print("【abstract】：" + abstract)
        print("【key】：" + key)
        # Execute the SQL statement
        cursor.execute('INSERT INTO cnki VALUES (NULL, %s, %s, %s, %s, %s)', (wordType, title, author, abstract, key))
        # Commit the insert to the database
        connection.commit()

        print()
    print("Crawl finished")

if __name__ == '__main__':
    try:
        # A list (not a set) keeps the crawl order deterministic
        for wordType in ["大腸桿菌", "菌群總落", "胭脂紅", "日落黃"]:
            # Each search combines "肉" (meat) with one keyword
            wordType = "肉+" + wordType
            start_url = "http://search.cnki.net/search.aspx?q=%s&rank=relevant&cluster=zyk&val=" % wordType
            url_list = get_url_list(start_url)
            print("Start crawling")
            get_data(url_list, wordType)
            print("Finished one keyword")
        print("Finished all keywords")
    finally:
        connection.close()
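The INSERT statement in get_data supplies six values and passes NULL for the first one, which suggests a table with an auto-increment key followed by five text columns. The original post does not show the table definition, so the following is only a sketch under that assumption; every column name and type here is hypothetical, as is the helper `ensure_table`.

```python
# Hypothetical schema for the `cnki` table. Column names and types are
# assumptions inferred from the six-value INSERT in get_data.
CREATE_CNKI_TABLE = """
CREATE TABLE IF NOT EXISTS cnki (
    id INT AUTO_INCREMENT PRIMARY KEY,
    word_type VARCHAR(255),
    title VARCHAR(512),
    author VARCHAR(512),
    abstract TEXT,
    keywords VARCHAR(512)
) DEFAULT CHARSET = utf8
"""


def ensure_table(cursor):
    """Create the table if it is missing; `cursor` is any DB-API cursor."""
    cursor.execute(CREATE_CNKI_TABLE)
```

Calling `ensure_table(cursor)` once, right after `connection.cursor()`, would let the script run against an empty database.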

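I picked only a few keywords here as an experiment. If you need to crawl many, you can put them in a txt file and read them from there, which is very convenient.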
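Reading the keywords from a file could look like the sketch below. It assumes one keyword per line in a UTF-8 text file; the function name `load_keywords` and the file layout are assumptions, not part of the original script.

```python
def load_keywords(path):
    """Read one keyword per line, skipping blank lines and duplicates."""
    keywords = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            word = line.strip()
            if word and word not in keywords:
                keywords.append(word)
    return keywords
```

The main block would then loop over `load_keywords("keywords.txt")` instead of the hard-coded keywords, where `keywords.txt` is a placeholder file name.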
