Single-page version (recommended)
Crawls only one results page; to fetch additional pages, manually change the `start` number in the URL.
```python
# encoding = utf8
# write by xdd1997  xdd2026@qq.com
# 2020-08-21
'''Still not advisable to crawl many pages in one run: you can get blocked, for an unknown duration'''
import requests
from bs4 import BeautifulSoup

ii = 90  # change this by hand: 0, 10, 20, ... (one results page per value)
url = "https://scholar.paodekuaiweixinqun.com/scholar?start={}&q=Cylindrical+Shells&hl=zh-CN&as_sdt=0,5&as_ylo=2016".format(ii)
# e.g. https://scholar.paodekuaiweixinqun.com/scholar?start=140&q=Cylindrical+Shells&hl=zh-CN&as_sdt=0,5&as_ylo=2016
print(url)

try:
    kv = {'user-agent': 'Mozilla/5.0'}  # browser-like header to pass anti-crawler checks
    r = requests.get(url, headers=kv)
    r.raise_for_status()  # raise an exception if the status code is not 200
    r.encoding = r.apparent_encoding
except:
    print("Failed to reach the site")

demo = r.text
soup = BeautifulSoup(demo, "html.parser")
print('----------------------------------------------------------------------------------------------')

paperlist = []
for ss in soup.find_all('a', {"target": "_blank"}):  # result titles are links that open in a new tab
    tex = ss.get_text().replace(' ', '').split('\n')
    texp = ''
    if len(tex) >= 6:
        for t in tex:
            if t:  # skip empty fragments
                texp = texp + t
        paperlist.append(texp)

for paper in paperlist:
    if len(paper) > 30:  # drop short texts such as "[PDF] researchgate.net"
        print(paper)
```
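Instead of editing the number inside the URL string by hand, the query string can be assembled programmatically. The sketch below is a minimal helper built on the standard library's `urllib.parse.urlencode`; the function name `page_url` and its defaults are assumptions, not part of the original script.

```python
from urllib.parse import urlencode

BASE = "https://scholar.paodekuaiweixinqun.com/scholar"

def page_url(start, query="Cylindrical Shells", year_low=2016):
    # Hypothetical helper: build one results-page URL from its parameters.
    params = {
        "start": start,       # 0, 10, 20, ... one results page per step of 10
        "q": query,           # urlencode turns spaces into '+'
        "hl": "zh-CN",
        "as_sdt": "0,5",
        "as_ylo": year_low,   # only results from this year onward
    }
    return BASE + "?" + urlencode(params)

print(page_url(90))
```

This keeps the manual-pagination idea from the script above but removes the chance of a typo when changing `start`.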
Multi-page version (use with caution!)
Note: you are very likely to get blocked, and it is unclear for how long.
About getting blocked: say the script crawls page after page and Google detects it at page 9. Once blocked, that page no longer opens, even manually, and other pages may intermittently fail to open as well.
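Before parsing each page, it helps to check whether the response already looks like a block. The sketch below is a heuristic only: the status codes and the CAPTCHA marker string are assumptions about how the mirror throttles clients, not something the original post documents.

```python
def looks_blocked(resp):
    # Hypothetical heuristic: treat throttling status codes, or a page that
    # mentions a CAPTCHA, as a sign the client has been blocked.
    if resp.status_code in (403, 429):
        return True
    return "captcha" in resp.text.lower()
```

Calling this right after `requests.get` lets the loop stop early instead of hammering the site while blocked, which only prolongs the ban.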
```python
# encoding = utf8
# write by xdd1997  xdd2026@qq.com
# 2020-08-21
'''Easy to get blocked -- you have been warned'''
import requests
from bs4 import BeautifulSoup
import time
import random

for ii in range(0, 80, 10):  # crawling up to start=90 got this client blocked
    url = "https://scholar.paodekuaiweixinqun.com/scholar?start={}&q=Cylindrical+Shells&hl=zh-CN&as_sdt=0,5&as_ylo=2016".format(ii)
    try:
        kv = {'user-agent': 'Mozilla/5.0'}  # browser-like header to pass anti-crawler checks
        r = requests.get(url, headers=kv)
        r.raise_for_status()  # raise an exception if the status code is not 200
        r.encoding = r.apparent_encoding
    except:
        print("Failed to reach the site")
    demo = r.text
    soup = BeautifulSoup(demo, "html.parser")
    print('----------------------------------------------------------------------------------------------')
    for ss in soup.find_all('a', {"target": "_blank"}):  # result titles are links that open in a new tab
        tex = ss.get_text().replace(' ', '').split('\n')
        if len(tex) == 7:
            print(tex[1] + '  ' + tex[3] + '  ' + tex[6])
    time.sleep(random.random() * 10 + 5)  # random 5-15 s pause between pages
```
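The fixed `random.random() * 10 + 5` pause above can be generalized into a delay that grows after each failed attempt (exponential backoff with jitter), which is the usual way to reduce the chance of a ban. This is a minimal sketch; the function name `polite_sleep` and its defaults are assumptions, not part of the original script.

```python
import random
import time

def polite_sleep(attempt=0, base=5.0, jitter=10.0):
    # Sleep base * 2**attempt plus a uniform random jitter of up to
    # `jitter` seconds; returns the delay so callers can log it.
    delay = base * (2 ** attempt) + random.random() * jitter
    time.sleep(delay)
    return delay
```

With `attempt=0` this reproduces the original 5-15 s pause; passing a growing `attempt` after each failure doubles the base wait, so a throttled client backs off instead of retrying at full speed.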