Scraping Tencent News Data

Reposted article. 2021-06-25 17:06

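I. Background

Tencent News is a news app that brands itself as "事實派" (the fact-focused camp). It has more than 240 million monthly active users, and those users tend to read in depth. Its in-feed advertising appears natively within the news stream and is targeted by user attributes, browsing history, and interests. Many young people today are absorbed in games and their phones. I think we should keep up with national affairs and read more social news, so I chose to scrape Tencent News for this project.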

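II. Design of the Topic-Focused Web Crawler

1. Name of the crawler

Tencent News data scraping

2. Content to crawl and data characteristics

Tencent News data: for each article listed on the finance, entertainment, military, tech, and world channel pages, the crawler collects the category, tags, title, and body text.

3. Overview of the design (approach and technical difficulties)

Find the tags that hold the target data in the page source, extract the data from them, and then analyze it.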
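
As a rough illustration of this approach, the sketch below pulls article links and titles out of a list in a small inline HTML sample. It uses only Python's standard-library `html.parser` so that it runs without extra dependencies, whereas the crawler code later in this article relies on BeautifulSoup. The sample markup, the class `LinkCollector`, and the function `extract_links` are hypothetical and chosen for illustration; the real Tencent News pages have their own structure.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> nested inside an <li>."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._li_depth = 0   # greater than 0 while inside an <li>
        self._href = None    # href of the <a> currently being read, if any
        self._text = []      # text fragments of that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._li_depth += 1
        elif tag == "a" and self._li_depth > 0:
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that belongs to a link we are tracking
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None
        elif tag == "li" and self._li_depth > 0:
            self._li_depth -= 1


def extract_links(html):
    """Return the (href, text) pairs of all list-item links in the HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup standing in for a channel index page
SAMPLE_HTML = """
<ul>
  <li><a href="https://example.com/a1">First headline</a></li>
  <li><a href="https://example.com/a2">Second headline</a></li>
</ul>
<a href="https://example.com/footer">Not in a list</a>
"""

if __name__ == "__main__":
    for href, title in extract_links(SAMPLE_HTML):
        print(href, title)
```

The link outside the list is ignored, which mirrors the idea of first locating the container tag that holds the data and only then reading its children.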

III. Structure of the Target Pages

1. Structure and characteristics of the target pages

2. HTML page parsing

IV. Crawler Program Design

## Economy, entertainment, military
# * News title
# * News content
# * News tags

import csv
import random
import re
import time

import jieba
import pandas as pd
import requests
from bs4 import BeautifulSoup

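def getHTMLText(url):
    '''
    Fetch the HTML of a page. Returns the raw response bytes, or "" on failure.
    '''
    user_agent = [
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
        "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko"
    ]
    # Pick a random User-Agent for each request
    headers = {'User-Agent': random.choice(user_agent)}
    try:
        # Actually send the headers (they were built but never passed before);
        # the timeout keeps a stalled request from hanging the crawler
        r = requests.get(url, headers=headers, timeout=10)
        r.raise_for_status()
        # Return the raw bytes and let BeautifulSoup detect the encoding;
        # setting r.encoding would have no effect on r.content
        return r.content
    except requests.RequestException:
        return ""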

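def parase_index(url):
    '''
    Parse a channel index page and write one CSV row per article
    '''
    html = getHTMLText(url)
    soup = BeautifulSoup(html, "lxml")
    uls = soup.find_all('ul')
    news_type = ""  # news category
    if "finance" in url:
        news_type = "Finance"
    elif "ent" in url:
        news_type = "Entertainment"
    elif "milite" in url:
        news_type = "Military"
    elif "tech" in url:
        news_type = "Tech"
    elif "world" in url:
        news_type = "World"
    print("==========================={}===========================".format(news_type))
    # The article list is expected in the second <ul>; skip the page if it is missing
    if len(uls) < 2:
        return
    for li in uls[1].find_all('li'):
        try:
            detail_url = li.a.attrs['href']  # link to the detail page
            title, content = getContent(detail_url)  # title and body text of the detail page
        except Exception:
            continue  # skip items without a link or whose detail page cannot be parsed
        print(title)
        tag_nodes = li.find_all(attrs={'class': 'tags'})  # news tags
        # Extract the tag text. The pattern stops at the next "<"; a bare lazy group
        # at the end of a pattern always matches the empty string.
        tags = re.findall('target="_blank">(.*?)<', str(tag_nodes[0])) if tag_nodes else []
        tags = ",".join(tags)
        writer.writerow((news_type, tags, title, content))
        time.sleep(2)  # pause between articles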

def getContent(url):
    '''
    Parse the HTML of a news article and return (title, content)
    '''
    html = getHTMLText(url)
    soup = BeautifulSoup(html, "lxml")
    title = soup.h1.get_text()  # article title
    # Body paragraphs carry the class "one-p"
    paragraphs = soup.find_all(attrs={'class': 'one-p'})
    content = "".join(para.get_text() for para in paragraphs)
    return title, content

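def update(old, new):
    '''
    Update the dataset: merge the newly crawled rows into the existing data and drop duplicate rows
    '''
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    data = pd.concat([new, old], ignore_index=True)
    data = data.drop_duplicates()
    return data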

def word_count(data):
    '''
    Word-frequency statistics
    '''
    txt = "".join(str(i) for i in data)
    # Load the stop-word list (one word per line); a set gives fast membership tests
    with open("stop_words.txt", encoding="utf-8") as f:
        stopwords = {line.strip() for line in f}
    words = jieba.lcut(txt)
    counts = {}
    for word in words:
        # Skip stop words and single-character tokens
        if word in stopwords or len(word) == 1:
            continue
        counts[word] = counts.get(word, 0) + 1
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)
    return pd.DataFrame(items)

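# Channel pages to crawl: economy, entertainment, military, tech, world
url_list = [
    'https://new.qq.com/ch/finance/',
    'https://new.qq.com/ch/ent/',
    'https://new.qq.com/ch/milite/',
    'https://new.qq.com/ch/tech/',
    'https://new.qq.com/ch/world/'
]

# Name of the file that stores the dataset
file_name = "NewsData.csv"

try:
    data_old = pd.read_csv(file_name, encoding='gbk')
    has_header = True
except (FileNotFoundError, pd.errors.EmptyDataError):
    # First run: no earlier data to merge, and the header row still has to be written
    data_old = pd.DataFrame()
    has_header = False

# Write with the same encoding that is used for reading (gb2312 was used for writing
# but gbk for reading); errors='ignore' drops characters that gbk cannot encode
with open(file_name, 'a', newline='', encoding='gbk', errors='ignore') as csvFile:
    writer = csv.writer(csvFile)  # module-level writer used by parase_index
    if not has_header:
        writer.writerow(("category", "tags", "title", "content"))
    for url in url_list:
        parase_index(url)

print("Crawling finished!")
print("=====================")
print("Updating the dataset")
data_new = pd.read_csv(file_name, encoding='gbk')
update(data_old, data_new).to_csv(file_name, index=False, encoding='gbk')
print("Update finished!")
print("=================")
print("Starting word-frequency statistics")
data = pd.read_csv(file_name, encoding="gbk")
res = word_count(data['content'])
res.to_csv("frequence.txt", header=False, index=False)
print("Statistics finished!")
print(res)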
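
The word-frequency step can also be written with `collections.Counter` from the standard library. The sketch below is only an illustration under a few assumptions: the text has already been segmented into tokens (for example with `jieba.lcut`), and the function name `count_words` as well as the sample tokens and stop words are hypothetical.

```python
from collections import Counter


def count_words(tokens, stopwords):
    """Count tokens, skipping stop words and single-character tokens.

    Returns (word, count) pairs sorted by descending count.
    """
    stopwords = set(stopwords)  # set membership tests are fast
    counts = Counter(
        token for token in tokens
        if len(token) > 1 and token not in stopwords
    )
    return counts.most_common()


if __name__ == "__main__":
    # Hypothetical pre-segmented tokens and stop words
    tokens = ["經濟", "的", "增長", "經濟", "和", "科技", "我們", "經濟", "科技"]
    stopwords = ["我們"]
    for word, count in count_words(tokens, stopwords):
        print(word, count)
```

`Counter.most_common()` returns the pairs already sorted by count, so the manual dictionary update and the separate sort are no longer needed.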

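V. Summary

Through this project I learned that a project should start with requirements analysis and data analysis, and with a clear idea of the result I want to achieve. Knowing exactly what I am trying to do is a good start. I also came to appreciate how useful web crawlers are, since they can collect such large amounts of data. Working through the design consolidated much of what I had learned, gave me my own understanding of how a small project is developed, and let me experience the sense of accomplishment of finishing one. There is still a lot to improve, and I will work on it.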
