WeChat Official Account Crawler -- Historical Articles


Today I put together a demo crawler for the historical articles of a WeChat Official Account. I tested it myself and it works, so I'm writing it down here. (Please be kind.)

Drawbacks: 1. It's not very smart. 2. Compatibility isn't great, but it handles the normal cases fine.

Stack: MySQL + requests

Database part

Here is the table DDL:

CREATE TABLE `wechat_content` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `wechat_name` varchar(255) DEFAULT NULL COMMENT 'official account name',
  `title` varchar(225) DEFAULT NULL COMMENT 'article title',
  `content_url` varchar(1000) DEFAULT NULL COMMENT 'article URL',
  `cover` varchar(1000) DEFAULT NULL COMMENT 'cover image',
  `source_url` varchar(1000) DEFAULT NULL COMMENT 'reposted-from URL',
  `source_name` varchar(255) DEFAULT NULL COMMENT 'reposted-from account name',
  `datetime` varchar(255) DEFAULT NULL COMMENT 'publish time',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1629 DEFAULT CHARSET=utf8

import json
import time

import pymysql
import requests


def get_sqlConn():
    try:
        conn = pymysql.connect(
            # host='localhost',
            host=ip,        # placeholder: your MySQL host
            port=3306,
            user='root',
            password=密碼,   # placeholder: your MySQL password
            db='py_mysql_test',
            charset='utf8'
        )
        print('Database connection succeeded!')
        return conn
    except Exception as e:
        print('Database connection failed! ERROR: %s' % e)

The insert method:

def insert_wechat_content(wechat_name, title, content_url, cover, source_url, source_name, datetime):
    try:
        conn = get_sqlConn()
        cur = conn.cursor()
        # the caller pre-wraps every value in single quotes before calling this
        sql = """INSERT INTO wechat_content(wechat_name,title,content_url,cover,source_url,source_name,datetime) VALUES (%s,%s,%s,%s,%s,%s,%s)""" % (wechat_name, title, content_url, cover, source_url, source_name, datetime)
        print("Official account insert SQL: %s" % sql)
        cur.execute(sql)
        conn.commit()
        print('Insert succeeded!')
    except Exception as e:
        print('Insert failed! ERROR: %s' % e)
        conn.rollback()  # roll back
    finally:
        cur.close()
        conn.close()
        print('Done with the database operation.')
To avoid inserting duplicates, there is also a query method:
def select_wechat_content(title):
    conn = get_sqlConn()
    cur = conn.cursor()
    try:
        # title arrives already wrapped in single quotes by the caller
        sql = "SELECT EXISTS(SELECT 1 FROM wechat_content WHERE title=%s)" % title
        print("Official account query SQL: %s" % sql)
        cur.execute(sql)
        # fetchone() returns a one-element tuple; return the 0/1 scalar
        return cur.fetchone()[0]
    except Exception as e:
        print('Query failed! ERROR: %s' % e)
        conn.rollback()
    finally:
        cur.close()
        conn.close()
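
As a side note, a minimal sketch of my own (not part of the original code): pymysql can do the quoting itself if the values are passed as a parameter tuple to cursor.execute(). That removes the need for the "'{}'".format(...) wrapping done later in the crawler code and avoids SQL-injection problems; the caller would then pass raw strings instead of pre-quoted ones. The function name insert_wechat_content_safe is hypothetical.

def insert_wechat_content_safe(wechat_name, title, content_url, cover, source_url, source_name, publish_time):
    # hypothetical safer variant; values are raw strings, NOT pre-wrapped in quotes
    conn = get_sqlConn()
    cur = conn.cursor()
    try:
        sql = ("INSERT INTO wechat_content"
               "(wechat_name, title, content_url, cover, source_url, source_name, `datetime`) "
               "VALUES (%s, %s, %s, %s, %s, %s, %s)")
        # pymysql escapes and quotes each value in the tuple for us
        cur.execute(sql, (wechat_name, title, content_url, cover, source_url, source_name, publish_time))
        conn.commit()
    except Exception as e:
        print('Insert failed! ERROR: %s' % e)
        conn.rollback()
    finally:
        cur.close()
        conn.close()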

Python crawling part

First, a bit of analysis. Capturing the traffic with Charles, we can see the request that loads the historical articles:

/mp/profile_ext?action=getmsg&__biz=MzU0NDQ2OTkzNw==&f=json&offset=17&count=10&is_ok=1&scene=&uin=MTU4MzgxNjcwNg%3D%3D&key=5a37b8e9f2933463aa4c791beaedc828c781ae48f9a58c2067595d03e2a4da3d43e47af1b87aea58849a45838a5cd1375e69afd980a0562d3327ff9a7227684fa872ad73ae54f8d9ae5b2392595e0a4d&pass_ticket=n9Zz%2F2GEUA9SBL%2FLVdK8uLAPMwNph3rMVVksmgD0xrMOstqSxkc%2BaMVRVnfNAC9M&wxtoken=&appmsg_token=1030_sVyKhffomeHucF5TrTgG3CyPO9kX-j3obN4DNg~~&x5=0&f=json

This is the request endpoint for the historical articles. From the analysis, the parameters that have to be obtained dynamically are:

__biz: the unique id tying the user to the official account
uin: the user's private id
key: the request key; it expires after a while
offset: the paging offset
count: the number of articles per request
My approach is to take the captured request URL directly and parse its query string into the params dict for the request; the method is below. Because __biz ends with "==" and a naive split on "=" drops it, the value is re-assigned at the end:
def get_parms(u):
    data = u.split("&")
    parms = {}
    for i in data:
        d = i.split("=")
        parms[d[0]] = d[1]
    # splitting on "=" drops the trailing "==" of __biz, so add it back
    parms['__biz'] = parms['__biz'] + "=="
    print(parms)
    return parms
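
If you would rather not hand-roll the parsing, the standard library can do it. A small alternative sketch (not what the original code uses; the name get_parms_stdlib is hypothetical): urllib.parse.parse_qsl splits each pair on the first "=", so the trailing "==" of __biz survives, and it also decodes percent-escapes such as %3D; requests then re-encodes the values when it builds the final URL.

from urllib.parse import parse_qsl

def get_parms_stdlib(u):
    # keep_blank_values=True keeps empty parameters such as scene= and wxtoken=
    return dict(parse_qsl(u, keep_blank_values=True))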
The crawling logic
def get_wx_article(u, wechat_name, index=0, count=10):
    """
    :param u: the request query string captured with Charles, without the /mp/profile_ext? prefix
    :param wechat_name: official account name, used when saving to the database
    :param index: page index
    :param count: number of articles per request
    :return:
    """
    # offset starts at count, so the first batch (the profile home page) is skipped here
    offset = (index + 1) * count
    url = "http://mp.weixin.qq.com/mp/profile_ext?"

    params = get_parms(u)
    params['offset'] = offset
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36',
        'Cookie': 'rewardsn=; wxtokenkey=777; wxuin=1583816706; devicetype=Windows10; version=62070141; lang=zh_CN; pass_ticket=n9Zz/2GEUA9SBL/LVdK8uLAPMwNph3rMVVksmgD0xrMOstqSxkc+aMVRVnfNAC9M; wap_sid2=CILAnPMFElw5Z0w3VXRGdjhNTlF4Ujd0YXFUSjM0MUpkSGFkcUdHTC0wa08tcUR3aEtWZElvcGRwTnUtUjllbHRTU3ctZ0JJQkR0RW1TZjgwNVZZd1RCaTMwNkZSd1lFQUFBfjCjgIDtBTgNQJVO'
    }

    response = requests.get(url=url, params=params, headers=headers)
    print(response.text)
    resp_json = response.json()
    if resp_json.get('errmsg') == 'ok':
        # whether there are more pages, used to decide the return value
        can_msg_continue = resp_json['can_msg_continue']
        # number of articles on the current page
        msg_count = resp_json['msg_count']
        general_msg_list = json.loads(resp_json['general_msg_list'])
        msg_list = general_msg_list.get('list')
        print(msg_list, "**************")
        wechat_name = "'{}'".format(wechat_name)
        for i in msg_list:
            print("=====>%s" % i)
            # some accounts have items without the app_msg_ext_info field; skip them
            if 'app_msg_ext_info' not in i:
                continue
            app_msg_ext_info = i['app_msg_ext_info']
            # title
            title = app_msg_ext_info['title']
            title = "'{}'".format(title)
            # article URL
            content_url = app_msg_ext_info['content_url']
            content_url = "'{}'".format(content_url)
            # cover image
            cover = app_msg_ext_info['cover']
            cover = "'{}'".format(cover)
            # reposted-from URL
            source_url = app_msg_ext_info['source_url']
            source_url = "'{}'".format(source_url)

            # reposted-from account
            source_name = app_msg_ext_info['author']
            source_name = "'{}'".format(source_name)

            # publish time
            datetime = i['comm_msg_info']['datetime']
            datetime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(datetime))
            datetime = "'{}'".format(datetime)

            print(title, content_url)
            print(source_url, source_name)
            print(cover, datetime)
            if select_wechat_content(title) == 1:  # skip duplicates
                print("Record already exists")
            else:
                insert_wechat_content(wechat_name, title, content_url, cover, source_url, source_name, datetime)

        if can_msg_continue == 1:
            return True
        return False
    else:
        print('Failed to fetch the articles...')
        return False

Running the code
if __name__ == '__main__':
    index = 0
    u = "action=getmsg&__biz=MzU0NDQ2OTkzNw==&f=json&offset=17&count=10&is_ok=1&scene=&uin=MTU4MzgxNjcwNg%3D%3D&key=5a37b8e9f2933463aa4c791beaedc828c781ae48f9a58c2067595d03e2a4da3d43e47af1b87aea58849a45838a5cd1375e69afd980a0562d3327ff9a7227684fa872ad73ae54f8d9ae5b2392595e0a4d&pass_ticket=n9Zz%2F2GEUA9SBL%2FLVdK8uLAPMwNph3rMVVksmgD0xrMOstqSxkc%2BaMVRVnfNAC9M&wxtoken=&appmsg_token=1030_sVyKhffomeHucF5TrTgG3CyPO9kX-j3obN4DNg~~&x5=0&f=json"
    while 1:
        print(f'Start fetching page {index + 1} of the official account articles.')
        flag = get_wx_article(u, "Python學習開發", index=index)
        # pause 8 seconds to avoid getting blocked
        time.sleep(8)
        index += 1
        if not flag:
            print('All articles of the official account have been fetched; exiting.')
            break

        print(f'..........Getting ready to fetch page {index + 1} of the articles.')

Final result: (screenshot omitted here)

 

Original post: https://www.cnblogs.com/cxiaolong/p/11318439.html

Later on, the articles from the profile home page can be added as well. Stay tuned.

