Crawling Sogou WeChat search with Python to get the articles of a given WeChat official account

Reposted article (see the original). 2018-06-22 21:40

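Preface:

A while ago I bookmarked a WeChat official account article crawler that used some nice modules. Unfortunately it kept throwing errors, so I went ahead and wrote my own.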

Main text:

Step 1: crawl the official accounts returned by Sogou WeChat search:

http://weixin.sogou.com/weixin?type=1&query=FreeBuf&ie=utf8&s_from=input&_sug_=n&_sug_type_=1&w=01015002&oq=&ri=11&sourceid=sugg&sut=0&sst0=1529673558816&lkt=0%2C0%2C0&p=40040108

Replace FreeBuf with the name of the official account you want to search for.
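Instead of editing the long URL by hand, the account name can be substituted programmatically. This is a minimal sketch that assumes `type`, `query` and `ie` are the parameters that matter; the post does not say which of the others Sogou requires, so dropping them is a guess. `build_search_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "http://weixin.sogou.com/weixin"


def build_search_url(account):
    """Return a Sogou WeChat search URL for the given official account name."""
    params = {
        "type": 1,                 # same value as in the URL above
        "query": account.strip(),  # the account name or WeChat ID to search for
        "ie": "utf8",
    }
    # urlencode percent-encodes non-ASCII names (as UTF-8) and joins the pairs with "&".
    return SEARCH_ENDPOINT + "?" + urlencode(params)


print(build_search_url("FreeBuf"))
```

If the trimmed URL turns out not to work, the full URL from the post can be kept and only the `query` value swapped.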

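Looking at the page source:

Regular expression matching:

First regex: match the link to the account's profile page. Pattern: src=.*&timestamp=.*&ver=.*&signature=.* (in the raw page source the ampersands are escaped as &amp;, which is why the code matches on &amp;).

The part highlighted in blue in the original screenshot is what we want. If you request the URL several times, you will notice that the signature parameter changes randomly. Its value ends in ==, which gives the second regex: .*?== . Apply it to the first match of the first regex, then open the resulting link and crawl the article links from it (more details are in the code).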
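The two-step extraction can be tried offline. The sketch below runs the same two patterns against a made-up line of page source (the `sample` string and its parameter values are invented for illustration, not captured from Sogou), then turns `&amp;` back into `&` to rebuild the profile link.

```python
import re

# Hypothetical line of search-result HTML; the values are invented.
sample = ('<a target="_blank" href="http://mp.weixin.qq.com/profile?'
          'src=3&amp;timestamp=1529673558&amp;ver=1&amp;signature=AbC*dEf==">'
          'FreeBuf</a>')

# First pattern: find the escaped query string of the profile link.
# The greedy .* runs to the end of the line, so the match has trailing HTML.
candidates = re.findall(r'src=.*&amp;timestamp=.*&amp;ver=.*&amp;signature=.*', sample)

# Second pattern: keep everything up to the first "==", which ends the signature.
query = re.search(r'.*?==', candidates[0]).group(0)

# Undo the HTML escaping of "&" and prepend the profile endpoint.
profile_url = 'https://mp.weixin.qq.com/profile?' + query.replace('&amp;', '&')
print(profile_url)
```

A plain `replace('&amp;', '&')` is used here rather than `html.unescape`, because a general unescape applied to an already unescaped `&timestamp` would read `&times` as a legacy entity.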

Code:

import re
import sys
from urllib.parse import quote

import requests

user = input('Enter the WeChat official account name or WeChat ID to search for: ')
headers = {'user-agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)'}
# Sogou WeChat search URL; the account name is percent-encoded and placed in "query".
url = 'http://weixin.sogou.com/weixin?type=1&s_from=input&query={}&ie=utf8&_sug_=y&_sug_type_=&w=01015002&oq=jike&ri=0&sourceid=sugg&stj=0%3B0%3B0%3B0&stj2=0&stj0=0&stj1=0&hp=36&hp1=&sut=4432&sst0=1529305369937&lkt=5%2C1529305367635%2C1529305369835'.format(quote(user.strip()))

# "CAPTCHA" as it may appear in the page text (simplified and traditional Chinese).
CAPTCHA_MARKERS = ('验证码', '驗證碼')


def has_captcha(text):
    return any(marker in text for marker in CAPTCHA_MARKERS)


def stop_on_captcha(response):
    # Sogou and WeChat both answer with a CAPTCHA page when they suspect a bot.
    if has_captcha(response.text):
        print('[-] CAPTCHA detected. Visit {} in a browser, then rerun this script.'.format(response.url))
        sys.exit()


def zhuaqu():
    # Step 1: fetch the search results and find the account's profile link.
    r = requests.get(url=url, headers=headers)
    stop_on_captcha(r)
    rsw = re.findall('src=.*&amp;timestamp=.*&amp;ver=.*&amp;signature=.*', r.text)
    if not rsw:
        print('[-] No official account link found in the search results.')
        sys.exit()
    # Keep the text up to the "==" that ends the signature, then undo the HTML escaping.
    qd = "".join(re.findall('.*?==', rsw[0])).replace('&amp;', '&')
    urls = 'https://mp.weixin.qq.com/profile?' + qd
    uewq = requests.get(url=urls, headers=headers)
    stop_on_captcha(uewq)

    # Step 2: the profile page assigns src, timestamp, ver and signature in its
    # script; collect them and rebuild the query string.
    ldw = re.findall('src = ".*?" ; ', uewq.text)
    ldw2 = re.findall('timestamp = ".*?" ; ', uewq.text)
    ldw3 = re.findall('ver = ".*?" ; ', uewq.text)
    ldw4 = re.findall('signature = ".*?"', uewq.text)
    ldwsjihe = "".join(ldw + ldw2 + ldw3 + ldw4)
    # Drop whitespace and quotes, and turn the ";" separators into "&".
    hew = "".join(ldwsjihe.split()).replace('"', '').replace(';', '&')
    wanc = "http://mp.weixin.qq.com/profile?" + hew
    xiau = requests.get(url=wanc, headers=headers)

    # Step 3: pull titles and article paths out of the {...} objects in the page.
    houxu = "".join(re.findall('{.*?}', xiau.content.decode('utf-8')))
    title = re.findall('"title":"(.*?)"', houxu)
    purl = re.findall('"content_url":"(.*?)"', houxu)
    # zip stops at the shorter list, so a missing field cannot raise IndexError.
    for name, path in zip(title, purl):
        print('{}: {}'.format(name, 'https://mp.weixin.qq.com' + path.replace('&amp;', '&')))


zhuaqu()
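The final parsing step can also be checked without any network access. This sketch assumes the article list is embedded as JSON-like objects with `title` and `content_url` keys, as the regular expressions in the script imply; the `sample` text and the `extract_articles` helper are invented for illustration and are not real WeChat output.

```python
import re

BASE = 'https://mp.weixin.qq.com'

# Hypothetical fragment of the profile page; titles and paths are made up.
sample = ('{"title":"First post","content_url":"/s?timestamp=1&amp;src=3"}'
          '{"title":"Second post","content_url":"/s?timestamp=2&amp;src=3"}')


def extract_articles(page_text):
    """Return (title, url) pairs found in the page text."""
    # Capture groups return only the quoted values, not the key names.
    titles = re.findall(r'"title":"(.*?)"', page_text)
    paths = re.findall(r'"content_url":"(.*?)"', page_text)
    # zip stops at the shorter list, so unequal counts cannot raise IndexError.
    return [(t, BASE + p.replace('&amp;', '&')) for t, p in zip(titles, paths)]


for title, link in extract_articles(sample):
    print('{}: {}'.format(title, link))
```

Pairing titles and paths by position only works if every article object carries both keys; if the real page does not guarantee that, parsing each object separately would be safer.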

Test results:
