Sina Weibo Crawling Notes (2): Simulated Login on the WAP Site with Python

Reposted article, originally published 2015-04-15 14:20. Tags: Sina Weibo / Sina Weibo crawling / simulated login

===================

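Other people's blog posts all look clean and concise, while my layout here is dizzying. I am posting screenshots and code together, because as a beginner I need every step explained.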

===================

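Simulated login to weibo.com worked with the code from http://www.cnblogs.com/houkai/p/3487816.html, but the markup of the new Weibo is too complicated, and when crawling one user's post list the bottom of the page shows a "loading" placeholder that is hard to simulate. So I switched to crawling weibo.cn.

There are further reasons why weibo.cn is better suited to crawling than weibo.com:

1. The simulated login for weibo.cn takes fewer steps than for weibo.com.

2. The post list is paginated, currently with 5 posts per page.

3. Comment lists and repost lists are loaded statically.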

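First, the simulated login. There are two ways to do it from a PC:

1. Set the user-agent to UC Browser to imitate a phone, as in http://blog.csdn.net/zhaolina004/article/details/28699095. The steps are simple. Note that although on a phone you can save the post-login page (which saves the cookie) and log in directly next time, this does not work on a PC.

####As of April 2015 the address after login is http://weibo.cn/?vt=4 and no longer shows the gsid. (In fact the URL that carries the gsid redirects to weibo.cn/?vt=4, as the screenshots below show. Visiting it directly does not lead to the login page, not even with UC.)

####Abandoned

2. Set the user-agent to Firefox, as in http://qinxuye.me/article/simulate-weibo-login-in-python/. That article is from 2012 and it is now April 2015. The login page address, the request data and so on have changed, so some modifications are needed.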
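Both methods come down to sending a chosen User-Agent header from the PC. The following is a minimal sketch of that idea, assuming Python 3's urllib.request (the code in this post targets Python 2's urllib2). The helper name make_request is hypothetical, the User-Agent string is the Firefox 37 one used later in this post, and no request is actually sent.

```python
import urllib.request

# Firefox 37 User-Agent string, as used later in this post.
FIREFOX_UA = ('Mozilla/5.0 (Windows NT 6.1; WOW64; rv:37.0) '
              'Gecko/20100101 Firefox/37.0')


def make_request(url, user_agent=FIREFOX_UA):
    """Build (but do not send) a request that carries a custom User-Agent."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


req = make_request('http://weibo.cn/')
# urllib stores header names capitalized, hence 'User-agent' here.
print(req.get_header('User-agent'))
```

Switching to another browser identity only requires passing a different user_agent string.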

=====================================================

Let's try method 2, writing as I go:

The original code comes from http://qinxuye.me/article/simulate-weibo-login-in-python/ and is pasted below:

 1 import urllib2
 2 import urllib
 3 import cookielib
 4 
 5 import lxml.html as HTML
 6 
 7 class Fetcher(object):
 8     def __init__(self, username=None, pwd=None, cookie_filename=None):
 9         self.cj = cookielib.LWPCookieJar()
10         if cookie_filename is not None:
11             self.cj.load(cookie_filename)
12         self.cookie_processor = urllib2.HTTPCookieProcessor(self.cj)
13         self.opener = urllib2.build_opener(self.cookie_processor, urllib2.HTTPHandler)
14         urllib2.install_opener(self.opener)
15         
16         self.username = username
17         self.pwd = pwd
18         self.headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; rv:14.0) Gecko/20100101 Firefox/14.0.1',
19                         'Referer':'','Content-Type':'application/x-www-form-urlencoded'}
20     
21     def get_rand(self, url):
22         headers = {'User-Agent':'Mozilla/5.0 (Windows;U;Windows NT 5.1;zh-CN;rv:1.9.2.9)Gecko/20100824 Firefox/3.6.9',
23                    'Referer':''}
24         req = urllib2.Request(url ,urllib.urlencode({}), headers)
25         resp = urllib2.urlopen(req)
26         login_page = resp.read()
27         rand = HTML.fromstring(login_page).xpath("//form/@action")[0]
28         passwd = HTML.fromstring(login_page).xpath("//input[@type='password']/@name")[0]
29         vk = HTML.fromstring(login_page).xpath("//input[@name='vk']/@value")[0]
30         return rand, passwd, vk
31     
32     def login(self, username=None, pwd=None, cookie_filename=None):
33         if self.username is None or self.pwd is None:
34             self.username = username
35             self.pwd = pwd
36         assert self.username is not None and self.pwd is not None
37         
38         url = 'http://3g.sina.com.cn/prog/wapsite/sso/login.php?ns=1&revalid=2&backURL=http%3A%2F%2Fweibo.cn%2F&backTitle=%D0%C2%C0%CB%CE%A2%B2%A9&vt='
39         rand, passwd, vk = self.get_rand(url)
40         data = urllib.urlencode({'mobile': self.username,
41                                  passwd: self.pwd,
42                                  'remember': 'on',
43                                  'backURL': 'http://weibo.cn/',
44                                  'backTitle': '新浪微博',
45                                  'vk': vk,
46                                  'submit': '登錄',
47                                  'encoding': 'utf-8'})
48         url = 'http://3g.sina.com.cn/prog/wapsite/sso/' + rand
49         req = urllib2.Request(url, data, self.headers)
50         resp = urllib2.urlopen(req)
51         page = resp.read()
52         link = HTML.fromstring(page).xpath("//a/@href")[0]
53         if not link.startswith('http://'): link = 'http://weibo.cn/%s' % link
54         req = urllib2.Request(link, headers=self.headers)
55         urllib2.urlopen(req)
56         if cookie_filename is not None:
57             self.cj.save(filename=cookie_filename)
58         elif self.cj.filename is not None:
59             self.cj.save()
60         print 'login success!'
61         
62     def fetch(self, url):
63         print 'fetch url: ', url
64         req = urllib2.Request(url, headers=self.headers)
65         return urllib2.urlopen(req).read()

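Let's look at the login process with Firefox's HttpFox tool:

(1) Open the login page and get its URL

Compare with lines 38 and 39 of the code. The url has to be changed to the URL of the page above: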

38         url = 'http://3g.sina.com.cn/prog/wapsite/sso/login.php?ns=1&revalid=2&backURL=http%3A%2F%2Fweibo.cn%2F&backTitle=%D0%C2%C0%CB%CE%A2%B2%A9&vt='
39         rand, passwd, vk = self.get_rand(url)

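Now look at get_rand(), which line 39 calls:

*Following the code from line 7 onward and looking up the page elements shows where the values rand, passwd and vk come from:

(Note that headers is defined a second time here, although __init__() already sets self.headers. I commented it out for now and changed self.headers in __init__() to match the Firefox version actually in use, i.e. the user-agent entry in the Request Headers of the page above.)

(The passwd obtained here is not the user's password. It is the name of a field that the login request has to submit, as explained further below.)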
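To make the three lookups concrete, here is a small sketch that pulls the same values out of a login form using only the Python 3 standard library (html.parser), rather than lxml as in the original code. The class name LoginFormParser and the SAMPLE_PAGE markup are hypothetical; the real login page is more complex and its form action looks different.

```python
from html.parser import HTMLParser


class LoginFormParser(HTMLParser):
    """Collect the form action, the password field name and the vk value."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.passwd_name = None
        self.vk = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form' and self.action is None:
            self.action = attrs.get('action')
        elif tag == 'input':
            if attrs.get('type') == 'password':
                # The *name* of the password field, not the password itself.
                self.passwd_name = attrs.get('name')
            elif attrs.get('name') == 'vk':
                self.vk = attrs.get('value')


# Hypothetical, heavily simplified stand-in for the login page.
SAMPLE_PAGE = '''
<form action="login.php?rand=123456" method="post">
  <input type="text" name="mobile"/>
  <input type="password" name="password_2854"/>
  <input type="hidden" name="vk" value="2854_cfe0_1714426041"/>
</form>
'''

parser = LoginFormParser()
parser.feed(SAMPLE_PAGE)
print(parser.action, parser.passwd_name, parser.vk)
```

The three attributes correspond to the rand, passwd and vk values that get_rand() returns.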

 1 def get_rand(self, url):##############get rand, see html page 
 2         #headers = {'User-Agent':'Mozilla/5.0 (Windows;U;Windows NT 5.1;zh-CN;rv:1.9.2.9)Gecko/20100824 Firefox/3.6.9',
 3         #           'Referer':''}  #####why different?
 4         req = urllib2.Request(url ,urllib.urlencode({}), self.headers)
 5         resp = urllib2.urlopen(req)
 6         login_page = resp.read()
 7         rand = HTML.fromstring(login_page).xpath("//form/@action")[0]
 8         passwd = HTML.fromstring(login_page).xpath("//input[@type='password']/@name")[0]
 9         vk = HTML.fromstring(login_page).xpath("//input[@name='vk']/@value")[0]
10         return rand, passwd, vk

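Enter the username and password:

Entering the username and password triggers no new response, which suggests that Sina does not encrypt these two fields on the client side. This differs from the PC site weibo.com.

(2) Submit the POST

There are four redirects and a final URL that returns 200. The first redirect uses the POST method; the rest are all GET.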
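The same kind of redirect chain can also be observed from code. Below is a hedged sketch, assuming Python 3's urllib.request, of a redirect handler that records each hop. The class name LoggingRedirectHandler is hypothetical, and the two URLs are only illustrative stand-ins for one hop. The demo calls the handler directly, so nothing is sent over the network.

```python
import urllib.request


class LoggingRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Record every redirect hop before letting urllib follow it."""

    def __init__(self):
        super().__init__()
        self.hops = []

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        new_req = super().redirect_request(req, fp, code, msg, headers, newurl)
        if new_req is not None:
            # (status code, method of the redirected request, target URL)
            self.hops.append((code, req.get_method(), newurl))
        return new_req


handler = LoggingRedirectHandler()
# A request with a body is a POST; the URLs are illustrative only.
post = urllib.request.Request('http://login.weibo.cn/login/', data=b'mobile=x')
follow_up = handler.redirect_request(post, None, 302, 'Found', {},
                                     'http://weibo.cn/?vt=4')
print(post.get_method(), '->', follow_up.get_method())  # POST -> GET
print(handler.hops)
```

For a real run the handler would be passed to urllib.request.build_opener(handler). The demo also shows why only the first hop is a POST: urllib turns a redirected POST into a GET without a body.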

First the POST redirect:

Look at lines 40 to 49 of login() in the original code:

40         data = urllib.urlencode({'mobile': self.username,
41                                  passwd: self.pwd,
42                                  'remember': 'on',
43                                  'backURL': 'http://weibo.cn/',
44                                  'backTitle': '新浪微博',
45                                  'vk': vk,
46                                  'submit': '登錄',
47                                  'encoding': 'utf-8'})
48         url = 'http://3g.sina.com.cn/prog/wapsite/sso/' + rand
49         req = urllib2.Request(url, data, self.headers)

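(The part before the ":" in the password entry is a variable. The reason is in the next screenshot.)

The rand, passwd and vk values were all obtained earlier in get_rand(). Here is the actual data that the login request needs:

The password field name is "password_" followed by four digits. The digits are not fixed: this is my second visit, and they already differ from the first. That is why get_rand() stores the field name in a variable.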
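A minimal sketch of building the form body with such a changing field name, assuming Python 3's urllib.parse (the post's own code uses Python 2's urllib.urlencode in the same way). The helper name build_login_body and the credentials are made up, and the backTitle and submit fields are left out for brevity.

```python
from urllib.parse import urlencode, parse_qs


def build_login_body(username, password, passwd_field, vk):
    """Encode the login form; passwd_field is the per-request field name."""
    return urlencode({
        'mobile': username,
        passwd_field: password,  # e.g. 'password_2854', read from the page
        'remember': 'on',
        'backURL': 'http://weibo.cn/',
        'tryCount': '',
        'vk': vk,
    })


# Made-up credentials; the field name and vk mimic the values seen in HttpFox.
body = build_login_body('user', 'secret', 'password_2854',
                        '2854_cfe0_1714426041')
print(body)
print(parse_qs(body, keep_blank_values=True)['password_2854'])
```

Because the key is an ordinary variable, the same function works whatever four digits the login page hands out.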

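Some of the values look garbled. The page is UTF-8 encoded. Switching the view at the bottom from Pretty to Raw shows the values in their encoded form. Tidied up:

mobile=******
password_2854=*******  ## the password field name from the previous login
remember=on
backURL=http%253A%252F%252Fweibo.cn
backTitle=%E6%89%8B%E6%9C%BA%E6%96%B0%E6%B5%AA%E7%BD%91
tryCount=
vk=2854_cfe0_1714426041
submit=%E7%99%BB%E5%BD%95

These values are shown in their encoded form. Running them through an online UTF-8 converter gave me the same garbled text as in the screenshot above. Anyway, I will simply use the encoded values as they are.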

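#############update 4.21############

A correction to what I said above. The values in the Raw view are percent-encoded (URL-encoded) UTF-8 bytes, for example %E7%99%BB%E5%BD%95 for 登录. This is not the same thing as an HTML escape sequence, i.e. a numeric character reference (NCR) of the form "&#...;". For those, see the top-voted answer at http://www.zhihu.com/question/21390312, which notes that "since HTML 4, NCRs are based on Unicode and are independent of the document encoding".

UTF-8 is the encoding of the web page, apparently not of the request message. My Firefox is also set to UTF-8, which is why the decoded view came out garbled.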

#################################
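The encoded values can be checked directly. The sketch below, assuming Python 3's urllib.parse, decodes the strings quoted in this post; the results suggest that they are percent-encoded bytes (UTF-8 in the POST data, GBK in the 2012 login URL). The last line shows a side effect worth knowing: urlencode() escapes the "%" character too, so a value that is already percent-encoded gets encoded a second time.

```python
from urllib.parse import unquote, urlencode

# Values copied from the raw POST data above (percent-encoded UTF-8).
print(unquote('%E6%89%8B%E6%9C%BA%E6%96%B0%E6%B5%AA%E7%BD%91'))  # 手机新浪网
print(unquote('%E7%99%BB%E5%BD%95'))                              # 登录

# backURL was encoded twice, so it takes two rounds of decoding.
print(unquote(unquote('http%253A%252F%252Fweibo.cn')))            # http://weibo.cn

# The backTitle in the 2012 login URL is GBK rather than UTF-8.
print(unquote('%D0%C2%C0%CB%CE%A2%B2%A9', encoding='gbk'))        # 新浪微博

# urlencode() escapes '%' as '%25', re-encoding an already encoded value.
print(urlencode({'submit': '%E7%99%BB%E5%BD%95'}))
# submit=%25E7%2599%25BB%25E5%25BD%2595
```

Whether the server accepts a doubly encoded value is a separate question that only a real request can answer.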

Using the POST data observed locally, update the data part of the original code, and likewise the URL of the POST request:

        data = urllib.urlencode({'mobile': self.username,
                                 passwd: self.pwd,
                                 'remember': 'on',
                                 'backURL': 'http://weibo.cn/',
                                 #'backTitle':u''.decode('utf-8').encode('utf-8'),
                                 #'backTitle':'&#x65B0&#x6D6A&#x5FAE&#x535A',
                                 'backTitle': '%E6%89%8B%E6%9C%BA%E6%96%B0%E6%B5%AA%E7%BD%91',
                                 'tryCount': '',
                                 'vk': vk,
                                 #'submit':u''.decode('utf-8').encode('utf-8'),
                                 #'submit':'&#x767B&#x9646',
                                 'submit': '%E7%99%BB%E5%BD%95'})
        url = 'http://login.weibo.cn/login/?' + rand + '&backURL=http%3A%2F%2Fweibo.cn%2F&backTitle=%E5%BE%AE%E5%8D%9A&vt=4&revalid=2&ns=1'

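(The original code locates elements with the lxml package. An alternative is BeautifulSoup. lxml is said to be about 10 times faster than BeautifulSoup, but it is more troublesome to install, so I switched to a BeautifulSoup implementation for now.)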

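(3) Get the gsid, GET the login URL, and get redirected to the final URL "weibo.cn/?vt=4"

Next comes the GET of the URL that carries the gsid, which redirects to the home page. Here is the GET request:

This part of the code no longer makes sense to me. Sina has probably changed the flow: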

48         url = 'http://login.weibo.cn/' + rand
49         req = urllib2.Request(url, data, self.headers)
50         resp = urllib2.urlopen(req)
51         page = resp.read()
52         link = HTML.fromstring(page).xpath("//a/@href")[0]
53         if not link.startswith('http://'): link = 'http://weibo.cn/%s' % link
54         req = urllib2.Request(link, headers=self.headers)

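As the screenshot two above shows, we need to GET weibo.cn/ plus the gsid. The gsid value is in the cookie, as shown:

It is the value of the gsid_CTandWM field in the cookie, which is stored by weibo.cn.

Look at the four redirects:

The first, login.weibo.cn, does not carry this cookie. The third, passport.weibo.com, belongs to weibo.com and has no gsid. The second and fourth do carry it. But gsid_CTandWM is not produced by the second GET, because the Cookie line of that request already uses the gsid.

So the only conclusion is that the first POST already obtained the gsid value, even though nothing shows up in its response (presumably it is hidden from the user).

I followed the cookie handling in this article, http://blog.csdn.net/xiexieni057/article/details/16698787: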

        #get an important cookie value gsid from cookiejar
        try:
            beginPos = str(self.cookie).index('gsid_CTandWM')
            endPos = str(self.cookie).index('for', beginPos)
        except Exception as e:
            print e
        if beginPos >= endPos:
            print "cookie was changed by sina"
        else:
            cookie_value = str(self.cookie)[beginPos: endPos]
        self.headers2 = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:37.0) Gecko/20100101 Firefox/37.0',
                         'cookie': cookie_value}
        #use gsid to login
        req = urllib2.Request(url="http://weibo.cn",
                              data=urllib.urlencode({}),
                              headers=self.headers2)
        urllib2.urlopen(req)
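Slicing the string form of the cookie jar works, but it depends on how the jar happens to print. As an alternative sketch (not the method of the referenced article), the jar can be iterated and the cookie picked by name. This assumes Python 3's http.cookiejar; find_cookie and make_cookie are hypothetical helper names, and the gsid value is made up.

```python
from http.cookiejar import Cookie, CookieJar


def find_cookie(jar, name):
    """Return the value of the first cookie called `name`, or None."""
    for cookie in jar:
        if cookie.name == name:
            return cookie.value
    return None


def make_cookie(name, value, domain):
    """Demo helper: build a minimal session cookie for the given domain."""
    return Cookie(version=0, name=name, value=value,
                  port=None, port_specified=False,
                  domain=domain, domain_specified=True,
                  domain_initial_dot=domain.startswith('.'),
                  path='/', path_specified=True,
                  secure=False, expires=None, discard=True,
                  comment=None, comment_url=None, rest={})


jar = CookieJar()
# Made-up value standing in for the gsid that the login POST would set.
jar.set_cookie(make_cookie('gsid_CTandWM', 'example-gsid', '.weibo.cn'))

gsid = find_cookie(jar, 'gsid_CTandWM')
print('gsid_CTandWM=%s' % gsid)  # gsid_CTandWM=example-gsid
```

A missing cookie then shows up as None instead of a ValueError from str.index().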

Here is the code after the changes:

# -*- coding: utf-8 -*-
import random
import urllib
import urllib2
import cookielib

#import lxml.html as HTML
from bs4 import BeautifulSoup

class Fetcher(object):
    def __init__(self, iplist=None, username=None, pwd=None, cookie_filename=None):
        # optionally send requests through a randomly chosen proxy
        if iplist:
            self.iplist = iplist
            proxy_ip = random.choice(self.iplist)
            self.proxy_handler = urllib2.ProxyHandler({'http': 'http://' + proxy_ip})
        else:
            self.proxy_handler = urllib2.ProxyHandler()

        # cookie jar: self.cj stores the cookies generated in the process
        self.cj = cookielib.LWPCookieJar()
        if cookie_filename is not None:
            self.cj.load(cookie_filename)
        self.cookie_processor = urllib2.HTTPCookieProcessor(self.cj)
        self.opener = urllib2.build_opener(self.cookie_processor, urllib2.HTTPHandler, self.proxy_handler)
        # make this opener the default used by urllib2.urlopen()
        urllib2.install_opener(self.opener)

        self.username = username
        self.pwd = pwd
        self.headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:37.0) Gecko/20100101 Firefox/37.0',
                        'Referer':'', 'Content-Type':'application/x-www-form-urlencoded'}

    def get_rand(self, url):
        # fetch the login page and read rand, the password field name and vk
        req = urllib2.Request(url, urllib.urlencode({}), self.headers)
        resp = urllib2.urlopen(req, timeout=10)
        login_page = resp.read()
        soup = BeautifulSoup(login_page, 'html.parser')
        rand = soup.form['action'][6:15]
        passwd = soup.find_all('input', attrs={"type": "password"})[0]['name']
        vk = soup.find_all('input', attrs={"name": "vk"})[0]['value']
        print rand, passwd, vk
        return rand, passwd, vk

    def login(self, username=None, pwd=None, cookie_filename=None):
        if self.username is None or self.pwd is None:
            self.username = username
            self.pwd = pwd
        assert self.username is not None and self.pwd is not None
        print self.pwd, self.username  # debug output

        # use this url to get rand
        url = 'http://login.weibo.cn/login/?ns=1&revalid=2&backURL=http%3A%2F%2Fweibo.cn%2F&backTitle=%CE%A2%B2%A9&vt='
        rand, passwd, vk = self.get_rand(url)
        data = urllib.urlencode({'mobile': self.username,
                                 passwd: self.pwd,
                                 'remember': 'on',
                                 'backURL': 'http://weibo.cn/',
                                 'backTitle': '%E6%89%8B%E6%9C%BA%E6%96%B0%E6%B5%AA%E7%BD%91',
                                 'tryCount': '',
                                 'vk': vk,
                                 'submit': '%E7%99%BB%E5%BD%95'})
        url = 'http://login.weibo.cn/login/?' + rand + '&backURL=http%3A%2F%2Fweibo.cn%2F&backTitle=%E5%BE%AE%E5%8D%9A&vt=4&revalid=2&ns=1'
        # build and send the login POST
        req = urllib2.Request(url, data, self.headers)
        resp = urllib2.urlopen(req, timeout=10)

        # get gsid from cookie (str.index raises ValueError if it is absent)
        cookies = str(self.cj)
        beginPos = cookies.index('gsid_CTandWM')
        endPos = cookies.index('for', beginPos)
        if beginPos >= endPos:
            print "cookie was changed by sina"
            return
        self.cookie_value = cookies[beginPos:endPos].strip()
        self.headers2 = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:37.0) Gecko/20100101 Firefox/37.0',
                         'cookie': self.cookie_value}

        # use gsid to open the home page
        req = urllib2.Request(url="http://weibo.cn",
                              data=urllib.urlencode({}),
                              headers=self.headers2)
        urllib2.urlopen(req, timeout=10)

        # save cookie
        if cookie_filename is not None:
            self.cj.save(filename=cookie_filename)
        elif self.cj.filename is not None:
            self.cj.save()
        print 'login success!'

    def fetch(self, url):
        print 'fetch url: ', url
        req = urllib2.Request(url, headers=self.headers)  # specify the headers parameter
        return urllib2.urlopen(req, timeout=10).read()

============================================================================

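That basically completes the changes. Now to test:

I added two print statements, one for the final cookie and one for the gsid value. It worked! Fetching weibo.cn and writing the page content to a file also gives the logged-in home page.

Summary:

1. For the data required by the POST, I did not write the Chinese text (it showed up garbled) but pasted the content exactly as displayed in Raw format.

2. The problems that came up in final testing were all details, such as the statement format the GET requires and the format of data. Have confidence in yourself.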
