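The developer community on Sina Weibo's open platform is increasingly active. Beyond commercial motives, there is a strong grassroots engineering force behind it: many researchers and engineers interested in collective behavior, natural language processing, machine learning, and data mining have started using Sina's real data and platform to build better applications for users, or to uncover patterns in group behavior, including some statistics. This post uses the API provided by Sina's open platform to segment the words in a Weibo user's tags, and then recommends people the user may be interested in based on the resulting keywords. I am recording it here for future reference.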
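Requirements:
Python + Sina Weibo Python SDK + ICTCLAS
Note: ICTCLAS is a Chinese word segmentation package provided by the Institute of Computing Technology, Chinese Academy of Sciences.
On to the code:
1. First, register as a Sina developer to obtain an APP_KEY and APP_SECRET.
2. Following the Python SDK's howto, obtain authorization through the OAuth2 mechanism (get a code, then exchange it for access_token and expires_in). The code is as follows: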
    # -*- coding: UTF-8 -*-
    '''
    Created on 2012-12-10

    @author: jixianwu
    '''
    from weibo import APIClient, APIError
    import urllib, httplib

    class AppClient(object):
        ''' initialize an app client '''
        def __init__(self, *aTuple):
            self._appKey = aTuple[0]       # your app key
            self._appSecret = aTuple[1]    # your app secret
            self._callbackUrl = aTuple[2]  # your callback url
            self._account = aTuple[3]      # your weibo user name (e.g. email)
            self._password = aTuple[4]     # your weibo password
            self.AppCli = APIClient(app_key=self._appKey, app_secret=self._appSecret, redirect_uri=self._callbackUrl)
            self._author_url = self.AppCli.get_authorize_url()
            self._getAuthorization()

        def __str__(self):
            return 'your app client is created with callback %s' % (self._callbackUrl)

        def _get_code(self):
            # Avoids typing the code in by hand: simulates the user's
            # authorization step and captures the code from the redirect.
            conn = httplib.HTTPSConnection('api.weibo.com')
            postdict = {"client_id": self._appKey,
                        "redirect_uri": self._callbackUrl,
                        "userId": self._account,
                        "passwd": self._password,
                        "isLoginSina": "0",
                        "action": "submit",
                        "response_type": "code",
                        }
            postdata = urllib.urlencode(postdict)
            conn.request('POST', '/oauth2/authorize', postdata, {'Referer': self._author_url, 'Content-Type': 'application/x-www-form-urlencoded'})
            res = conn.getresponse()
            location = res.getheader('location')
            # assumes "code" is the first query parameter of the redirect URL
            code = location.split('=')[1]
            conn.close()
            return code

        def _getAuthorization(self):
            ''' get the authorization from sinaAPI with oauth2 authentication method '''
            # Send the code from _get_code to Sina's auth server, which returns
            # access_token and expires_in; with those two we can call the API.
            code = self._get_code()
            r = self.AppCli.request_access_token(code)
            access_token = r.access_token  # the token returned by Sina
            expires_in = r.expires_in
            self.AppCli.set_access_token(access_token, expires_in)
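The listing pulls the code out of the redirect URL by splitting on '='. As a hedged alternative, here is a minimal sketch that parses the query string instead, so it would still work if the redirect carried other parameters. The helper name extract_code and the example URLs are made up for illustration; they are not part of the Weibo SDK.

```python
try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs      # Python 2


def extract_code(location):
    """Return the 'code' query parameter of a redirect URL, or None.

    Hypothetical helper: 'location' stands for the value of the
    Location header returned by the authorize request.
    """
    if not location:
        return None
    query = urlparse(location).query
    values = parse_qs(query).get('code')
    return values[0] if values else None


# Illustrative URLs only; a real redirect goes to your own callback URL.
print(extract_code('https://example.com/callback?code=abc123'))
print(extract_code('https://example.com/callback?state=x&code=abc123'))
```

Returning None when the header is missing or has no code lets the caller report a failed login instead of raising an IndexError.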
3. Get the user's tags through the API:
    # requires "import operator" at the top of the module
    def getTags(self, userid):
        ''' get last three tags sorted by weight of this user '''
        try:
            tags = self.AppCli.tags.get(uid=userid)
        except Exception:
            print 'get tags failed'
            return
        userTags = []
        # sort by weight, highest first
        sortedT = sorted(tags, key=operator.attrgetter('weight'), reverse=True)
        if len(sortedT) > 3:
            # note: after a descending sort, [-3:] keeps the three lowest-weight
            # tags; use [:3] instead to keep the three highest
            sortedT = sortedT[-3:]
        for tag in sortedT:
            # every key other than 'weight' holds a tag name
            for item in tag:
                if item != 'weight':
                    userTags.append(tag[item])
        return userTags
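To make the selection step easier to try without network access, here is a small standalone sketch of picking the highest-weight tag names. It assumes each tag is a dict holding one tag-id key that maps to the tag name plus a 'weight' entry, which is the shape the loop above implies. The function name top_tag_names and the sample data are invented for illustration, and the weights are converted with int() in case they arrive as strings.

```python
def top_tag_names(tags, n=3):
    """Return the names of the n highest-weight tags.

    Assumed shape of each tag: {'<tag id>': '<tag name>', 'weight': <number or numeric string>}
    """
    ranked = sorted(tags, key=lambda t: int(t.get('weight', 0)), reverse=True)
    names = []
    for tag in ranked[:n]:
        for key, value in tag.items():
            if key != 'weight':
                names.append(value)
    return names


# Invented sample data, not a real API response.
sample = [
    {'101': 'music', 'weight': '5'},
    {'102': 'data mining', 'weight': '50'},
    {'103': 'travel', 'weight': '12'},
    {'104': 'python', 'weight': '30'},
]
print(top_tag_names(sample))
```

Slicing with [:n] after a descending sort keeps the heaviest tags, which is one reading of what the method above is meant to return.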
4. Get the users the current user already follows:
    def getFocus(self, userid):
        ''' get the ids of the users followed by the current user '''
        try:
            # the API call itself can fail, so it belongs inside the try block
            focus = self.AppCli.friendships.friends.ids.get(uid=userid)
            return focus.get('ids')
        except Exception:
            print 'get focus failed'
            return
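5. Segment the user tags obtained in step 3 (this needs a word segmentation class written beforehand; the complete source is given at the end of this post):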
    from wordSegmentation import tokenizer

    tkr = tokenizer()
    # concatenate all the tags of the user into one string, then segment the string
    lstrwords = ''
    for tag in userTags:
        utf8_tag = tag.encode('utf-8')
        #print utf8_tag
        lstrwords += utf8_tag
    words = tkr.parse(lstrwords)
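The tokenizer class itself is not shown in this post. To illustrate what a parse() step does, here is a rough stand-in that uses forward maximum matching against a small vocabulary. This is not ICTCLAS and not the author's wordSegmentation module: the class name SimpleTokenizer, its constructor argument, and the sample vocabulary are all assumptions made for the sketch, and it is written for Python 3, where strings are Unicode.

```python
class SimpleTokenizer(object):
    """Hypothetical stand-in for a tokenizer; NOT ICTCLAS.

    Segments text by forward maximum matching: at each position it takes
    the longest vocabulary word that matches, falling back to a single
    character when nothing matches.
    """

    def __init__(self, vocabulary):
        self.vocabulary = set(vocabulary)
        self.max_len = max([len(w) for w in self.vocabulary] or [1])

    def parse(self, text):
        words = []
        i = 0
        while i < len(text):
            longest = min(self.max_len, len(text) - i)
            for size in range(longest, 0, -1):
                piece = text[i:i + size]
                if size == 1 or piece in self.vocabulary:
                    words.append(piece)
                    i += size
                    break
        return words


# Invented vocabulary and input, for illustration only.
tkr = SimpleTokenizer(['数据', '数据挖掘', '机器学习', '旅游'])
print(tkr.parse('数据挖掘机器学习旅游'))
```

A dictionary matcher like this is far cruder than a statistical segmenter, but it shows why concatenated tags need to be split back into keywords before they can be used as search queries.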
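6. Using the keywords from step 5 together with the search endpoint of the Sina API, produce the users that the current user does not follow yet but is likely to be interested in: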
    recommendUsers = {}  # uid -> screen name; assumed to start out empty
    for keyword in words:
        # gbk is only for printing Chinese on a GBK console
        print keyword.decode('utf-8').encode('gbk')
        searchUsers = self.AppCli.search.suggestions.users.get(q=keyword.decode('utf-8'), count=10)

        # recommend the top ten users
        '''
        if len(searchUsers) > 6:
            searchUsers = searchUsers[-6:]
        '''
        for se_user in searchUsers:
            #print se_user
            uid = se_user['uid']
            # filter out those already followed by the current user
            if uid not in userFocus:
                recommendUsers[uid] = se_user['screen_name'].encode('utf-8')
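The filtering logic can also be sketched as a pure function that is testable without calling the API. In this sketch the search call is passed in as a function, and fake_search stands in for the real search endpoint. The names recommend, fake_search, and the catalog data are invented for illustration; only the 'uid' and 'screen_name' fields come from the listing above.

```python
def recommend(keywords, search_users, followed_ids, per_keyword=10):
    """Return {uid: screen_name} for users found by keyword search
    that are not already followed.

    search_users(keyword, count) is any callable returning a list of
    dicts with 'uid' and 'screen_name' keys.
    """
    followed = set(followed_ids)  # set lookup instead of scanning a list
    recommended = {}
    for keyword in keywords:
        for user in search_users(keyword, per_keyword):
            uid = user['uid']
            if uid not in followed:
                recommended[uid] = user['screen_name']
    return recommended


def fake_search(keyword, count):
    """Invented stand-in for the search endpoint."""
    catalog = {
        'python': [{'uid': 1, 'screen_name': 'py_fan'},
                   {'uid': 2, 'screen_name': 'snake_charmer'}],
        'travel': [{'uid': 2, 'screen_name': 'snake_charmer'},
                   {'uid': 3, 'screen_name': 'backpacker'}],
    }
    return catalog.get(keyword, [])[:count]


print(recommend(['python', 'travel'], fake_search, followed_ids=[2]))
```

Keying the result by uid means a user who matches several keywords appears only once in the recommendations.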
------
Running it:
Below is an example using my own Weibo account. My tags are:

After running the recommender, the result is:

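The recommendations are the ones inside the red boxes. These Weibo users share tags with the user receiving the recommendations and have relatively high influence, so they are also the users most likely to deliver high-utility information to that user. (Only some of the users are marked in the figure.)
That completes the whole recommendation task. The complete source is on my GitHub, and corrections from interested readers are welcome.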
