Python crawler: scraping the Phoenix (ifeng) index


I came across this question on Zhihu. Here are the problems I ran into while scraping the page:

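1. Looping over the other pages. In other projects a loop usually does the job, but here the table on the first page is different from the tables on the second and third pages, so the parsing rules have to be rewritten. I was lazy: after writing the rules for the first page, I didn't feel like writing them again for the second and third.
2. Garbled text. I fetched the page with requests and the text came back garbled; forcing the encoding to utf-8 fixed it.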
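For the first problem, one option is to keep the loop and pick a parsing rule per page instead of giving up on pages 2 and 3. What follows is only a sketch in Python 3, not the approach used in this post: `page_url`, `parse_first_page`, `parse_later_page`, `PARSERS`, `crawl` and the `fetch` argument are all hypothetical names, and the two parsers are stubs because the layout of the later pages is not described here.

```python
BASE_URL = 'http://hz.house.ifeng.com/detail/2014_10_28/50087618_'


def page_url(page_num):
    """Build the URL of one page: base URL + page number + '.shtml'."""
    return BASE_URL + str(page_num) + '.shtml'


def parse_first_page(html):
    # Hypothetical stub: the rule for the first page's table would go here.
    return [('first-page rule', len(html))]


def parse_later_page(html):
    # Hypothetical stub: pages 2 and 3 use a different table,
    # so they would get their own rule here.
    return [('later-page rule', len(html))]


# Page number -> parser. Pages not listed fall back to the later-page rule.
PARSERS = {1: parse_first_page}


def crawl(page_nums, fetch):
    """Fetch each page with fetch(url) and parse it with that page's rule."""
    rows = []
    for page_num in page_nums:
        html = fetch(page_url(page_num))
        parser = PARSERS.get(page_num, parse_later_page)
        rows.extend(parser(html))
    return rows


if __name__ == '__main__':
    # Offline demo: the fake fetch returns the URL itself instead of real HTML.
    for row in crawl([1, 2, 3], fetch=lambda url: url):
        print(row)
```

Passing `fetch` in as an argument keeps the loop testable without network access; with real pages it would be replaced by a function that downloads the page and returns its text.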
The full script is as follows:
 
#!/usr/bin/python
# -*- encoding:utf-8 -*-

'''
Source page: http://hz.house.ifeng.com/detail/2014_10_28/50087618_1.shtml
Project origin: https://www.zhihu.com/question/26385408
Date: 2016-05-19
'''

import requests
from bs4 import BeautifulSoup
from pandas import DataFrame
'''
I wanted to strip the ㎡ from the area column, but got
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
so I added the three lines below, which fixed it.
(Python 2 only: Python 3 has no sys.setdefaultencoding and does not need this workaround.)
'''
import sys
reload(sys)
sys.setdefaultencoding('utf8')

def get_info():
    xuhao=[]    # "xuhao" is pinyin for serial number
    project_name=[]
    project_strict=[]
    project_sale_num=[]
    project_order_num=[]
    project_sale_area=[]
    project_ave_price=[]

    baseurl='http://hz.house.ifeng.com/detail/2014_10_28/50087618_'

    page_num=1
    url=baseurl+str(page_num)+'.shtml'
    response=requests.get(url)
    response.encoding = 'utf-8' # force requests to decode the response as utf-8
    # print response.encoding   # shows which encoding requests is using

    soup=BeautifulSoup(response.text,'lxml')
    article=soup.find('div',{'class':'article'})
    tr=article.find_all('tr')
    # Skip the first two rows and the last row of the table
    for i in range(2,len(tr)-1):
        td=tr[i].find_all('td')

        xuhao.append(td[0].string.strip())
        project_name.append(td[1].string.strip())
        project_strict.append(td[2].string.strip())
        project_sale_num.append(td[3].string.strip())
        project_order_num.append(td[4].string.strip())
        # Drop the ㎡ unit from the area cell
        project_sale_area.append(td[5].string.replace('㎡','').strip())
        project_ave_price.append(td[6].string.strip())

    # One column per list; all lists have the same length
    df=DataFrame(xuhao,columns=['xuhao'])
    df['name']=project_name
    df['strict']=project_strict
    df['sale_num']=project_sale_num
    df['order_num']=project_order_num
    df['area']=project_sale_area
    df['ave_price']=project_ave_price
    return df


if __name__=='__main__':

    page=get_info()
    print(page)
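Both encoding problems in the script can be reproduced offline. The sketch below is in Python 3 and uses a made-up sample value (`89.5㎡` is not taken from the page); `strip_unit` is a hypothetical helper, not part of the script. It decodes the same UTF-8 bytes with the wrong and the right charset, which is roughly what setting `response.encoding = 'utf-8'` corrects, and then strips the ㎡ without the `setdefaultencoding` workaround, since Python 3 strings are already Unicode. Whether ISO-8859-1 was the exact charset requests guessed for this page is an assumption.

```python
# Made-up sample cell; the real page values are not reproduced here.
sample = '89.5㎡'
raw = sample.encode('utf-8')  # the bytes a UTF-8 page would send

# Decoding with the wrong charset gives garbled text. ISO-8859-1 is what
# requests falls back to for text responses whose headers declare no charset.
garbled = raw.decode('iso-8859-1')
# Decoding with the right charset restores the text.
fixed = raw.decode('utf-8')

# The first UTF-8 byte of ㎡ is 0xe3, the byte named in the Python 2 error.
first_byte = '㎡'.encode('utf-8')[0]


def strip_unit(text, unit='㎡'):
    """Remove the unit and surrounding whitespace from a table cell."""
    return text.replace(unit, '').strip()


if __name__ == '__main__':
    print(repr(garbled))
    print(fixed)
    print(hex(first_byte))
    print(strip_unit(fixed))
```

In Python 2 the same `replace` call mixes a unicode string from BeautifulSoup with a byte-string literal, and the implicit ASCII decode of that literal is what raises the `UnicodeDecodeError` quoted in the script.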

When I ran it in PyCharm, the output looked roughly like this:




