While scraping a foreign website I ran into an encoding problem. The data that comes back in the front-end page looks like this:
亞洲私人珍藏
;賣,令仝好分享他為此
所傾注的心血與熱愛。
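The scraper source code is:

import requests

url = 'http://www.bonhams.com/auctions/24026/lot/120/?category=list&length=100&page=1'
try:
    result = requests.get(url=url).text
except requests.RequestException:
    # Retry once if the first request fails
    result = requests.get(url=url).text
if 'javascript">setTimeout' in result:
    # The response contains a setTimeout script instead of the page; request it again
    result = requests.get(url=url).text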
How should I handle this?
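import requests
from HTMLParser import HTMLParser  # Python 2 module

url = 'http://www.bonhams.com/auctions/24026/lot/120/?category=list&length=100&page=1'
try:
    result = requests.get(url=url).text
except requests.RequestException:
    # Retry once if the first request fails
    result = requests.get(url=url).text
if 'javascript">setTimeout' in result:
    # The response contains a setTimeout script instead of the page; request it again
    result = requests.get(url=url).text

# Convert HTML entities in the page back into the characters they stand for
result_HTMLParser = HTMLParser().unescape(result)
print result_HTMLParser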
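For readers on Python 3, the same unescape step can be sketched with the standard-library `html` module, since `HTMLParser().unescape` is no longer available in recent Python 3 releases. The helper name `decode_entities` and the sample string are hypothetical; the sample assumes the page delivers its Chinese text as numeric character references, which the question does not confirm. In the real scraper the input would be `requests.get(url).text`.

```python
import html


def decode_entities(page_text):
    """Turn HTML character references (e.g. &#20126; or &lt;) back into characters."""
    return html.unescape(page_text)


# Hypothetical sample standing in for the text returned by requests.get(url).text
sample = '&lt;p&gt;&#20126;&#27954;&#31169;&#20154;&#29645;&#34255;&lt;/p&gt;'
decoded = decode_entities(sample)
print(decoded)
```

Text that contains no entities passes through `html.unescape` unchanged, so it is safe to apply to the whole page.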
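Print the raw page source and you will find that the character encoding itself is fine; the text only needs to be HTML-unescaped.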
Given html = '&lt;abc&gt;', you can handle it in Python (Python 2) like this:

import HTMLParser
html_parser = HTMLParser.HTMLParser()
txt = html_parser.unescape(html)  # this gives txt = '<abc>'

If you want to convert it back, you can do this:

import cgi
html = cgi.escape(txt)  # this returns to html = '&lt;abc&gt;'
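The round trip above uses Python 2 APIs. A minimal Python 3 sketch of the same idea, assuming the standard-library `html` module is an acceptable substitute for `HTMLParser` and `cgi.escape`, might look like this (the variable names are illustrative):

```python
import html

escaped = '&lt;abc&gt;'

# Unescape: entities become the characters they represent
txt = html.unescape(escaped)

# Escape: convert the special characters back into entities
back = html.escape(txt)

print(txt)
print(back)
```

Note that `html.escape` also escapes quote characters by default; pass `quote=False` if only `&`, `<` and `>` should be converted.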