Python Web Scraping Knowledge Points


I. Basics

1. HTML Analysis
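A minimal sketch of HTML analysis using the standard-library html.parser (the sample HTML string below is invented for illustration):

from html.parser import HTMLParser

class LinkParser(HTMLParser):
    # Collect the href attribute of every <a> tag encountered
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    print(value)

parser = LinkParser()
parser.feed("<html><body><a href='http://www.cnblogs.com/'>blog</a></body></html>")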

2. Scraping with urllib

Import the urllib package (Python 3.5.2).
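A minimal sketch of fetching a page with urllib (the URL is only an example):

import urllib.request

url = "http://www.cnblogs.com/"  # example URL, substitute your target
html = urllib.request.urlopen(url).read()  # raw page bytes
print(len(html))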

3. Saving a Web Page with urllib

import urllib.request

url = "http://www.cnblogs.com/wj204/p/6151070.html"
html = urllib.request.urlopen(url).read()  # raw page bytes
fh = open("F:/20_Python/3000_Data/2.html", "wb")  # binary mode, since html is bytes
fh.write(html)
fh.close()

4. Simulating a Browser


import urllib.request

url = "http://www.cnblogs.com/"
# Spoof a browser User-Agent so the site does not reject the request
headers = ("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
data = opener.open(url).read()
fh = open("F:/20_Python/3000_Data/1.html", "wb")
fh.write(data)
fh.close()
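As an alternative sketch, the same User-Agent can be sent per request through urllib.request.Request instead of patching the opener:

import urllib.request

url = "http://www.cnblogs.com/"
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0"})
data = urllib.request.urlopen(req).read()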


5. Saving Images with urllib

Use http://www.bejson.com/ to inspect the JSON data g_page_config stored in the page's JavaScript.

import re
import urllib.request
import urllib.error

keyWord = "Python機器學習"  # search keyword: "Python machine learning"
keyWord2 = urllib.request.quote(keyWord)  # URL-encode the Chinese keyword
headers = ("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.1708.400 QQBrowser/9.5.9635.400")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
urllib.request.install_opener(opener)
url = "https://s.taobao.com/search?q=" + keyWord2 + "&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.50862.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20161214"
data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
pat = 'pic_url":"//(.*?)"'  # note: this data is not in the HTML body but in the global script variable g_page_config
imageList = re.compile(pat).findall(data)
for j in range(0, len(imageList)):
    try:
        curImage = imageList[j]
        curImageUrl = "http://" + curImage
        file = "F:/20_Python/3000_Data/" + str(j) + ".jpg"
        print(file)
        urllib.request.urlretrieve(curImageUrl, filename=file)
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    except Exception as e:
        print(e)


6. Regular Expressions

A summary of common regular expressions for scraping page data and parsing HTML tags: http://blog.csdn.net/eastmount/article/details/51082253

For example, the regex analysis for the "Python機器學習" (Python machine learning) search above:

pat = 'pic_url":"//(.*?)"'
re.compile(pat).findall(data)

This extracts the group (.*?) lying between pic_url":"// and the closing quote.
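A self-contained demo of that pattern (the data string below is an invented fragment mimicking g_page_config):

import re

data = '{"pic_url":"//img1.example.com/a.jpg","pic_url":"//img2.example.com/b.jpg"}'  # invented sample
pat = 'pic_url":"//(.*?)"'
print(re.compile(pat).findall(data))  # ['img1.example.com/a.jpg', 'img2.example.com/b.jpg']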

For the regex analysis of Qiushibaike (糗事百科):

pat = '<div class="content">.*?<span>(.*?)</span>.*?</div>'
datalist = re.compile(pat, re.S).findall(pagedata)

The re.S flag makes . match newlines as well, so the pattern can span multiple lines of HTML.


7. IP Proxies

You need reliable, stable IP addresses; find suitable proxies and substitute them into proxy_addr.

import urllib.request
import random

def use_proxy(url, proxy_addr):
    # Route the request through a randomly chosen proxy from the list
    proxy = urllib.request.ProxyHandler({"http": random.choice(proxy_addr)})
    headers = ("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0")
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    opener.addheaders = [headers]
    urllib.request.install_opener(opener)
    data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
    return data

proxy_addr = ["45.64.166.142:8080", "80.1.116.80:80", "196.15.141.27:8080", "47.88.6.158:8118", "125.209.97.190:8080"]
url = "http://cuiqingcai.com/1319.html"  # http://proxy.com.ru
data = use_proxy(url, proxy_addr)
print(len(data))
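Note that random.choice picks one proxy per call, so repeated calls to use_proxy spread requests across the list; a dead proxy surfaces as a urllib.error.URLError, which can be caught as in the exception-handling example below.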

8. Packet Capture Analysis


9. Multi-threaded Scraping


import threading

class DownPage(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
    def run(self):
        # The download work would go here
        print("Handling the download task")

downTask = DownPage()
downTask.start()
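A slightly fuller sketch that downloads several pages in parallel (the URL list is only an example):

import threading
import urllib.request

def down_page(url):
    # Each thread fetches one page and reports its size
    data = urllib.request.urlopen(url).read()
    print(url, len(data))

urls = ["http://www.cnblogs.com/", "http://cuiqingcai.com/1319.html"]  # example URLs
threads = [threading.Thread(target=down_page, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for all downloads to finish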


10. Exception Handling

See "Saving Images with urllib" above, which uses try/except to catch exceptions.
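A minimal standalone sketch of that pattern (the URL is only an example):

import urllib.request
import urllib.error

try:
    data = urllib.request.urlopen("http://www.cnblogs.com/").read()
except urllib.error.URLError as e:
    # HTTPError is a subclass of URLError; only HTTP errors carry a code
    if hasattr(e, "code"):
        print(e.code)
    if hasattr(e, "reason"):
        print(e.reason)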


11. XPath

 http://www.cnblogs.com/defineconst/p/6181333.html
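A minimal XPath sketch, assuming the third-party lxml package is installed (the sample HTML is invented):

from lxml import etree

tree = etree.HTML("<html><body><a href='http://www.cnblogs.com/'>blog</a></body></html>")
print(tree.xpath("//a/@href"))  # extract every href attribute of <a> tags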

II. Installing Scrapy's Related Packages

PyCharm → File → Settings → Project...
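Alternatively, Scrapy and its dependencies can be installed from the command line:

pip install scrapy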
