Day05 Recap
1. json module
1. json.loads()
JSON format (object, array) -> Python format (dict, list)
2. json.dumps()
Python format (dict, list, tuple) -> JSON format (object, array)
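A minimal sketch of the two conversions (the dict used here is just illustrative):

import json

py_dict = {"name": "Tom", "age": 20}
json_str = json.dumps(py_dict, ensure_ascii=False)   # dict -> JSON string (object)
print(json_str)                                      # {"name": "Tom", "age": 20}
py_back = json.loads(json_str)                       # JSON string -> dict
print(py_back["name"])                               # Tom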
2. Ajax dynamic loading
1. F12 -> Network -> Query String Parameters
2. params = {the query-string parameters captured in F12}
3. URL: the GET address captured in F12
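A hedged sketch of the usual pattern; the URL and parameter names below are placeholders for whatever the F12 Network panel actually shows for the target site.

import requests

url = "https://example.com/api/list"        # hypothetical GET address captured in F12
params = {"page": 1, "limit": 20}           # hypothetical query-string parameters
headers = {"User-Agent": "Mozilla/5.0"}

res = requests.get(url, params=params, headers=headers)
res.encoding = "utf-8"
print(res.json())                           # Ajax interfaces usually return JSON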
3. selenium + phantomjs
1. phantomjs: a headless browser (loads and renders pages in memory, no GUI)
2. Usage steps
1. Import the module
from selenium import webdriver
2. Create the browser object
driver = webdriver.PhantomJS(executable_path='')
3. Send the request and load the page
driver.get(url)
4. Locate the node
text = driver.find_element_by_class_name("")
5. Send text to it
text.send_keys("")
6. Click
button = driver.find_element_by_id("")
button.click()
7. Close the browser
driver.quit()
3. Common methods (consolidated sketch below)
1. driver.get(url)
2. driver.page_source
3. driver.page_source.find("string")
-1 : not found (failure)
4. driver.find_element_by_id("")
5. driver.find_element_by_name("")
6. driver.find_element_by_class_name("")
7. driver.find_element_by_xpath("")
8. element.send_keys("")
9. element.click()
10. element.text
11. driver.quit()
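A consolidated sketch tying these calls together. It keeps the find_element_by_* API used in these notes (selenium 3.x); the element ids "kw" and "su" are assumed to be Baidu's search box and button, and PhantomJS must be on the PATH.

from selenium import webdriver

driver = webdriver.PhantomJS()            # headless Chrome (next section) works the same way
driver.get("http://www.baidu.com/")

# page_source.find() returns -1 when the string is not in the page source
if driver.page_source.find("百度") != -1:
    kw = driver.find_element_by_id("kw")          # search box (assumed id)
    kw.send_keys("selenium")
    driver.find_element_by_id("su").click()       # search button (assumed id)
    print(driver.title)

driver.quit()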
4. selenium + chromedriver
1. Download the chromedriver package matching your Chrome version
2. How to enable headless mode
option = webdriver.ChromeOptions()
option.add_argument('--headless')                 # option.set_headless() in older selenium releases
option.add_argument('--window-size=1920,3000')
driver = webdriver.Chrome(options=option)
driver.get(url)
*********************************
Day06 Notes
I. JD product scraping case study
See: 01_京東商品抓取(執行JS腳本).py
1. Goal
Product name, product price, number of comments, shop name
2. xpath to match each product node
//div[@id="J_goodsList"]//li
3. About the next page
Next-page button (clickable): class value is pn-next
Next-page button (not clickable): class value is pn-next disabled (see the sketch below)
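A hedged sketch of the next-page check, assuming a selenium driver that has already loaded a JD list page; the class names are the ones noted above.

# decide whether the "next page" button can still be clicked
next_btn = driver.find_element_by_class_name("pn-next")
if "disabled" in next_btn.get_attribute("class"):
    print("last page reached")
else:
    next_btn.click()          # load the next page, then parse again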
II. Multithreaded crawlers
1. Process
1. An application currently running in the operating system
2. One CPU core executes only one process at a time; the other processes are in a non-running state
3. N CPU cores can execute N tasks at the same time
2. Thread
1. An execution unit inside a process; one process may contain many threads
2. Threads share the memory space of the process they belong to (only one runs at a time; the others block)
3. GIL: Global Interpreter Lock
A single "execution pass": whichever thread holds it runs, everyone else waits
4. Typical use cases
1. Multiprocessing: heavy CPU-bound computation
2. Multithreading: I/O-bound work
Crawling: network I/O bound
Writing files: local disk I/O
5. budejie (百思不得其姐) multithreading case study
1. URL: http://www.budejie.com/1
2. Target: the joke text
3. xpath expression
//div[@class="j-r-list-c-desc"]/a/text()
4. Key points
1. Queue (from queue import Queue)
put()
get()
Queue.empty(): is the queue empty?
Queue.join(): block until every item put into the queue has been processed (each marked with task_done())
2. Thread (import threading)
threading.Thread(target=......)
5. Code implementation (sketch below)
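A minimal sketch of the Queue + threading pattern, assuming the site still serves markup matched by the xpath above; the page range and the 5 worker threads are arbitrary choices.

import requests
import threading
from queue import Queue, Empty
from lxml import etree

url_queue = Queue()
for page in range(1, 6):                              # first 5 pages, assumed range
    url_queue.put("http://www.budejie.com/%d" % page)

headers = {"User-Agent": "Mozilla/5.0"}

def crawl():
    while True:
        try:
            url = url_queue.get(block=False)          # grab a URL, stop when the queue is empty
        except Empty:
            break
        html = requests.get(url, headers=headers).text
        parse_html = etree.HTML(html)
        for text in parse_html.xpath('//div[@class="j-r-list-c-desc"]/a/text()'):
            print(text.strip())

threads = [threading.Thread(target=crawl) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()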
III. BeautifulSoup parsing
1. Definition: a parser for HTML and XML documents; relies on lxml
2. Install: python -m pip install beautifulsoup4
Import: from bs4 import BeautifulSoup
3. Usage flow
1. Import the module: from bs4 import BeautifulSoup
2. Create the parse object
soup = BeautifulSoup(html,'lxml')
3. Find the node objects
r_list = soup.find_all("div",attrs={"class":"test"})
4. See the sample code below

from bs4 import BeautifulSoup

html = '<div class="fengyun">雄霸</div>'
# create the parse object
soup = BeautifulSoup(html, 'lxml')
r_list = soup.find_all("div", attrs={"class": "fengyun"})
# use the get_text() method or the string attribute to read a node's text
for r in r_list:
    print(r.get_text())
    print(r.string)

from bs4 import BeautifulSoup

html = '''
<div class="test1">雄霸</div>
<div class="test1">幽若</div>
<div class="test2">
  <span>第二夢</span>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
# find the text of every div node whose class is test1
r_list = soup.find_all("div", attrs={"class": "test1"})
for r in r_list:
    print(r.get_text(), r.string)
# find the text of the span inside the div whose class is test2
r_list = soup.find_all("div", attrs={"class": "test2"})
for r in r_list:
    print(r.span.string)
5. Parsers supported by BeautifulSoup
1. lxml: soup = BeautifulSoup(html,'lxml')
Fast, tolerant of malformed documents
2. html.parser: the Python standard library
Average speed and fault tolerance
3. xml: lxml's XML parser
Fast; meant for XML documents
6. Node selector
Select a node by tag name and read its content:
soup_object.tag_name.string
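A tiny example of the node selector, reusing the kind of snippet shown above:

from bs4 import BeautifulSoup

html = '<div><span class="test">第二夢</span></div>'
soup = BeautifulSoup(html, 'lxml')
# walk down by tag name, then read the text via .string
print(soup.div.span.string)       # 第二夢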
7. find_all(): returns a list
r_list = soup.find_all("tag_name", attrs={"attribute_name": "attribute_value"})
Prefer regular expressions or xpath where possible; they are usually faster.

import requests
from bs4 import BeautifulSoup

url = "https://www.qiushibaike.com/"
headers = {"User-Agent": "Mozilla/5.0"}
# fetch the page source
res = requests.get(url, headers=headers)
res.encoding = "utf-8"
html = res.text
# create the parse object and parse
soup = BeautifulSoup(html, 'lxml')
r_list = soup.find_all("div", attrs={"class": "content"})
# loop over the matched nodes
i = 1
for r in r_list:
    print(r.span.get_text().strip())
    i += 1
    print("*" * 30)
print(i)
IV. The Scrapy framework
1. Definition
An asynchronous crawling framework, highly configurable and extensible; the most widely used crawler framework in Python
2. Installation (Ubuntu)
1. Install the dependency packages
1. sudo apt-get install python3-dev
2. sudo apt-get install python3-pip
3. sudo apt-get install libxml2-dev
4. sudo apt-get install libxslt1-dev
5. sudo apt-get install zlib1g-dev
6. sudo apt-get install libffi-dev
7. sudo apt-get install libssl-dev
2. Install Scrapy
sudo pip3 install Scrapy        # note: capital S here
3. Verify (in the Python interactive shell)
>>> import scrapy               # note: lowercase s here
4. Fixing the warning raised when creating a project
Create a project: scrapy startproject AAA
Warning: "Warning : .... Cannot import OpenType ...."
The cause is an outdated pyasn1; upgrading it fixes the warning:
sudo pip3 install pyasn1 --upgrade
V. The five Scrapy components
1. Engine: the core of the whole framework
2. Scheduler: receives requests from the engine and queues them
3. Downloader: fetches the page source and returns it to the spider
4. Downloader middlewares
Spider middlewares
5. Item Pipeline: data processing
VI. Detailed Scrapy crawl flow
VII. Steps to build a Scrapy project
1. Create the project
scrapy startproject project_name
2. Define the target fields (items.py); see the items.py sketch after this list
3. Write the spider
cd into the spiders folder and run:
scrapy genspider spider_name "domain"
e.g.: scrapy genspider baiduspider www.baidu.com
4. Process the data (pipelines.py)
5. Configure the project globally in settings.py
6. Run the spider
scrapy crawl spider_name
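A minimal items.py sketch for step 2. The class and field names are illustrative, matching the fields targeted in the JD case study above.

import scrapy


class ProductItem(scrapy.Item):
    # fields we plan to extract; one scrapy.Field() per target value
    name = scrapy.Field()
    price = scrapy.Field()
    comment_count = scrapy.Field()
    shop = scrapy.Field()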
VIII. Scrapy project structure
Baidu
├── Baidu : project package
│   ├── __init__.py
│   ├── items.py : defines the structure of the scraped data
│   ├── middlewares.py : downloader and spider middlewares
│   ├── pipelines.py : pipeline file, processes the data
│   ├── settings.py : project-wide configuration
│   └── spiders : folder holding the spider programs
│       ├── baiduspider.py : the spider itself
│       └── __init__.py
│
└── scrapy.cfg : basic project configuration, no need to change it
IX. Configuration files in detail
1. settings.py
# set the User-Agent (change it yourself)
USER_AGENT = 'Baidu (+http://www.yourdomain.com)'
# whether to obey the robots protocol; set it to False
ROBOTSTXT_OBEY = False
# maximum concurrency, 16 by default (leave it unless you have a reason to change it)
CONCURRENT_REQUESTS = 6
# download delay in seconds
DOWNLOAD_DELAY = 1
# default request headers
DEFAULT_REQUEST_HEADERS = {
    # 'User-Agent':'Mozilla/5.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
# spider middlewares (rarely used)
SPIDER_MIDDLEWARES = {
    'Baidu.middlewares.BaiduSpiderMiddleware': 543,
}
# downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'Baidu.middlewares.BaiduDownloaderMiddleware': 543,
}
# item pipelines, where the data is processed (important)
ITEM_PIPELINES = {
    'Baidu.pipelines.BaiduPipelineMySQL': 300,   # 300 is the priority, 1-1000; a smaller number means a higher priority
    'Baidu.pipelines.BaiduPipelineMongo': 200,
}
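The pipeline classes referenced in ITEM_PIPELINES live in pipelines.py. A hedged sketch of what one of them might look like, with the actual MySQL storage code omitted and only the hooks Scrapy calls shown:

class BaiduPipelineMySQL(object):
    def open_spider(self, spider):
        # called once when the spider starts, e.g. open a database connection here
        pass

    def process_item(self, item, spider):
        # called for every item the spider yields; must return the item so that
        # lower-priority pipelines can keep processing it
        print(dict(item))
        return item

    def close_spider(self, spider):
        # called once when the spider closes, e.g. close the connection here
        pass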
X. Project: scrape the Baidu homepage source and save it to 百度.html
1. scrapy startproject Baidu
2. cd Baidu/Baidu
3. subl items.py (nothing to do in this step)
4. cd spiders
5. scrapy genspider baidu "www.baidu.com"
6. subl baidu.py
# spider name
# allowed domains: double-check this
# start_urls: double-check this
def parse(self, response):
    with open("百度.html", "w") as f:
        f.write(response.text)
7. cd ../
8. subl settings.py
1. Set ROBOTSTXT_OBEY to False
2. Add a User-Agent
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0',
    ... ...
}
9. cd spiders and run: scrapy crawl baidu

# -*- coding: utf-8 -*-
import scrapy


class BaiduSpider(scrapy.Spider):
    # spider name, used when running the spider
    name = 'baidu'
    # domains the spider is allowed to crawl
    allowed_domains = ['www.baidu.com']
    # URLs the crawl starts from
    start_urls = ['http://www.baidu.com/']

    # the parse method name must not be changed
    def parse(self, response):
        with open("百度.html", "w") as f:
            f.write(response.text)

# -*- coding: utf-8 -*-

# Scrapy settings for Baidu project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'Baidu'

SPIDER_MODULES = ['Baidu.spiders']
NEWSPIDER_MODULE = 'Baidu.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Baidu (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Baidu.middlewares.BaiduSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'Baidu.middlewares.BaiduDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'Baidu.pipelines.BaiduPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'