Penetration Testing Toolkit: Kali Tools (Chapter 4, Part 5) Introduction to Web Crawlers

This article is a repost. 2020-05-18 07:09

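In this article:

Request/response mechanism
Web page parsing
Modules and libraries a crawler needs
Directory scanner: principle and practice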

Getting started with Python crawlers [spider]

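1. Request/response mechanism:

    How the server and the local machine exchange data:

        HTTP: the protocol that defines how a client and a server converse.

        Client -------[request]-------> Server

        Client <------[response]------- Server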

    HTTP request:

        The client sends a request to the server. HTTP defines several request methods; the two used most often here are GET and POST.
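To make the difference between GET and POST concrete, here is a minimal sketch that only prepares two requests with the requests library and sends nothing over the network. The endpoint `https://example.com/search` and the parameter name `q` are placeholders chosen for illustration, not something from the original article.

```python
import requests

# Hypothetical endpoint, used only for illustration; nothing is actually sent.
BASE_URL = "https://example.com/search"


def build_get(query):
    """Prepare a GET request: the parameters end up in the URL's query string."""
    return requests.Request("GET", BASE_URL, params={"q": query}).prepare()


def build_post(query):
    """Prepare a POST request: the parameters end up in the request body."""
    return requests.Request("POST", BASE_URL, data={"q": query}).prepare()


get_req = build_get("kali")
post_req = build_post("kali")

print(get_req.method, get_req.url)
print(post_req.method, post_req.url, post_req.body)
```

With GET the data travels in the URL, while with POST the URL stays unchanged and the data travels in the body, which is why forms such as logins are normally submitted with POST.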

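    HTTP response:

        After the client sends a request, the server returns a response [the page we asked for].

        Response:

            status_code [status code] 200: the request succeeded and the page content is returned

            status_code [status code] 403/404: access is forbidden (403) or the page does not exist (404)

        To see this in practice, open Google Chrome, load a website, then choose Inspect >>> Network >>> refresh the page.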
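The status codes mentioned above can be looked up with the standard library alone. The following is a small sketch, not part of the original article; the helper names `describe_status` and `is_success` are made up for this example.

```python
from http import HTTPStatus


def describe_status(code):
    """Return 'code phrase' for a known HTTP status code."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        return f"{code} (unknown status code)"


def is_success(code):
    """A crawler usually treats the whole 2xx range as success."""
    return 200 <= code < 300


for code in (200, 403, 404):
    outcome = "success" if is_success(code) else "failure"
    print(describe_status(code), "->", outcome)
```

A crawler would typically run a check like `is_success(response.status_code)` before trying to parse the body.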

2. Web page parsing:

    1. Parsing a page requires [bs4]:

        from bs4 import BeautifulSoup
        import requests

        # Fetch the page and parse its content
        url = "https://www.baidu.com"
        wb_data = requests.get(url)
        soup = BeautifulSoup(wb_data.text, 'lxml')
        print(soup)

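    2. Describing where the target element is:

        e.g. a title [find it on the page in the browser's inspector] >>> right-click >>> Copy selector

            titles = soup.select('#sy_load > ul:nth-child(2) > li:nth-child(1) > div.syl_info > a')
            print(titles)

            Explanation:

                #sy_load > ul:nth-child(2) > li:nth-child(1) > div.syl_info: the position of the tag, written as a CSS selector

                a: match the a tags directly inside it

        Searching by the class name of the parent tag instead, which matches every such link rather than only the one in the first list item:

            titles = soup.select('div.syl_info > a')
            print(titles)

            Explanation:

                div.syl_info: a div tag whose class is syl_info

                a: match the a tags directly inside it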
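The two selectors can be compared side by side on a small inline document. This is a sketch under assumptions: the HTML below is invented to mimic the structure the selectors imply, and it uses Python's built-in `html.parser` instead of lxml so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the structure the selectors above assume.
HTML = """
<div id="sy_load">
  <h2>News</h2>
  <ul>
    <li><div class="syl_info"><a href="/a1">First article</a></div></li>
    <li><div class="syl_info"><a href="/a2">Second article</a></div></li>
  </ul>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Full path: only the link inside the first <li> of the <ul> that is the
# second child element of #sy_load.
first_only = soup.select(
    "#sy_load > ul:nth-child(2) > li:nth-child(1) > div.syl_info > a"
)

# Parent class only: every link that sits directly inside a div.syl_info.
all_links = soup.select("div.syl_info > a")

print([a.get_text() for a in first_only])
print([a["href"] for a in all_links])
```

The copied selector pins down one element, which is useful for checking a single value; the shorter class-based selector is usually what a crawler wants, because it collects every item of the same kind.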

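    3. Installing and using BeautifulSoup (provided by the bs4 package):

        1. Install: pip install beautifulsoup4

        2. Optionally install a parser:

            pip install lxml [this one is usually enough]

            pip install html5lib

        3. Usage:

            from bs4 import BeautifulSoup
            import requests

            req_obj = requests.get('https://www.baidu.com')
            soup = BeautifulSoup(req_obj.text, 'lxml')

            Without BeautifulSoup, printing req_obj only shows the status code, e.g. <Response [200]>.

            With BeautifulSoup, printing soup shows the parsed HTML of the page.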

        4. Frequently used methods:

            from bs4 import BeautifulSoup
            import requests

            a = requests.get('https://www.baidu.com')
            b = BeautifulSoup(a.text, 'lxml')

            print(b.title)             # the title tag (first match only)
            print(b.find('title'))     # same result via find(): first match only
            print(b.find_all('div'))   # every div tag, as a list

            c = b.div                  # the first div tag
            print(c['id'])             # the tag's id attribute (KeyError if it has none)
            print(c.attrs)             # all attributes of the tag, as a dict

            d = b.title                # the title tag
            print(d.string)            # the string inside the tag

            e = b.head                 # the head tag
            print(e.title)             # get a tag, then one of its child tags

            f = b.body                 # the body tag
            print(f.contents)          # the tag's direct children, as a list

            g = b.title                # the title tag
            print(g.parent)            # the parent tag

            print(b.find_all(id='link2'))   # every tag whose id is link2
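Because a live page changes over time, the methods listed above are easier to study on a fixed document. The sketch below uses an invented HTML string (the ids `main`, `link1` and `link2` are assumptions for the example) and the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Made-up document so that every lookup has a known answer.
HTML = (
    "<html><head><title>Demo page</title></head>"
    "<body><div id='main' class='box'>"
    "<a id='link1' href='/one'>One</a>"
    "<a id='link2' href='/two'>Two</a>"
    "</div></body></html>"
)

soup = BeautifulSoup(HTML, "html.parser")

print(soup.title.string)          # text inside <title>
print(soup.div["id"])             # a single attribute
print(soup.div.attrs)             # all attributes, as a dict
print(soup.title.parent.name)     # name of the parent tag
print(len(soup.body.contents))    # number of direct children of <body>
print([a["href"] for a in soup.find_all("a")])
print(soup.find_all(id="link2"))  # list of tags whose id is link2
```

Note that `class` comes back as a list, since an HTML element may carry several classes.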

3. Modules and libraries a crawler needs:

    Libraries: requests, bs4

    Class: BeautifulSoup (imported from bs4)

    1. Fetching: requests

    2. Parsing: BeautifulSoup

    3. Storage:
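The article does not say how results should be stored. One common option, assumed here purely for illustration, is a CSV file written with the standard `csv` module. The sketch below chains the parsing and storage steps on an invented HTML snippet and writes to an in-memory buffer in place of a real file.

```python
import csv
import io

from bs4 import BeautifulSoup

# Made-up page content; in a real crawler this would be requests.get(url).text.
HTML = '<ul><li><a href="/a1">First</a></li><li><a href="/a2">Second</a></li></ul>'


def parse_links(html):
    """Parsing step: return (text, href) pairs for every link with an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a["href"])
            for a in soup.find_all("a", href=True)]


def store_links(rows, out):
    """Storage step: write the rows as CSV to any file-like object."""
    writer = csv.writer(out)
    writer.writerow(["text", "href"])
    writer.writerows(rows)


rows = parse_links(HTML)
buffer = io.StringIO()  # stands in for open("links.csv", "w", newline="")
store_links(rows, buffer)
print(buffer.getvalue())
```

Keeping fetching, parsing and storage in separate functions means each step can be swapped out, for example replacing the CSV writer with a database insert, without touching the others.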

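4. Directory scanner: principle and practice:

    # Idea: append each entry of a wordlist to the base URL and report
    # the paths for which the server answers with status code 200.
    import sys

    import requests

    url = sys.argv[1]    # base URL to scan
    dic = sys.argv[2]    # path to the wordlist file

    with open(dic, 'r') as f:
        for i in f:                      # read the wordlist line by line
            i = i.strip()                # strip whitespace and the trailing newline
            if not i:
                continue                 # skip blank lines
            r = requests.get(url + i)
            if r.status_code == 200:
                print('url:' + r.url)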
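The scanner above can be restructured so that its logic is testable without touching a network. The following is a sketch under assumptions, not the author's tool: the function names, the 5-second timeout and the `FAKE_SITE` table are all invented for the example, and the status lookup is passed in as a parameter so that a fake can replace the real HTTP call.

```python
from urllib.parse import urljoin


def candidate_urls(base_url, words):
    """Yield one URL per non-empty wordlist entry, joined onto base_url."""
    if not base_url.endswith("/"):
        base_url += "/"
    for word in words:
        word = word.strip().lstrip("/")
        if word:
            yield urljoin(base_url, word)


def scan(base_url, words, get_status):
    """Return the candidate URLs for which get_status(url) is 200."""
    return [url for url in candidate_urls(base_url, words)
            if get_status(url) == 200]


def requests_status(url):
    """Real status lookup; returns None when the request fails."""
    import requests  # imported here so the helpers above work without it
    try:
        return requests.get(url, timeout=5).status_code
    except requests.RequestException:
        return None


# Offline demonstration with a fake server table (made-up paths).
FAKE_SITE = {"http://test.local/admin": 200, "http://test.local/backup": 403}
found = scan(
    "http://test.local",
    ["admin\n", "backup\n", "\n", "missing\n"],
    lambda url: FAKE_SITE.get(url, 404),
)
print(found)
```

Passing `requests_status` in place of the lambda would run the same scan over the network; the timeout and the exception handling keep one unresponsive path from stopping the whole run.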
