A hands-on guide to scraping comics with Python (every step is commented)

Reposted article, 2020-09-03 08:52

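I'm still learning myself, so this post is not very advanced. If you spot a problem, please point it out.
When you write a scraper, in any language, the most important first step is to work out where the content you want sits in the page.
In other words, you need something that identifies it uniquely, such as a tag's id or class. The difference between the two is that an id is meant to be unique, so looking up by id returns a single element, whereas a class can appear on many elements in the same page.
So if you select by class, you need to know which of the matches holds your data. Besides id and class you can also select by tag name, which is broader still. Below we take comic scraping as the example and write a scraper step by step, hand in hand (hint, hint).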
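
The three ways of locating content described above can be tried out offline. The following is a minimal sketch using BeautifulSoup on a made-up HTML fragment; the markup and variable names are invented for illustration and are not taken from the real site.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only; it is not taken from the real site.
SAMPLE_HTML = """
<div id="header">Site header</div>
<div class="itemBox"><a class="title" href="/book/1">Book one</a></div>
<div class="itemBox"><a class="title" href="/book/2">Book two</a></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# By id: find() returns the first (and normally only) matching element.
header = soup.find(id="header")

# By class: find_all() returns every match, so pick the one you need by index.
boxes = soup.find_all("div", class_="itemBox")
second_title = boxes[1].find("a", class_="title").text

# By tag name: the broadest selector, here every <div> in the fragment.
all_divs = soup.find_all("div")

print(header.text)    # Site header
print(len(boxes))     # 2
print(second_title)   # Book two
print(len(all_divs))  # 3
```

Selecting by class needed an index (`boxes[1]`), while selecting by id did not, which is the difference the paragraph above describes.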

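1. First we pick the comic site to scrape. I'll use https://m.gufengmh8.com/ as the example. On its search page the address is https://m.gufengmh8.com/search/?keywords=完美世界
Whatever follows keywords= is the search term, so we can build the URL like this: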

[Python]

    url = "https://m.gufengmh8.com/search/?keywords=" + input("Search for a comic: ")  # input() prompts the user and returns what they typed
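
Search terms may contain non-ASCII characters or spaces, so a more explicit alternative is to percent-encode the keyword before appending it. This is a minimal sketch using only the standard library; `build_search_url` and `SEARCH_URL` are hypothetical names introduced here, not part of the original script.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://m.gufengmh8.com/search/"  # search address quoted in the text above


def build_search_url(keyword: str) -> str:
    """Return the search URL with the keyword percent-encoded as UTF-8."""
    return SEARCH_URL + "?" + urlencode({"keywords": keyword})


print(build_search_url("完美世界"))
# https://m.gufengmh8.com/search/?keywords=%E5%AE%8C%E7%BE%8E%E4%B8%96%E7%95%8C
```

The encoded form is the same address a browser sends when you type the Chinese keyword into the address bar.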

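2. Next we take this page apart. What do we want from it? To keep things simple we only scrape the comic titles and let the user choose one; there aren't many comics with the same name anyway (honestly, I'm just lazy).
Press F12 in the browser to open the developer tools and use the element picker (the small arrow icon) on a search result. You can see that the result sits in a div with class itemBox, so fetching every div with class itemBox on the page gives us all the information for each comic.
Here we only take the title. Pick the title with the same arrow and you'll see that it is the text of an a tag. So the logic is clear: find all the itemBox elements, loop over them, get the itemTxt inside each one, then take the first node inside itemTxt.
The code now looks like this: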

[Python]

    import time

    import requests
    from bs4 import BeautifulSoup


    # Fetch the page at url and return its raw content
    def get_document(url):
        # print(url)
        try:
            get = requests.get(url)  # open the connection
            data = get.content  # read the content
            get.close()  # close the connection
        except Exception:  # on any error, retry
            time.sleep(3)  # sleep 3 seconds to give the site time to respond
            try:  # fetch again
                get = requests.get(url)
                data = get.content
                get.close()
            except Exception:
                time.sleep(3)
                get = requests.get(url)  # last attempt; an error here propagates
                data = get.content
                get.close()
        return data


    # Download comics (at this stage it only lists the search results)
    def download_img(html):
        soup = BeautifulSoup(html, 'html.parser')  # BeautifulSoup pairs nicely with requests
        itemBox = soup.find_all('div', attrs={'class': 'itemBox'})  # find_all returns a list
        for index, item in enumerate(itemBox):  # index is the position in the list, item is the element
            itemTxt = item.find('div', attrs={'class': 'itemTxt'})  # each itemBox has only one itemTxt, so use find
            a = itemTxt.find('a', attrs={'class': 'title'}).text
            print(str(index + 1) + '.' + a)


    # download_img(get_document("https://m.gufengmh8.com/search/?keywords=" + str(input("Search for a comic: "))))
    download_img(get_document("https://m.gufengmh8.com/search/?keywords=完美世界"))  # no explanation needed here

Running it prints something like this:

    1.捕獲寵物娘的正確方法
    2.吾貓當仙
    3.百詭談
    4.完美世界PERFECTWORLD
    5.誅仙·御劍行
    6.洞仙歌
    7.貓仙生
    8.完美世界
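
The nested try/except in get_document can also be flattened into a loop. The sketch below is one way to do that, not the author's method: `fetch_with_retry`, its parameters, and the `fetch` callable are hypothetical names. Passing the fetch function in as an argument lets the retry logic be tried without network access.

```python
import time
from typing import Callable


def fetch_with_retry(fetch: Callable[[str], bytes], url: str,
                     attempts: int = 3, delay: float = 3.0) -> bytes:
    """Call fetch(url), waiting `delay` seconds between tries, up to `attempts` tries."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: let the last error propagate
            time.sleep(delay)
    raise ValueError("attempts must be at least 1")


# Demonstration with a fake fetcher that fails twice, then succeeds.
calls = []


def flaky_fetch(url: str) -> bytes:
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("simulated network failure")
    return b"page content"


result = fetch_with_retry(flaky_fetch, "https://example.com/", delay=0)
print(result, len(calls))  # b'page content' 3
```

With requests, the `fetch` argument could be something like `lambda u: requests.get(u, timeout=10).content`; the 10-second timeout is an assumed value.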

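3. We now have a basic search feature, which already counts as a simple scraper. Next we let the user enter a comic's number and then download it.
Open any comic and, using the same method as before, you'll find that the ul with id chapter-list-1 contains all the chapters. Each li in that ul holds an a tag and a span tag, which give the chapter URL and the chapter name respectively. With that we can carry on: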

[Python]

    def download_img(html):
        chapter_url_list = []
        soup = BeautifulSoup(html, 'html.parser')  # BeautifulSoup pairs nicely with requests
        itemBox = soup.find_all('div', attrs={'class': 'itemBox'})  # find_all returns a list
        for index, item in enumerate(itemBox):  # index is the position in the list, item is the element
            itemTxt = item.find('div', attrs={'class': 'itemTxt'})  # each itemBox has only one itemTxt, so use find
            a = itemTxt.find('a', attrs={'class': 'title'})
            chapter_url = a['href']
            chapter_url_list.append(chapter_url)  # store every comic's URL
            print(str(index + 1) + '.' + a.text)
        number = int(input('Enter the comic number: '))
        # The printed numbers are 1 higher than the list indexes, so subtract 1 to get
        # the chosen comic's URL, then fetch its table-of-contents page
        chapter_html = BeautifulSoup(get_document(chapter_url_list[number - 1]), 'html.parser')
        ul = chapter_html.find('ul', attrs={'id': 'chapter-list-1'})  # get the ul
        li_list = ul.find_all('li')  # get all of its li elements
        for li in li_list:  # loop over the chapters
            # Note: the URL here is incomplete, e.g. /manhua/buhuochongwuniangdezhengquefangfa/1000845.html
            li_a_href = li.find('a')['href']
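
Because the chapter hrefs are relative, they have to be joined with the site root before they can be fetched. The sketch below shows one way to turn the chapter list into (name, absolute URL) pairs. It runs on a made-up HTML fragment shaped like the structure described above; `list_chapters` and the sample hrefs are invented for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://m.gufengmh8.com"  # site root used throughout the article

# Made-up chapter list shaped like the structure described above;
# the hrefs and chapter names are invented.
SAMPLE_HTML = """
<ul id="chapter-list-1">
  <li><a href="/manhua/example/1001.html"><span>Chapter 1</span></a></li>
  <li><a href="/manhua/example/1002.html"><span>Chapter 2</span></a></li>
</ul>
"""


def list_chapters(html):
    """Return (chapter name, absolute URL) pairs; an empty list if the ul is missing."""
    soup = BeautifulSoup(html, "html.parser")
    ul = soup.find("ul", id="chapter-list-1")
    if ul is None:
        return []
    chapters = []
    for li in ul.find_all("li"):
        a = li.find("a")
        if a is None or not a.get("href"):
            continue  # skip list items without a usable link
        chapters.append((li.get_text(strip=True), urljoin(BASE_URL, a["href"])))
    return chapters


for name, url in list_chapters(SAMPLE_HTML):
    print(name, url)
```

Checking for `None` before indexing avoids the `TypeError` that `find(...)['href']` raises when the page layout differs from what was expected.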

4. Now open any chapter and find where the comic image sits in the page:

[Python]

    chapter_html = BeautifulSoup(get_document('https://m.gufengmh8.com' + li_a_href), 'html.parser')
    chapter_content = chapter_html.find('div', attrs={'class': 'chapter-content'})  # the div that wraps the image
    img_src = chapter_content.find('img')['src']  # the image URL

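We finally have the image's src. There is one more problem, though: each chapter is split across several pages, so...
After some digging I found that when the requested page doesn't exist, the site shows a default placeholder image instead. So we just keep looping until the image we get back is that placeholder, then stop, like this ↓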

[Python]

    i = 0  # page index within the chapter, starting from 0
    while True:
        li_a_href_replace = li_a_href
        if i != 0:  # without the -i suffix the URL is the first page
            # Replacing "." with "-1." turns
            # https://m.gufengmh8.com/manhua/wanmeishijiePERFECTWORLD/549627.html into
            # https://m.gufengmh8.com/manhua/wanmeishijiePERFECTWORLD/549627-1.html, which is the second page
            li_a_href_replace = li_a_href.replace('.', ('-' + str(i) + '.'))
        print(li_a_href_replace)
        chapter_html = BeautifulSoup(get_document('https://m.gufengmh8.com' + li_a_href_replace), 'html.parser')
        chapter_content = chapter_html.find('div', attrs={'class': 'chapter-content'})
        img_src = chapter_content.find('img')['src']
        if img_src == 'https://res.xiaoqinre.com/images/default/cover.png':  # placeholder image: this page does not exist
            break
        i += 1  # move on to the next page
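
The loop above replaces every "." in the href, which only works while the href contains a single dot. A slightly more defensive sketch rewrites just the trailing ".html" and caps the number of pages; `page_href`, `collect_image_urls`, the `max_pages` cap, and the fake site data are assumptions introduced for illustration.

```python
PLACEHOLDER = "https://res.xiaoqinre.com/images/default/cover.png"  # placeholder URL from the article


def page_href(chapter_href: str, page: int) -> str:
    """Return the href of the given zero-based page of a chapter.

    Page 0 is the chapter href itself; page n inserts "-n" before ".html".
    """
    if page == 0:
        return chapter_href
    if not chapter_href.endswith(".html"):
        raise ValueError("expected an href ending in .html: " + chapter_href)
    return chapter_href[:-len(".html")] + "-" + str(page) + ".html"


def collect_image_urls(chapter_href, get_image_src, max_pages=500):
    """Collect image URLs page by page until the placeholder appears.

    get_image_src is a caller-supplied function mapping a page href to its image src.
    max_pages is a safety cap so that a site change cannot cause an endless loop.
    """
    urls = []
    for page in range(max_pages):
        src = get_image_src(page_href(chapter_href, page))
        if src == PLACEHOLDER:
            break
        urls.append(src)
    return urls


# Demonstration with a fake three-page chapter instead of real HTTP requests.
fake_site = {
    "/manhua/example/549627.html": "https://img.example/0.jpg",
    "/manhua/example/549627-1.html": "https://img.example/1.jpg",
    "/manhua/example/549627-2.html": "https://img.example/2.jpg",
}
images = collect_image_urls("/manhua/example/549627.html",
                            lambda href: fake_site.get(href, PLACEHOLDER))
print(images)
```

In the real scraper, `get_image_src` would fetch the page and read the img tag inside the chapter-content div, as the code above does.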
                

5. We now have the src of every image in the comic, so all that's left is to download them. First create the directory:

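[Python]

    # book_name is the comic's title (read from the contents page in the complete code);
    # li is the current chapter's li element, whose text is the chapter name
    path = "d:/SanMu/" + book_name + '/' + li.text.replace('\n', '')
    if not os.path.exists(path):
        os.makedirs(path)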
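
Comic and chapter titles can contain characters that Windows does not allow in file names (such as ":" or "?"), in which case creating the directory fails. The sketch below cleans the names first; `safe_name` and `chapter_dir` are hypothetical helpers, and the demonstration writes to a temporary directory rather than d:/SanMu/.

```python
import os
import re
import tempfile


def safe_name(name: str) -> str:
    """Strip whitespace and replace characters Windows forbids in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', "_", name.strip())
    return cleaned.rstrip(". ") or "untitled"  # Windows also rejects trailing dots and spaces


def chapter_dir(root: str, book_name: str, chapter_name: str) -> str:
    """Create root/book/chapter if needed and return its path."""
    path = os.path.join(root, safe_name(book_name), safe_name(chapter_name))
    os.makedirs(path, exist_ok=True)  # no error if the directory already exists
    return path


# Demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    created = chapter_dir(root, "Example: Book?", "\nChapter 1/2\n")
    print(os.path.relpath(created, root))
```

`os.makedirs(path, exist_ok=True)` does the same job as the exists-then-create check above in a single call.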
                

Then download it. Simple, right?

[Python]

    with open(path + '/' + str(i) + '.jpg', 'wb') as f:  # saved as d:/SanMu/<book name>/<chapter name>/0.jpg
        f.write(get_document(img_src))
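
Names like 0.jpg, 1.jpg, ..., 10.jpg come out as 0, 1, 10, 2 in tools that sort file names as plain text. A minimal sketch, assuming zero-padded names are acceptable; `save_page` and the three-digit width are illustrative choices, not part of the original script.

```python
import os
import tempfile


def save_page(directory: str, page: int, data: bytes, ext: str = ".jpg") -> str:
    """Write one page image as e.g. 007.jpg and return the file path."""
    file_path = os.path.join(directory, f"{page:03d}{ext}")
    with open(file_path, "wb") as f:  # the with block closes the file even on error
        f.write(data)
    return file_path


# Demonstration with dummy bytes in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for page, data in enumerate([b"first", b"second"]):
        save_page(tmp, page, data)
    saved = sorted(os.listdir(tmp))
    print(saved)  # ['000.jpg', '001.jpg']
```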

Finally, here is the complete code:

[Python]

    import math
    import os
    import time

    import requests
    from bs4 import BeautifulSoup


    # Split ls into `each` groups (not called by the code below)
    def split_list(ls, each):
        groups = []
        eachExact = float(each)
        groupCount = int(len(ls) // each)
        groupCountExact = math.ceil(len(ls) / eachExact)
        start = 0
        for i in range(each):
            if i == each - 1 and groupCount < groupCountExact:  # if there is a remainder, put all leftover elements in the last group
                groups.append(ls[start:len(ls)])
            else:
                groups.append(ls[start:start + groupCount])
            start = start + groupCount
        return groups


    # Fetch the page at url and return its raw content, retrying up to twice
    def get_document(url):
        # print(url)
        try:
            get = requests.get(url)
            data = get.content
            get.close()
        except Exception:
            time.sleep(3)
            try:
                get = requests.get(url)
                data = get.content
                get.close()
            except Exception:
                time.sleep(3)
                get = requests.get(url)
                data = get.content
                get.close()
        return data


    def download_img(html):
        chapter_url_list = []
        soup = BeautifulSoup(html, 'html.parser')  # BeautifulSoup pairs nicely with requests
        itemBox = soup.find_all('div', attrs={'class': 'itemBox'})  # find_all returns a list
        for index, item in enumerate(itemBox):  # index is the position in the list, item is the element
            itemTxt = item.find('div', attrs={'class': 'itemTxt'})  # each itemBox has only one itemTxt, so use find
            a = itemTxt.find('a', attrs={'class': 'title'})
            chapter_url = a['href']
            chapter_url_list.append(chapter_url)  # store every comic's URL
            print(str(index + 1) + '.' + a.text)
        number = int(input('Enter the comic number: '))
        # The printed numbers are 1 higher than the list indexes, so subtract 1 to get
        # the chosen comic's URL, then fetch its table-of-contents page
        chapter_html_list = BeautifulSoup(get_document(chapter_url_list[number - 1]), 'html.parser')
        ul = chapter_html_list.find('ul', attrs={'id': 'chapter-list-1'})  # get the ul
        book_name = chapter_html_list.find('h1', attrs={'class': 'title'}).text  # get the comic's title
        li_list = ul.find_all('li')  # get all of its li elements
        for li in li_list:  # loop over the chapters
            # Note: the URL here is incomplete, e.g. /manhua/buhuochongwuniangdezhengquefangfa/1000845.html
            li_a_href = li.find('a')['href']
            i = 0
            path = "d:/SanMu/" + book_name + '/' + li.text.replace('\n', '')
            if not os.path.exists(path):
                os.makedirs(path)
            while True:
                li_a_href_replace = li_a_href
                if i != 0:  # without the -i suffix the URL is the first page
                    li_a_href_replace = li_a_href.replace('.', ('-' + str(i) + '.'))
                print(li_a_href_replace)
                chapter_html = BeautifulSoup(get_document('https://m.gufengmh8.com' + li_a_href_replace), 'html.parser')
                chapter_content = chapter_html.find('div', attrs={'class': 'chapter-content'})
                img_src = chapter_content.find('img')['src']
                if img_src == 'https://res.xiaoqinre.com/images/default/cover.png':  # placeholder image: no more pages
                    break
                with open(path + '/' + str(i) + '.jpg', 'wb') as f:  # saved as d:/SanMu/<book name>/<chapter name>/0.jpg
                    f.write(get_document(img_src))
                i += 1


    download_img(get_document("https://m.gufengmh8.com/search/?keywords=" + str(input("Search for a comic: "))))
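
The complete code defines split_list and imports threading but never uses either. One possible use, and this is only a guess at the intent, is to split the chapter list into groups and hand each group to a worker thread. The sketch below uses the standard library's concurrent.futures; `download_group` is a stand-in that does no real downloading, and the hrefs are invented.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def split_list(ls, each):
    """Split ls into `each` consecutive groups; the last group takes any remainder."""
    group_size = len(ls) // each
    has_remainder = group_size < math.ceil(len(ls) / each)
    groups, start = [], 0
    for i in range(each):
        if i == each - 1 and has_remainder:
            groups.append(ls[start:])
        else:
            groups.append(ls[start:start + group_size])
        start += group_size
    return groups


def download_group(chapter_hrefs):
    """Stand-in worker: a real one would download every page of each chapter."""
    return ["downloaded " + href for href in chapter_hrefs]


chapters = ["/manhua/example/%d.html" % n for n in range(10)]  # invented hrefs
groups = split_list(chapters, 3)

with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(download_group, groups))  # map preserves group order

print([len(g) for g in groups])      # [3, 3, 4]
print(sum(len(r) for r in results))  # 10
```

Each worker gets its own group of chapters, so no two threads write to the same chapter directory.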
                

That's the end. I wonder whether anyone will actually read my post... anyone? no one? someone? (hint, hint)
