A hands-on guide to scraping comics with Python (every step is commented)

Reposted article, 2020-09-03 08:52

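I'm still a beginner myself and this post is nothing advanced, so corrections are welcome.
To write a scraper, in any language, the most important thing is to first find where the content you need sits in the page.
In other words, we need something that identifies it uniquely, such as a tag's id or class. The difference is that an id is meant to be unique, so selecting by id gives you a single element, whereas many elements on one page can share the same class.
So when you select by class, you need to work out which of the matches holds your data. Besides id and class you can also select by tag name, which is broader still. Next, taking comic scraping as the example, we'll write a scraper hand in hand, and I do mean hand in hand (hint hint).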
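To make the id / class / tag-name distinction concrete, here is a minimal sketch using BeautifulSoup on a made-up HTML snippet. The markup, the id `header` and the titles are invented for illustration and are not taken from the target site.

```python
from bs4 import BeautifulSoup

# Invented sample markup: one element with an id, two sharing a class.
html = """
<div id="header">Site header</div>
<div class="itemBox"><a class="title" href="/a">Comic A</a></div>
<div class="itemBox"><a class="title" href="/b">Comic B</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# By id: an id should be unique, so find() returns a single element.
header = soup.find(id="header")

# By class: several elements can share a class, so find_all() returns a list
# and we have to pick the one we want by position.
boxes = soup.find_all("div", class_="itemBox")
second_title = boxes[1].find("a").text

# By tag name: the broadest selection, every div on the page.
all_divs = soup.find_all("div")

print(header.text, len(boxes), second_title, len(all_divs))
```

The same three calls are all the rest of the walkthrough needs; only the ids and class names change.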

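1. First, pick the comic site to scrape. I'm using https://m.gufengmh8.com/ as the example. On its search page you can see that the address is https://m.gufengmh8.com/search/?keywords=完美世界
Whatever follows keywords is the search term, so we can build the URL like this: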

[Python]

url = "https://m.gufengmh8.com/search/?keywords=" + str(input("Search for a comic: "))  # input() waits for the user and returns what they typed
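One caveat with plain string concatenation is that a search term may contain spaces or non-ASCII characters. The sketch below builds the same URL with percent-encoding from the standard library. `build_search_url` is a hypothetical helper name, not part of the original code, and it assumes the site accepts UTF-8 encoded keywords.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

SEARCH_URL = "https://m.gufengmh8.com/search/"

def build_search_url(keyword):
    """Return the search URL with the keyword safely percent-encoded."""
    # urlencode turns {"keywords": "one piece"} into "keywords=one+piece"
    return SEARCH_URL + "?" + urlencode({"keywords": keyword})

url = build_search_url("完美世界")
print(url)

# Decoding the query string gives the original keyword back.
print(parse_qs(urlsplit(url).query)["keywords"][0])
```

In the real script the keyword would come from `input()` instead of a literal.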

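2. Now let's dissect this page. What do we need from it? To keep things simple, we only scrape the comic titles for the user to choose from; there aren't many comics with the same name anyway (really I'm just lazy).
Press F12 in the browser to open the developer tools, click the element picker (the small arrow icon) and then a search result. You'll see that its class is itemBox, so collecting every div with class itemBox on the page gives us all the information for each comic.
Here we only take the title. Click the title with the picker and you'll see that the title is the text of an a tag. So the logic is clear: find every itemBox, loop over them to get the itemTxt inside each one, then take the title link (the a tag) inside that itemTxt.
Our code now looks like this: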

[Python]

import math
import threading
import time
import os

import requests
from bs4 import BeautifulSoup


# Fetch the page at url and return its raw content
def get_document(url):
    # print(url)
    for attempt in range(3):  # three attempts in total
        try:
            response = requests.get(url)  # open the connection
            data = response.content  # read the content
            response.close()  # close the connection
            return data
        except requests.RequestException:  # on an error, retry
            if attempt == 2:
                raise  # third failure: give up and let the error propagate
            time.sleep(3)  # sleep 3 seconds to give the site time to respond


# Download comics (at this stage it only lists the search results)
def download_img(html):
    soup = BeautifulSoup(html, 'html.parser')  # BeautifulSoup pairs nicely with requests
    itemBox = soup.find_all('div', attrs={'class': 'itemBox'})  # find_all returns a list
    for index, item in enumerate(itemBox):  # index is the position in the list, item is the element
        itemTxt = item.find('div', attrs={'class': 'itemTxt'})  # each itemBox holds only one itemTxt, so find is enough
        a = itemTxt.find('a', attrs={'class': 'title'}).text
        print(str(index + 1) + '.' + a)


# download_img(get_document("https://m.gufengmh8.com/search/?keywords=" + str(input("Search for a comic: "))))
download_img(get_document("https://m.gufengmh8.com/search/?keywords=完美世界"))  # no explanation needed for this one

Running it prints the following:

1.捕获宠物娘的正确方法
2.吾猫当仙
3.百诡谈
4.完美世界PERFECTWORLD
5.诛仙·御剑行
6.洞仙歌
7.猫仙生
8.完美世界
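The nested find calls (itemBox, then itemTxt, then the title link) can also be written as a single CSS selector with BeautifulSoup's `select`. This is an alternative sketch, not the approach the post uses. The sample markup is invented to mirror the class names described above, and `list_titles` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Invented markup that mirrors the itemBox / itemTxt / title structure.
SAMPLE_HTML = """
<div class="itemBox">
  <div class="itemTxt"><a class="title" href="https://example.com/manhua/a/">Comic A</a></div>
</div>
<div class="itemBox">
  <div class="itemTxt"><a class="title" href="https://example.com/manhua/b/">Comic B</a></div>
</div>
"""

def list_titles(html):
    """Return (title, url) pairs for every search result in the page."""
    soup = BeautifulSoup(html, "html.parser")
    # "a.title inside div.itemTxt inside div.itemBox"
    links = soup.select("div.itemBox div.itemTxt a.title")
    return [(link.get_text(strip=True), link["href"]) for link in links]

results = list_titles(SAMPLE_HTML)
for number, (title, url) in enumerate(results, start=1):
    print(f"{number}.{title}")
```

Returning the pairs instead of printing inside the loop also makes it easy to reuse the URLs in the next step.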

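3. We now have a basic search feature, which already counts as a simple scraper. Next we let the user enter a comic's number and then download it.
Open any comic and inspect it the same way as before: the ul with id chapter-list-1 holds all the chapters, and each li in it contains an a tag and a span tag, which give the chapter's URL and name respectively. With that we can keep writing: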

[Python]

def download_img(html):
    chapter_url_list = []
    soup = BeautifulSoup(html, 'html.parser')  # BeautifulSoup pairs nicely with requests
    itemBox = soup.find_all('div', attrs={'class': 'itemBox'})  # find_all returns a list
    for index, item in enumerate(itemBox):  # index is the position in the list, item is the element
        itemTxt = item.find('div', attrs={'class': 'itemTxt'})  # each itemBox holds only one itemTxt, so find is enough
        a = itemTxt.find('a', attrs={'class': 'title'})
        chapter_url = a['href']
        chapter_url_list.append(chapter_url)  # remember every comic's URL
        print(str(index + 1) + '.' + a.text)
    number = int(input('Enter the comic number: '))
    # The printed numbers are one higher than the list indexes, so subtract 1
    # to get the chosen comic's URL, then fetch its chapter-list page
    chapter_html = BeautifulSoup(get_document(chapter_url_list[number - 1]), 'html.parser')
    ul = chapter_html.find('ul', attrs={'id': 'chapter-list-1'})  # the ul
    li_list = ul.find_all('li')  # every li inside it
    for li in li_list:  # iterate over the chapters
        # Note: this URL is relative, e.g. /manhua/buhuochongwuniangdezhengquefangfa/1000845.html
        li_a_href = li.find('a')['href']
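Because the chapter hrefs are relative, they have to be joined with the site root before they can be fetched. String concatenation works here; as a slightly more defensive sketch, `urllib.parse.urljoin` handles both relative and already-absolute links. `absolute_url` is a hypothetical helper name, and the `other.example` address is made up.

```python
from urllib.parse import urljoin

BASE_URL = "https://m.gufengmh8.com"

def absolute_url(href, base=BASE_URL):
    """Resolve a possibly relative href against the site root."""
    return urljoin(base, href)

relative = "/manhua/buhuochongwuniangdezhengquefangfa/1000845.html"
print(absolute_url(relative))

# An href that is already absolute is returned unchanged.
print(absolute_url("https://other.example/x.html"))
```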

4. Now open any chapter and find where the comic image sits in the page:

[Python]

# (this continues inside the "for li in li_list" loop)
chapter_html = BeautifulSoup(get_document('https://m.gufengmh8.com' + li_a_href), 'html.parser')
chapter_content = chapter_html.find('div', attrs={'class': 'chapter-content'})
img_src = chapter_content.find('img')['src']
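`find` returns None when nothing matches, so chaining `.find('img')['src']` raises an error if the page layout differs from what we expect. A hedged sketch of a safer lookup follows; `extract_image_src` is a hypothetical helper name and the sample markup is invented.

```python
from bs4 import BeautifulSoup

def extract_image_src(html):
    """Return the src of the first image in div.chapter-content, or None."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find("div", class_="chapter-content")
    if content is None:  # the container is missing
        return None
    img = content.find("img")
    if img is None:  # the container has no image
        return None
    return img.get("src")  # .get() returns None if src is absent

page = '<div class="chapter-content"><img src="https://example.com/1.jpg"></div>'
print(extract_image_src(page))
print(extract_image_src("<p>unexpected page</p>"))
```

The caller can then treat None as "stop or skip" instead of crashing halfway through a chapter.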
        

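Now we finally have the image's src, but there is one more problem: a chapter is split across several pages, so...
After some digging I found that when the requested page doesn't exist, the site shows a default placeholder image instead. So we just keep looping until the image we get is that placeholder and then stop, like this ↓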

[Python]

i = 0  # page counter; 0 is the first page
while True:
    li_a_href_replace = li_a_href
    if i != 0:  # without "-i" the URL points at the first page
        # Replace "." with "-1.": https://m.gufengmh8.com/manhua/wanmeishijiePERFECTWORLD/549627.html
        # becomes https://m.gufengmh8.com/manhua/wanmeishijiePERFECTWORLD/549627-1.html, the second page.
        # li_a_href is the relative path, so it contains only one ".".
        li_a_href_replace = li_a_href.replace('.', ('-' + str(i) + '.'))
    print(li_a_href_replace)
    chapter_html = BeautifulSoup(get_document('https://m.gufengmh8.com' + li_a_href_replace), 'html.parser')
    chapter_content = chapter_html.find('div', attrs={'class': 'chapter-content'})
    img_src = chapter_content.find('img')['src']
    if img_src == 'https://res.xiaoqinre.com/images/default/cover.png':  # placeholder image: this page does not exist
        break
    i += 1  # move on to the next page
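The page-numbering rule can be isolated in a small function. `str.replace('.', ...)` is fine for the relative path, but it would also rewrite the dots in a host name if it were ever given a full URL. The sketch below only touches the last dot. `page_path` is a hypothetical helper name.

```python
def page_path(chapter_href, page_index):
    """Return the path of page `page_index` of a chapter (0 is the first page)."""
    if page_index == 0:
        return chapter_href  # the first page has no "-N" suffix
    stem, dot, extension = chapter_href.rpartition(".")
    if not dot:
        raise ValueError(f"no file extension in {chapter_href!r}")
    return f"{stem}-{page_index}{dot}{extension}"

href = "/manhua/wanmeishijiePERFECTWORLD/549627.html"
print(page_path(href, 0))
print(page_path(href, 1))

# With a full URL, replacing every dot would also change the host name.
full = "https://m.gufengmh8.com/x/1.html"
print(full.replace(".", "-1."))
print(page_path(full, 1))
```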
        

5. Now that we can get the src of every comic image, all that's left is to download them. First create the directory:

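[Python]

# book_name is the comic's title (read from the chapter-list page in the complete code);
# li.text is the chapter name, with its line breaks removed
path = "d:/SanMu/" + book_name + '/' + li.text.replace('\n', '')
if not os.path.exists(path):
    os.makedirs(path)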
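Comic and chapter names come straight from the page, so they may contain characters that Windows does not allow in folder names (such as `:` or `?`). Below is a minimal sketch of cleaning the names before creating the folder. `safe_name` and `chapter_dir` are hypothetical helper names, and the temporary directory stands in for d:/SanMu/.

```python
import os
import re
import tempfile

def safe_name(name):
    """Remove characters that are not allowed in Windows file names."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', "", name)
    return cleaned.strip().rstrip(".") or "untitled"

def chapter_dir(root, book_name, chapter_name):
    """Create root/<book>/<chapter> if needed and return the path."""
    path = os.path.join(root, safe_name(book_name), safe_name(chapter_name))
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return path

root = tempfile.mkdtemp()  # stand-in for "d:/SanMu/"
created = chapter_dir(root, "完美世界", "第1话: 开始?\n")
print(created)
```

`exist_ok=True` replaces the separate `os.path.exists` check.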
        

Then download. Easy, right?

[Python]

with open(path + '/' + str(i) + '.jpg', 'wb') as f:  # saved as d:/SanMu/<book name>/<chapter name>/0.jpg
    f.write(get_document(img_src))
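The download line always saves with a `.jpg` extension, while the placeholder URL in this post ends in `.png`, so other extensions may occur. As an optional sketch, the extension can be taken from the image URL with a fallback to `.jpg`. `image_filename` and `save_image` are hypothetical helper names, and the `example.com` addresses are made up.

```python
import os
from urllib.parse import urlsplit

KNOWN_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}

def image_filename(index, img_src, default_ext=".jpg"):
    """Build '<index><ext>' using the extension found in the image URL."""
    ext = os.path.splitext(urlsplit(img_src).path)[1].lower()
    if ext not in KNOWN_EXTENSIONS:
        ext = default_ext  # unknown or missing extension
    return f"{index}{ext}"

def save_image(path, data):
    """Write the downloaded bytes and close the file afterwards."""
    with open(path, "wb") as f:
        f.write(data)

print(image_filename(0, "https://res.xiaoqinre.com/images/default/cover.png"))
print(image_filename(3, "https://example.com/img/3.JPG?x=1"))
print(image_filename(1, "https://example.com/img/noext"))
```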

Finally, here is the complete code:

[Python]

import math
import threading  # imported but not used below
import time
import os

import requests
from bs4 import BeautifulSoup


# Split ls into `each` groups (defined here but not called below)
def split_list(ls, each):
    groups = []
    eachExact = float(each)
    groupCount = len(ls) // each
    groupCountExact = math.ceil(len(ls) / eachExact)
    start = 0
    for i in range(each):
        if i == each - 1 and groupCount < groupCountExact:  # if there is a remainder, put all remaining elements into the last group
            groups.append(ls[start:len(ls)])
        else:
            groups.append(ls[start:start + groupCount])
        start = start + groupCount

    return groups


# Fetch the page at url and return its raw content
def get_document(url):
    # print(url)
    for attempt in range(3):  # three attempts in total
        try:
            response = requests.get(url)
            data = response.content
            response.close()
            return data
        except requests.RequestException:
            if attempt == 2:
                raise  # third failure: give up and let the error propagate
            time.sleep(3)  # wait 3 seconds before retrying


def download_img(html):
    chapter_url_list = []
    soup = BeautifulSoup(html, 'html.parser')  # BeautifulSoup pairs nicely with requests
    itemBox = soup.find_all('div', attrs={'class': 'itemBox'})  # find_all returns a list
    for index, item in enumerate(itemBox):  # index is the position in the list, item is the element
        itemTxt = item.find('div', attrs={'class': 'itemTxt'})  # each itemBox holds only one itemTxt, so find is enough
        a = itemTxt.find('a', attrs={'class': 'title'})
        chapter_url = a['href']
        chapter_url_list.append(chapter_url)  # remember every comic's URL
        print(str(index + 1) + '.' + a.text)
    number = int(input('Enter the comic number: '))
    # The printed numbers are one higher than the list indexes, so subtract 1
    # to get the chosen comic's URL, then fetch its chapter-list page
    chapter_html_list = BeautifulSoup(get_document(chapter_url_list[number - 1]), 'html.parser')
    ul = chapter_html_list.find('ul', attrs={'id': 'chapter-list-1'})  # the ul
    book_name = chapter_html_list.find('h1', attrs={'class': 'title'}).text  # the comic's title
    li_list = ul.find_all('li')  # every li inside the ul
    for li in li_list:  # iterate over the chapters
        # Note: this URL is relative, e.g. /manhua/buhuochongwuniangdezhengquefangfa/1000845.html
        li_a_href = li.find('a')['href']
        i = 0
        path = "d:/SanMu/" + book_name + '/' + li.text.replace('\n', '')
        if not os.path.exists(path):
            os.makedirs(path)
        while True:
            li_a_href_replace = li_a_href
            if i != 0:  # without "-i" the URL points at the first page
                li_a_href_replace = li_a_href.replace('.', ('-' + str(i) + '.'))
            print(li_a_href_replace)
            chapter_html = BeautifulSoup(get_document('https://m.gufengmh8.com' + li_a_href_replace), 'html.parser')
            chapter_content = chapter_html.find('div', attrs={'class': 'chapter-content'})
            img_src = chapter_content.find('img')['src']
            if img_src == 'https://res.xiaoqinre.com/images/default/cover.png':  # placeholder image: this page does not exist
                break
            with open(path + '/' + str(i) + '.jpg', 'wb') as f:  # saved as d:/SanMu/<book name>/<chapter name>/0.jpg
                f.write(get_document(img_src))
            i += 1


download_img(get_document("https://m.gufengmh8.com/search/?keywords=" + str(input("Search for a comic: "))))
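The complete listing imports threading and defines split_list without ever calling either, which suggests that splitting the chapters across threads was planned. That is an assumption; the post does not say so. Under that assumption, a minimal sketch of the idea could look like the following, where `split_evenly`, `process_group`, `run_in_threads` and `fake_download` are hypothetical names and the "download" is only simulated.

```python
import threading

def split_evenly(items, group_count):
    """Split items into group_count consecutive groups; the last one takes any remainder."""
    size = len(items) // group_count
    groups = []
    start = 0
    for i in range(group_count):
        end = len(items) if i == group_count - 1 else start + size
        groups.append(items[start:end])
        start = end
    return groups

def process_group(group, worker):
    """Run worker on every item of one group, in order."""
    for item in group:
        worker(item)

def run_in_threads(items, worker, thread_count=4):
    """Give each thread one group of items and wait for all of them to finish."""
    threads = []
    for group in split_evenly(items, thread_count):
        thread = threading.Thread(target=process_group, args=(group, worker))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()

# Simulated work: record which "chapters" were handled.
done = []
lock = threading.Lock()

def fake_download(chapter):
    with lock:  # protect the shared list
        done.append(chapter)

run_in_threads(list(range(10)), fake_download, thread_count=3)
print(sorted(done))
```

In the real script the worker would download one chapter, and a small thread count would be kinder to the site.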
        

And that's the end. I wonder whether anyone will actually read my post... anyone? No one? Someone? ~~~ (hint hint)
