A treat for voice lovers! Scraping audio data from 貓耳FM with Python, and comparing a multithreaded crawler against a plain one for speed


Preface

Today's "victim" is 貓耳FM (missevan.com), an audio website.

URL: https://www.missevan.com/sound/m/110

Modules used:

  • requests
  • time
  • re
  • concurrent.futures

Development environment:

  • Version: Anaconda 5.2.0 (Python 3.6.5)
  • Editor: PyCharm

Importing the modules

import time
import requests
import concurrent.futures
import re

 

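Each step is implemented as its own function.

Sending the request

def get_html(url):
    # Fetch the URL and return the requests.Response object
    response = requests.get(url)
    return response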
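Network requests can fail intermittently, so a crawler often wraps its fetch function in a retry loop. The following is a minimal sketch, not part of the original script: `fetch_with_retry`, its `retries` and `delay` parameters, and the linear backoff are all assumptions made for illustration. It accepts any fetch callable, so `get_html` could be passed in as-is.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=0.5):
    """Call fetch(url), retrying on failure (hypothetical helper).

    `fetch` is any callable taking a URL, such as get_html.
    """
    if retries < 1:
        raise ValueError('retries must be at least 1')
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # real code could catch requests.RequestException only
            last_error = error
            if attempt < retries:
                # Linear backoff: wait a little longer after each failed attempt
                time.sleep(delay * attempt)
    raise last_error
```

A call would look like `response = fetch_with_retry(get_html, url)`. Note that `requests.get` only raises for connection-level problems; HTTP error statuses would need `response.raise_for_status()` inside the fetch function to trigger a retry.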

 

First parse

def parse(response):
    # Extract every sound ID from the player links on the listing page
    mp3_ids = re.findall('<a target="_player" href="/sound/(.*?)" title=".*?">', response.text)
    return mp3_ids
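To see what the regular expression captures, it can be run against a small piece of HTML without touching the network. The sketch below reuses the pattern from `parse`; the sample markup, the second ID, and the name `extract_sound_ids` are made up for illustration and may not match the real page exactly.

```python
import re

# Pattern copied from parse(); compiled once so it can be reused per page.
SOUND_LINK = re.compile(r'<a target="_player" href="/sound/(.*?)" title=".*?">')


def extract_sound_ids(html):
    """Return the sound IDs found in a listing page, in page order."""
    return SOUND_LINK.findall(html)


# Hypothetical sample markup, only loosely modeled on the listing page.
sample_html = '''
<a target="_player" href="/sound/3922170" title="Demo track A"></a>
<a target="_player" href="/sound/3922171" title="Demo track B"></a>
<a href="/about" title="Not a sound link"></a>
'''

print(extract_sound_ids(sample_html))  # → ['3922170', '3922171']
```

The third link is skipped because it lacks the `target="_player"` attribute that the pattern requires.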

 

Second parse

def parse_2(response):
    # The detail endpoint returns JSON; pull out the title and the audio URL
    json_data = response.json()
    title = json_data['info']['sound']['soundstr']
    soundurl = json_data['info']['sound']['soundurl']
    return title, soundurl
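Indexing straight into the JSON raises a bare `KeyError` if the response has a different shape, for example when a sound is unavailable. Here is a hedged sketch of a more defensive variant. The key names `info`, `sound`, `soundstr`, and `soundurl` come from `parse_2`; the function name `extract_title_and_url` and the sample payload are assumptions for illustration.

```python
import json


def extract_title_and_url(payload):
    """Pull (title, soundurl) out of an already-decoded JSON payload."""
    sound = payload.get('info', {}).get('sound', {})
    title = sound.get('soundstr')
    soundurl = sound.get('soundurl')
    if not title or not soundurl:
        # Fail with a clear message instead of a bare KeyError
        raise ValueError('unexpected response shape: missing soundstr or soundurl')
    return title, soundurl


# Made-up payload that mirrors only the keys used above.
sample_payload = json.loads(
    '{"info": {"sound": {"soundstr": "Demo track", '
    '"soundurl": "https://example.com/demo.mp3"}}}'
)

print(extract_title_and_url(sample_payload))
```

With a real response, the payload would come from `response.json()`.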

 

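Saving the data

def save(title, mp3_data):
    # Note: the mp3 folder must already exist, and the backslash separator is Windows-specific
    with open('mp3\\' + title + '.mp3', mode='wb') as f:
        f.write(mp3_data)
        print(title, 'download complete!')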
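The hard-coded `'mp3\\'` prefix only works on Windows and fails if the folder is missing. A portable alternative can be sketched with `pathlib`; the name `save_mp3` and the `folder` parameter are hypothetical and not part of the original script.

```python
from pathlib import Path


def save_mp3(title, mp3_data, folder='mp3'):
    """Write binary audio data to <folder>/<title>.mp3 and return the path."""
    target_dir = Path(folder)
    # Create the output folder on first use instead of assuming it exists
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f'{title}.mp3'
    target.write_bytes(mp3_data)
    return target
```

Calling `save_mp3(title, mp3_data)` then behaves the same on Windows, macOS, and Linux, since `Path` picks the right separator.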

 

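Sanitizing the title

def change_title(title):
    # Replace characters that are not allowed in Windows file names, including the backslash
    new_title = re.sub(r'[\\/|:?<>"*]', '_', title)
    return new_title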
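A quick way to check the sanitizing step is to feed it a title containing several forbidden characters. In this sketch the character class lists both the backslash and the forward slash explicitly; the name `sanitize_title` and the sample title are invented for illustration.

```python
import re

# Characters that Windows does not allow in file names.
INVALID_CHARS = re.compile(r'[\\/|:?<>"*]')


def sanitize_title(title):
    """Replace every forbidden file-name character with an underscore."""
    return INVALID_CHARS.sub('_', title)


print(sanitize_title('AC/DC: "Back\\In*Black"?'))  # → AC_DC_ _Back_In_Black__
```

Titles that differ only in forbidden characters map to the same file name, so one download could overwrite another; appending the sound ID to the name would be one way around that.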

 

Main function: run() ties all the steps together for one listing page

def run(url):
    # 1. Send the request for the listing page
    response = get_html(url)
    # 2. Parse the sound IDs
    mp3_ids = parse(response)
    for mp3_id in mp3_ids:
        # 3. Build the detail URL, e.g. https://www.missevan.com/sound/getsound?soundid=3922170
        mp3_url = 'https://www.missevan.com/sound/getsound?soundid=' + mp3_id
        resp_2 = get_html(mp3_url)
        # 4. Parse the audio URL and the audio title
        title, soundurl = parse_2(resp_2)
        # Sanitize the title so it can be used as a file name
        title = change_title(title)
        # 5. Request the audio URL; .content holds the binary audio data
        mp3_data = get_html(soundurl).content
        # 6. Save the file locally
        save(title, mp3_data)

 

Pagination

start_time = time.time()
# range(1, 5) covers pages 1 through 4
for page in range(1, 5):
    print(f'----------Crawling page {page}-------------')
    run(f'https://www.missevan.com/sound/m?id=110&p={page}')
print('Total time taken:', time.time()-start_time)

 

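Multithreading

if __name__ == '__main__':
    start_time = time.time()
    # Only four tasks are submitted (one per page), so at most four worker threads are started
    with concurrent.futures.ThreadPoolExecutor(max_workers=1000) as executor:
        for page in range(1, 5):
            url = f'https://www.missevan.com/sound/m?id=110&p={page}'
            executor.submit(run, url)
    # Leaving the with block waits for all submitted tasks to finish
    print('Total time taken:', time.time()-start_time)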
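The speed difference comes from threads waiting on the network in parallel rather than one after another. The sketch below reproduces that effect without any real requests: `fake_download` is a stand-in that sleeps instead of downloading, and all the function names and the 0.1 second delay are assumptions for illustration. It also collects `future.result()`, which re-raises any exception from a worker; calling `executor.submit` alone would let such errors pass unnoticed.

```python
import concurrent.futures
import time


def fake_download(page, delay=0.1):
    """Stand-in for run(url): sleeps to imitate waiting on the network."""
    time.sleep(delay)
    return f'page {page} done'


def crawl_sequential(pages):
    """Process the pages one after another."""
    return [fake_download(page) for page in pages]


def crawl_threaded(pages, max_workers=4):
    """Process the pages concurrently in a thread pool."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(fake_download, page) for page in pages]
        # result() blocks until the task is done and re-raises worker exceptions
        return [future.result() for future in futures]


def timed(func, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


pages = range(1, 5)
sequential_result, sequential_time = timed(crawl_sequential, pages)
threaded_result, threaded_time = timed(crawl_threaded, pages)
print(f'sequential: {sequential_time:.2f}s, threaded: {threaded_time:.2f}s')
```

With four pages, the sequential run sleeps four times in a row while the threaded run overlaps the four sleeps, so it should take roughly a quarter of the time. Real downloads will vary with bandwidth and server limits.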

 

With multithreading, the crawl finished about one minute faster.

 

