X音 Scraper

Open-source repository

https://gitee.com/erma0/douyin

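Introduction

Python fetches the data + Vue provides the UI + Aria2 handles the downloads

Given any kind of X音 link or ID, the tool collects video posts through the web API and downloads them to the local disk.

Supports user profile links or sec_uid, as well as hashtag challenge and music (original sound) links or IDs.

Supports downloading a user's liked list (the liked list must be publicly visible).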

Usage

0x00 Install dependencies

Open a command line in the program directory and run:

pip install -r requirements.txt

0x01 Use the UI

Double-click the startup batch file (啟動.bat), or open a command line in the program directory and run:

python ui.py

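0x02 Edit the relevant parameters in douyin.py directly

If you do not know Python at all, use the command line or the UI instead.

0x03 Use exec.py from the command line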

Run it directly to see the command list, or use the -h flag to see help:

python exec.py
python exec.py -h
python exec.py download -h
python exec.py download_batch -h

Call the program by function name:

--type sets the download type, default: --type=user
--limit sets the number of posts to collect, default: --limit=0 (no limit)

For example, collect all posts of a user:

python exec.py download https://v.douyin.com/xxxx/
python exec.py download <sec_uid>

For example, collect the first 10 posts a user liked:

python exec.py download MS4wLjABAAAAl7TJWjJJrnu11IlllB6Mi5V9VbAsQo1N987guPjctc8 --type=like --limit=10
python exec.py download <sec_uid>

For example, collect the first 10 posts for a music track:

python exec.py download https://v.douyin.com/xxxx/ --type=music --limit=10
python exec.py download <music_id> --type=music --limit=10

TODO

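[x] Collect user posts
[x] Download through Aria2
[x] Collect hashtag/music posts
[x] Collect liked posts
[x] Batch collection from an imported file
[x] Command-line invocation
[x] UI built with webview
[x] ~~Package as an exe~~ Dropped; installing a Python environment is simpler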

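Notes

X音-related

The web API is back; a single request returns the data
The UID is almost useless now: it can no longer be used to build a profile link, and every endpoint takes sec_uid
The signature can now be a fixed value, so there is no need to extract it from the JS anymore
Posts now include the watermark-free video URL directly, and the redirect works without a mobile UA
Hashtag/music post counts
2021.04.02 The liked list returns data as well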
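
Since every endpoint is keyed on sec_uid (or on a hashtag/music ID), the first step is pulling that value out of the redirected link. The sketch below mirrors the parsing approach in the source code further down (a `sec_uid` query parameter for users, the fourth path segment for hashtags and music). The function names and the example URLs are illustrative assumptions, not part of the project.

```python
from urllib.parse import parse_qs, urlparse


def extract_sec_uid(url: str) -> str:
    """Return the sec_uid query parameter of a profile URL, or '' if absent."""
    query = parse_qs(urlparse(url).query)
    return query.get("sec_uid", [""])[0]


def extract_path_id(url: str) -> str:
    """Return the fourth '/'-separated path segment, or '' if the path is too short."""
    parts = urlparse(url).path.split("/")
    return parts[3] if len(parts) > 3 else ""


# Hypothetical URL shapes, used only to exercise the helpers.
print(extract_sec_uid("https://example.com/share/user/1?sec_uid=abc&x=1"))
print(extract_path_id("https://example.com/share/music/6928362875564067592"))
```

Returning an empty string on failure matches how the original code falls back to `self.id = ''`.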

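Aria2-related

The aria2p library is pleasant to use
Most Aria2 download tools work through the RPC interface, and this one does too
You have to download Aria2c.exe yourself to run the service, so the code starts the service automatically
Ways to skip a download when the file already exists:
--auto-file-renaming=false works, but produces an error message in the console, although the error is harmless
-c works, and produces no console error
When adding a download task, set the file name with options = {'out': filename}, the equivalent of -o
Aria2 automatically creates the download directory from the given path and file name
The path and file name passed to Aria2 must not contain illegal characters (*, |, etc.), hence the Download.title2path static method
Event listening must be stopped manually; otherwise it blocks the process and the program cannot exit
I found no function for getting task progress and download speed in real time, so I wrote my own polling loop with a callback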
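
To make the RPC notes above concrete, here is a minimal sketch of the JSON-RPC request that an `aria2.addUri` call boils down to. The project itself goes through aria2p, so this is not its code. The helper name `build_add_uri_request` is hypothetical, and the sketch only builds the payload without sending it. The `dir`, `out`, and `continue` keys correspond to the `-d`, `-o`, and `-c` command-line options.

```python
import json
from typing import Optional


def build_add_uri_request(url: str, directory: str, filename: str,
                          secret: Optional[str] = None, request_id: str = "1") -> str:
    """Build the JSON-RPC 2.0 body for aria2.addUri (hypothetical helper)."""
    options = {
        "dir": directory,    # same as -d: target directory
        "out": filename,     # same as -o: output file name
        "continue": "true",  # same as -c: resume or skip an existing file
    }
    params = [[url], options]
    if secret:
        # When the server was started with --rpc-secret, the token goes first.
        params.insert(0, "token:" + secret)
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "aria2.addUri",
        "params": params,
    })


print(build_add_uri_request("http://example.com/a.mp4", "videos", "a.mp4"))
```

The resulting string would be POSTed to the RPC endpoint, which is `http://localhost:6800/jsonrpc` with aria2's default port.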

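Python-related

Use os.popen or subprocess.Popen to launch a program as a child process, with no window and without blocking
When overriding __init__ in a subclass, call the parent constructor with super().__init__()
When overriding methods in a subclass, private (double-underscore) methods cannot be overridden and private members cannot be read directly, because their names are mangled per class
Type hints on parameters are handy: they enable autocompletion when calling the parameter's methods
if 'PROGRAMFILES(X86)' in os.environ is a simple check for whether a Windows system is 64-bit
Pylance's auto-import feature is very useful, though it seems to work only intermittently; toggling it off and on again brings it back
VS Code's default launch path is the current project path; adding "cwd": "${fileDirname}", to launch.json runs from the file's directory instead, but then Pylance autocompletion can no longer resolve relative directories
Generate the current project's dependencies in one step with pipreqs: switch to the project path in cmd and run pipreqs ./ --encoding=utf-8 --force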
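
The two inheritance notes above can be checked with a small experiment. The classes below are purely illustrative (`Base`, `Child`, and their members are made-up names). They show `super().__init__()` in an overridden constructor, and how double-underscore names are mangled per class, so a subclass neither overrides nor directly reads them.

```python
class Base:
    def __init__(self, name: str):
        self.name = name
        self.__token = "base-secret"  # stored on the instance as _Base__token

    def __helper(self) -> str:  # stored on the class as _Base__helper
        return "base helper"

    def run(self) -> str:
        return self.__helper()  # compiled as self._Base__helper()


class Child(Base):
    def __init__(self, name: str, limit: int = 0):
        super().__init__(name)  # run the parent constructor first
        self.limit = limit

    def __helper(self) -> str:  # becomes _Child__helper, so Base.run never calls it
        return "child helper"

    def read_token_directly(self) -> str:
        return self.__token  # compiled as self._Child__token, which does not exist

    def read_token_mangled(self) -> str:
        return self._Base__token  # spelling out the mangled name does work


child = Child("demo", limit=10)
print(child.run())                 # still uses Base's private helper
print(child.read_token_mangled())  # reachable only through the mangled name
try:
    child.read_token_directly()
except AttributeError as exc:
    print("AttributeError:", exc)
```

A single leading underscore avoids the mangling and is the usual choice when subclasses are meant to override a helper.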

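Command-line module fire

The simplest approach is a bare fire.Fire(), which exposes every function
If you expose a class or an object, the class parameters have to be specified separately
Grouped commands require separate classes, with the exposed class holding the classes for the grouped commands; for this batch-download scenario that felt cumbersome, so I simply added a parameter and split the work into two functions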
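
The "two functions" layout described above might look roughly like the sketch below. It is an assumption about the shape of exec.py, whose source is not shown here: the function bodies are placeholders, and only the command names `download` and `download_batch` and the `--type` / `--limit` flags come from the usage section. Passing a dict to `fire.Fire` exposes just the listed functions instead of everything in the module.

```python
def download(target: str, type: str = "user", limit: int = 0) -> str:
    """Describe one download job; a placeholder for the real collection logic."""
    # The parameter is named `type` so that the CLI flag is --type.
    return "download {} type={} limit={}".format(target, type, limit)


def download_batch(path: str, type: str = "user", limit: int = 0) -> list:
    """Run download() for every non-empty line of a text file."""
    with open(path, encoding="utf-8") as f:
        targets = [line.strip() for line in f if line.strip()]
    return [download(t, type=type, limit=limit) for t in targets]


def main() -> None:
    import fire  # third-party: pip install fire

    # Each dict key becomes a sub-command with keyword arguments as flags.
    fire.Fire({"download": download, "download_batch": download_batch})
```

Calling `main()` at the bottom of the script would then accept commands such as `python exec.py download <sec_uid> --type=like --limit=10`.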

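UI module pywebview

You can expose a class instance to the page as js_api and call Python functions with pywebview.api.func().then(() => {})
You can also expose a server instance such as Flask to the page as js_api (no url parameter needed) and serve index.html internally
From Python, call JS with window.evaluate_js('JS code')
In the UI, the class cannot take arguments at initialization, so __init__ has to be redefined
In the UI, class instance methods that should be public must not start with an underscore (_)
The window width and height set when creating the UI seem to differ from the size inside the web page; the values need to be somewhat larger than the page's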
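
A minimal sketch of the js_api pattern from the notes above, assuming pywebview is installed. The `Api` class, its methods, and the HTML are invented for illustration and are not taken from the project's ui.py. The class takes no constructor arguments and sets its defaults inside `__init__`, and the underscore-prefixed method is the one that stays hidden from the page.

```python
class Api:
    def __init__(self):
        # No constructor arguments: defaults are set here instead.
        self.limit = 0

    def set_limit(self, limit):
        """Public method: callable from the page as pywebview.api.set_limit(...)."""
        self.limit = int(limit)
        return {"limit": self.limit}

    def _internal(self):
        """Leading underscore: not exposed to JavaScript."""
        return "not exposed"


HTML = """
<button onclick="pywebview.api.set_limit(10).then(r => { document.title = 'limit ' + r.limit; })">
  Set limit to 10
</button>
"""


def main() -> None:
    import webview  # third-party: pip install pywebview

    # The window size is set a little larger than the page content needs.
    webview.create_window("Demo", html=HTML, js_api=Api(), width=820, height=640)
    webview.start()
```

Calling `main()` opens the window; the Python return value arrives in JavaScript as the resolved value of the promise.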

Partial source code of the X音 scraper

# -*- encoding: utf-8 -*-
'''
@File    :   douyin.py
@Time    :   2021-03-12 18:16:57 (Friday)
@Author  :   erma0
@Version :   1.0
@Link    :   https://erma0.cn
@Desc    :   X音 user post collection
'''
import json
import os
import time
from urllib.parse import parse_qs, urlparse

import requests

from download import Download

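class Douyin(object):
    """
    X音 user class
    Collects lists of posts
    """
    def __init__(self, param: str, limit: int = 0):
        """
        Initialize the task information
        The parameter type is detected automatically: ID or URL
        """
        self.limit = limit
        self.http = requests.Session()
        self.url = ''
        self.type = 'unknow'
        self.download_path = '暫未定義目錄'  # placeholder meaning "directory not yet defined"
        # ↑ Predefined attributes, so they are never undefined when accessed ↑
        self.param = param.strip()
        self.sign = 'TG2uvBAbGAHzG19a.rniF0xtrq'  # the signature can be a fixed value
        self.__get_type()  # Determine the current task type: link or ID
        self.aria2 = Download()  # Initialize the Aria2 download service; the directory is supplied later, together with the file name
        self.has_more = True
        self.finish = False
        # A dict makes it easy to store, look up, and modify entries keyed by ID, but tables expect a list
        self.videosL = []  # list format
        # self.videos = {}  # dict format
        self.gids = {}  # maps each gid to a post index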

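    def __get_type(self):
        """
        Determine the current task type
        Link or ID
        """
        if '://' in self.param:  # link
            self.__url2redirect()
        else:  # ID
            self.id = self.param

    def __url2redirect(self):
        """
        Fetch the 302 redirect target
        Convert a short link to a long link
        """
        headers = {  # Posts used to need parsing to strip the watermark, which required a mobile UA; no longer necessary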
            'User-Agent':
            'Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/604.1 Edg/89.0.4389.82'
        }
        try:
            r = self.http.head(self.param, headers=headers, allow_redirects=False)
            self.url = r.headers['Location']
        except:
            self.url = self.param

    def __url2id(self):
        try:
            self.id = urlparse(self.url).path.split('/')[3]
        except:
            self.id = ''

    def __url2uid(self):
        try:
            query = urlparse(self.url).query
            self.id = parse_qs(query)['sec_uid'][0]
        except:
            self.id = ''

    def get_sign(self):
        """
        Web signature algorithm; no longer needed, the value is simply fixed
        """
        self.sign = 'TG2uvBAbGAHzG19a.rniF0xtrq'
        return self.sign

    def get_user_info(self):
        """
        Fetch the user information
        The result is stored in self.user_info
        """
        if self.url:
            self.__url2uid()
        url = 'https://www.iesdouyin.com/web/api/v2/user/info/?sec_uid=' + self.id
        try:
            res = self.http.get(url).json()
            info = res.get('user_info', dict())
        except:
            info = dict()
        self.user_info = info
        # Download path ('無昵稱' is the fallback meaning "no nickname")
        username = '{}_{}_{}'.format(self.user_info.get('short_id', '0'),
                                     self.user_info.get('nickname', '無昵稱'), self.type)
        self.download_path = Download.title2path(username)  # illegal characters must be sanitized first

    def get_challenge_info(self):
        """
        Fetch the hashtag challenge information
        The result is stored in self.challenge_info
        """
        if self.url:
            self.__url2id()
        url = 'https://www.iesdouyin.com/web/api/v2/challenge/info/?ch_id=' + self.id
        try:
            res = self.http.get(url).json()
            info = res.get('ch_info', dict())
        except:
            info = dict()
        self.challenge_info = info
        # Hashtag challenge download path ('無標題' is the fallback meaning "untitled")
        username = '{}_{}_{}'.format(self.challenge_info.get('cid', '0'),
                                     self.challenge_info.get('cha_name', '無標題'), self.type)
        self.download_path = Download.title2path(username)  # illegal characters must be sanitized first

    def get_music_info(self):
        """
        Fetch the music (original sound) information
        The result is stored in self.music_info
        """
        if self.url:
            self.__url2id()
        url = 'https://www.iesdouyin.com/web/api/v2/music/info/?music_id=' + self.id
        try:
            res = self.http.get(url).json()
            info = res.get('music_info', dict())
        except:
            info = dict()
        self.music_info = info
        # Music download path ('無標題' is the fallback meaning "untitled")
        username = '{}_{}_{}'.format(self.music_info.get('mid', '0'), self.music_info.get('title', '無標題'),
                                     self.type)
        self.download_path = Download.title2path(username)  # illegal characters must be sanitized first

    def crawling_users_post(self):
        """
        Collect a user's posts
        """
        self.type = 'post'
        self.__crawling_user()

    def crawling_users_like(self):
        """
        Collect a user's liked posts
        """
        self.type = 'like'
        self.__crawling_user()

    def crawling_challenge(self):
        """
        Collect a hashtag challenge
        """
        self.type = 'challenge'
        self.get_challenge_info()  # Fetch the challenge info, used for the download directory

        # https://www.iesdouyin.com/web/api/v2/challenge/aweme/?ch_id=1570693184929793&count=9&cursor=9&aid=1128&screen_limit=3&download_click_limit=0&_signature=AXN-GQAAYUTpqVxkCT6GHQFzfg
        url = 'https://www.iesdouyin.com/web/api/v2/challenge/aweme/'

        cursor = '0'
        while self.has_more:
            params = {
                "ch_id": self.id,
                "count": "21",  # can be increased; original value: 9
                "cursor": cursor,
                "aid": "1128",
                "screen_limit": "3",
                "download_click_limit": "0",
                "_signature": self.sign
            }
            try:
                res = self.http.get(url, params=params).json()
                cursor = res['cursor']
                self.has_more = res['has_more']
                self.__append_videos(res)
            except:
                print('Hashtag challenge collection failed')
        print('Hashtag challenge collection finished')

    def crawling_music(self):
        """
        Collect a music track (original sound)
        """
        self.type = 'music'
        self.get_music_info()  # Fetch the music info, used for the download directory

        # https://www.iesdouyin.com/web/api/v2/music/list/aweme/?music_id=6928362875564067592&count=9&cursor=18&aid=1128&screen_limit=3&download_click_limit=0&_signature=5ULmIQAAhRYNmMRcpDm2COVC5j
        url = 'https://www.iesdouyin.com/web/api/v2/music/list/aweme/'

        cursor = '0'
        while self.has_more:
            params = {
                "music_id": self.id,
                "count": "21",  # can be increased; original value: 9
                "cursor": cursor,
                "aid": "1128",
                "screen_limit": "3",
                "download_click_limit": "0",
                "_signature": self.sign
            }
            try:
                res = self.http.get(url, params=params).json()
                cursor = res['cursor']
                self.has_more = res['has_more']
                self.__append_videos(res)
            except:
                print('Music collection failed')
        print('Music collection finished')

    def __crawling_user(self):
        """
        Collect a user's posts or liked posts
        """
        self.get_user_info()  # Fetch the user info; the nickname is used for the download directory

        max_cursor = 0
        # https://www.iesdouyin.com/web/api/v2/aweme/like/?sec_uid=MS4wLjABAAAAaJO9L9M0scJ_njvXncvoFQj3ilCKW1qQkNGyDc2_5CQ&count=21&max_cursor=0&aid=1128&_signature=2QoRnQAAuXcx0DPg2DVICdkKEY&dytk=
        # https://www.iesdouyin.com/web/api/v2/aweme/post/?sec_uid=MS4wLjABAAAAaJO9L9M0scJ_njvXncvoFQj3ilCKW1qQkNGyDc2_5CQ&count=21&max_cursor=0&aid=1128&_signature=DrXeeAAAbwPmb.wFM3e63w613m&dytk=
        url = 'https://www.iesdouyin.com/web/api/v2/aweme/{}/'.format(self.type)

        while self.has_more:
            params = {
                "sec_uid": self.id,
                "count": "21",
                "max_cursor": max_cursor,
                "aid": "1128",
                "_signature": self.sign,
                "dytk": ""
            }
            try:
                res = self.http.get(url, params=params).json()
                max_cursor = res['max_cursor']
                self.has_more = res['has_more']
                self.__append_videos(res)
            except:
                print('Post collection failed')
        print('Post collection finished')

    def __append_videos(self, res):
        """
        Store the collected data
        """
        if res.get('aweme_list'):
            for item in res['aweme_list']:
                info = item['statistics']
                info.pop('forward_count')
                info.pop('play_count')
                info['desc'] = Download.title2path(item['desc'])  # illegal characters must be sanitized first
                info['uri'] = item['video']['play_addr']['uri']
                info['play_addr'] = item['video']['play_addr']['url_list'][0]
                info['dynamic_cover'] = item['video']['dynamic_cover']['url_list'][0]
                info['status'] = 0  # download progress: waiting = 0, downloading = 0.xx, finished = 1

                # list format
                self.videosL.append(info)
                # dict format
                # self.videos[info['aweme_id']] = info

                # Download tasks could be added right here, but downloading uses bandwidth and slows collection, so downloads start after collection finishes
            if self.limit:
                more = len(self.videosL) - self.limit  # self.videos is commented out, so count the list instead
                if more >= 0:
                    # If a collection limit was given, drop the excess and stop
                    self.has_more = False
                    # list format
                    self.videosL = self.videosL[:self.limit]
                    # dict format
                    # for i in range(more):
                    #     self.videos.popitem()
                    # return

        else:  # Reached when more posts are expected but no data was returned
            print('Collection is not finished, but the returned post list is empty')

    def download_all(self):
        """
        After all posts are collected, add the download tasks in one pass
        A callback can optionally be registered externally to monitor download task status
        """
        for id, video in enumerate(self.videosL):
            # for id, video in self.videos.items():
            gid = self.aria2.download(url=video['play_addr'],
                                      filename='{}/{}_{}.mp4'.format(self.download_path, video['aweme_id'],
                                                                     video['desc'])
                                      # ,options={'gid': id}  # specify the gid
                                      )
            self.gids[gid] = id  # A custom gid must be 16 characters, so none is specified; a separate dict keeps the mapping
        print('Download tasks submitted')
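
The code above relies on `Download.title2path`, which lives in download.py and is not shown here. A rough idea of what such a sanitizer could look like is sketched below. The character set, the replacement character, the length cap, and the `untitled` fallback are all assumptions rather than the project's actual behavior.

```python
import re

# Characters that Windows forbids in file names, plus line breaks and tabs.
_ILLEGAL = re.compile(r'[\\/:*?"<>|\r\n\t]')


def title2path(title: str, max_length: int = 80) -> str:
    """Turn a post title into a string that is safe to use as a file name."""
    cleaned = _ILLEGAL.sub("_", title)  # replace each illegal character
    cleaned = cleaned.strip(" .")       # trailing dots and spaces are a problem on Windows
    return cleaned[:max_length] or "untitled"


print(title2path("a*b|c"))
```

Sanitizing before the file name is handed to Aria2 keeps both the directory and the file creation from failing on titles that contain such characters.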
