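Data source: 應用寶 (Yingyongbao, the app store at sj.qq.com)
Development environment: Windows 10, Python 3.7
Development tools: PyCharm, Chrome
Data to collect:
- the app's download URL
- the app's download count
- the app's name
- the app's developer (company)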
Extract the category tags from the page.
Read the href attribute of each a tag.
These values are used later to build the dynamic URL.
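The href extraction can be tried offline before touching the live site. The following is a minimal sketch, assuming the category list looks roughly like the made-up markup in SAMPLE_HTML (the real page may differ, and categoryId=122 is invented for illustration). It uses the standard-library html.parser rather than lxml so that it runs without extra packages; HrefCollector and collect_hrefs are hypothetical helper names.

```python
from html.parser import HTMLParser

# Made-up stand-in for the category list; the real page markup may differ.
SAMPLE_HTML = """
<ul data-modname="cates">
  <li><a href="?orgame=1&amp;categoryId=-10">Category A</a></li>
  <li><a href="?orgame=1&amp;categoryId=122">Category B</a></li>
  <li><a>No link here</a></li>
</ul>
"""


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:  # skip <a> tags without an href
            self.hrefs.append(href)


def collect_hrefs(html_text):
    """Return the href values found in html_text, in document order."""
    parser = HrefCollector()
    parser.feed(html_text)
    return parser.hrefs


print(collect_hrefs(SAMPLE_HTML))
```

Each collected value is a query-string fragment, which is why it can later be appended directly to the app list address.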
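Find the URL from which the app data is loaded dynamically.
The first part of its query string is the href value of each category tag, for example:
https://sj.qq.com/myapp/cate/appList.htm?orgame=1&categoryId=-10&pageSize=20&pageContext=undefined
Build a new URL for each category and page, then send the request.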
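The URL construction can also be written with urllib.parse instead of string concatenation. This is only a sketch: build_list_url is a hypothetical helper, and treating pageContext as a record offset (0, 20, 40, ...) is an assumption taken from how the script below steps through pages, not something confirmed by the site.

```python
from urllib.parse import parse_qsl, urlencode

BASE = "https://sj.qq.com/myapp/cate/appList.htm"


def build_list_url(category_href, offset, page_size=20):
    """Combine a category href such as '?orgame=1&categoryId=-10'
    with paging parameters into a full app list URL."""
    # Split the href's query string into (key, value) pairs.
    params = parse_qsl(category_href.lstrip("?"))
    # Assumption: pageContext is the offset of the first record on the page.
    params += [("pageSize", page_size), ("pageContext", offset)]
    return BASE + "?" + urlencode(params)


for page in range(3):
    print(build_list_url("?orgame=1&categoryId=-10", page * 20))
```

Building the query from pairs keeps the parameters properly encoded if a category value ever contains characters that need escaping.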
import requests  # send HTTP requests
from lxml import etree  # parse HTML into an element tree
import csv  # write tabular data

url = "https://sj.qq.com/myapp/category.htm?orgame=1"
response = requests.get(url)
html_data = etree.HTML(response.text)
# href attribute of every category link
li_list = html_data.xpath('//ul[@data-modname="cates"][position()>1]/a/@href')
del li_list[-1]  # drop the last entry
for url1 in li_list:
    # up to 10 pages of 20 apps per category
    for i in range(10):
        new_url = "https://sj.qq.com/myapp/cate/appList.htm" + url1 + "&pageSize=20&pageContext={}".format(i * 20)
        res = requests.get(new_url).json()
        if res["count"] == 0:
            break  # no more data in this category
        with open("應用寶.csv", "a", newline="", encoding="utf-8") as f:
            csv_data = csv.DictWriter(f, fieldnames=["appName", "authorName", "apkUrl"])
            for info in res["obj"]:
                appName = info['appName']
                authorName = info['authorName']
                apkUrl = info['apkUrl']
                print({"appName": appName, "authorName": authorName, "apkUrl": apkUrl})
                csv_data.writerow({"appName": appName, "authorName": authorName, "apkUrl": apkUrl})
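The script above appends rows without a header and assumes every record carries all three keys. The sketch below shows one way to separate the parsing from the writing so both can be checked without a network connection. rows_from_response and write_rows are hypothetical helpers, and the sample payload is made up; only the key names count, obj, appName, authorName and apkUrl come from the script.

```python
import csv
import io

FIELDS = ["appName", "authorName", "apkUrl"]


def rows_from_response(res):
    """Return one dict per app, keeping only the keys listed in FIELDS."""
    items = res.get("obj") or []  # treat a missing or null list as empty
    return [{key: item.get(key, "") for key in FIELDS} for item in items]


def write_rows(rows, out, write_header=False):
    """Write rows to an open text file or buffer as CSV."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    if write_header:
        writer.writeheader()
    writer.writerows(rows)


# Made-up payload shaped like the fields the script reads.
sample = {
    "count": 1,
    "obj": [
        {
            "appName": "Demo",
            "authorName": "Demo Co.",
            "apkUrl": "https://example.com/demo.apk",
            "extra": 1,  # ignored, not in FIELDS
        }
    ],
}

buf = io.StringIO(newline="")
write_rows(rows_from_response(sample), buf, write_header=True)
print(buf.getvalue())
```

With a real file, passing write_header=True only for the first page would give the CSV a single header row.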