Scraping app information from the Huawei app market with Python + BeautifulSoup

Reposted article, 2018-04-02 23:15, tagged python

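In class today my teacher suddenly sat down next to me and, rather mysteriously, handed me a task: select 100-plus apps by category from the Huawei app market and put each app's name, category, URL, and download count into an Excel file.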

((;¬_¬) Is this my punishment for being late to class today?)

Roughly the information shown in the screenshot.

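After agreeing, I realized that Ctrl+C / Ctrl+V for this much information would be a hassle, and on the way back it occurred to me that a crawler could take care of it _(・ω・｣ ∠)_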

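Pressing F12 shows the class and other attributes of the relevant tags. The download count, however, is just a bare span tag, so I located it by matching its text with a regular expression.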
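To show the text-matching idea in isolation, here is a minimal sketch that runs against a made-up HTML fragment rather than the live site. The fragment, the `app-item` class, the sample count `1234次`, and the helper name `extract_download` are all assumptions for illustration; only the `title` class and the `下載:` label come from the script in this post.

```python
import re
from bs4 import BeautifulSoup

# Made-up fragment that mimics the structure described above:
# the app name sits in a classed element, the download count in a bare <span>.
SAMPLE_HTML = """
<div class="app-item">
  <h4 class="title"><a href="/app/C0000001">Example App</a></h4>
  <span>下載:1234次</span>
</div>
"""

LABEL = "下載:"


def extract_download(container):
    """Return the text after the label, or None if no text node matches."""
    # string=<regex> searches the text nodes below `container`
    node = container.find(string=re.compile(LABEL))
    if node is None:
        return None
    # Keep only what follows the label
    return node.strip().split(LABEL, 1)[1]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
title = soup.find(attrs={"class": "title"})
download = extract_download(title.parent)
print(title.a.text, title.a["href"], download)
```

Returning `None` when nothing matches avoids slicing a missing result, which would otherwise raise a `TypeError` on pages where the label is absent.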

The code is as follows: (..•˘_˘•..)

import xlsxwriter
from bs4 import BeautifulSoup
import re
from urllib import request

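# Write the app name, category, URL, and download count to Excel. The file only needs to be opened once, so file and sheet are defined as globals
def write_excel(name, type_name, url, download):
    # The global row is the current row number; 0-4 are the column indices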
    global row
    sheet.write(row, 0, row)
    sheet.write(row, 1, name)
    sheet.write(row, 2, type_name)
    sheet.write(row, 3, url)
    sheet.write(row, 4, download)
    row += 1


def get_list(url):
    # Build the request for the given URL
    req = request.Request(url)
    # Set the request header
    req.add_header("User-Agent", "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0")
    # Get the response object
    response = request.urlopen(req)
    # Parse the response into a BeautifulSoup object
    soup = BeautifulSoup(response, "html.parser")
    # Find the first tag whose class is "key-select txt-sml" (used as the category name)
    type_name = soup.find(attrs={"class": "key-select txt-sml"})
    # Find all tags with class "title", which hold the app names
    title_divs = soup.find_all(attrs={"class": "title"})
    for title_div in title_divs:
        if title_div.a is not None:
            name = title_div.a.text
            # a['href'] returns the content of the link's href attribute
            app_url = "http://app.hicloud.com" + title_div.a['href']
            # The "下載:" label is 3 characters long, so [3:] drops it and keeps the rest
            download = title_div.parent.find(string=re.compile("下載:"))[3:]
            write_excel(name, type_name.text, app_url, download)

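# Globals: row tracks the current row so it keeps accumulating across calls; file and sheet only need to be created once
row = 1
# Create a new Excel file
file = xlsxwriter.Workbook('applist.xlsx')
# Create a new worksheet
sheet = file.add_worksheet()
if __name__ == '__main__':
    # Only two categories are listed for now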
    url_1 = "http://app.hicloud.com/soft/list_23"
    url_2 = "http://app.hicloud.com/soft/list_24"
    get_list(url_1)
    get_list(url_2)
    file.close()
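Since the script only lists two categories for now, one way to extend it might be to generate the list-page URLs from category IDs. The sketch below assumes other categories follow the same `/soft/list_<id>` pattern as the two URLs above, which would need to be checked on the site; `build_category_urls` is a hypothetical helper, not part of the original script.

```python
BASE_URL = "http://app.hicloud.com"


def build_category_urls(category_ids):
    """Build list-page URLs following the /soft/list_<id> pattern (assumed)."""
    return [f"{BASE_URL}/soft/list_{cid}" for cid in category_ids]


# 23 and 24 are the IDs used in the script; any other ID is unverified.
urls = build_category_urls([23, 24])
for category_url in urls:
    # In the script above, each URL would be passed to get_list(category_url)
    print(category_url)
```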

A partial screenshot of the result: ヾ(*´▽‘*)ﾉ
