Using Python Crawler Code to Load Data into Power Query

This article is a repost. 2019-08-26 15:58

In the earlier articles "Power BI Python: Importing Data with Python in Power BI Desktop" and "Power BI Python: How Python Code in Power BI Desktop Uses Power Query Data", we covered the basics of running Python code in Power BI. Today we will run a real Python crawler.

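The goal of this example is to crawl the basic information of every post on my cnblogs (博客園) blog: publication date, publication time, title, view count, and comment count.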

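import requests
from lxml import etree
import pandas as pd

# Paginated index of the blog; the page number is appended to this URL.
base_url = "https://www.cnblogs.com/alexywt/default.html?page="
# Final table holding all results; one row per article.
articles = pd.DataFrame(columns=('day', 'title', 'desc'))


def getArticles(startId, endId):
    # range() excludes endId, so pages startId .. endId-1 are crawled.
    for i in range(startId, endId):
        getArticlesFromPage(i)


def getArticlesFromPage(pageId):
    global articles

    url = base_url + str(pageId)

    resp = requests.get(url)
    html = etree.HTML(resp.text)
    # Each div.day block is treated as one article entry.
    days = html.xpath('//div[@class="day"]')
    for day in days:
        article = getArticle(day)
        if article is not None:
            # DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
            articles = pd.concat([articles, article.to_frame().T], ignore_index=True)


def getArticle(div_day):
    article_title = div_day.xpath('.//div[@class="postTitle"]/a/text()')[0]
    article_title = article_title.replace("\n", "").strip()
    # Skip pinned posts. Both the simplified and the traditional spelling of
    # the marker are checked, since the page may show either form.
    if article_title.startswith(("[置顶]", "[置頂]")):
        return None

    day_title = div_day.xpath('.//div[@class="dayTitle"]/a/text()')[0]
    post_desc = div_day.xpath('.//div[@class="postDesc"]/text()')[0]
    post_desc = post_desc.replace("\n", "")

    # One article becomes one Series whose keys match the DataFrame columns.
    article = pd.Series({
        'day': day_title,
        'title': article_title,
        'desc': post_desc.strip()
    })

    return article


if __name__ == '__main__':
    getArticles(1, 7)
    print(articles)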

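A brief explanation of the code:

1. First, it defines the base URL and a DataFrame object that will hold all of the collected data.

2. It then defines three functions, whose roles are as follows:

Function	Parameters	Purpose
getArticles	startId: page number to start from; endId: page number to stop at (exclusive, because range() is used)	Gets the information of all articles on pages startId through endId - 1
getArticlesFromPage	pageId: page number of the current page	Gets the information of all articles on the given page and drops pinned articles
getArticle	div_day: the div element that contains the current article	Extracts the article information from that div element and stores it in a Series object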
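To make the extraction and the pinned-post filter easier to follow without hitting the network, here is a minimal offline sketch. It is not the code from the article: it swaps lxml for the standard library's xml.etree.ElementTree, and the SAMPLE markup, the text_of helper, and the get_article name are hypothetical, modeled only on the class names that appear in the XPath queries above. The real page structure may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup that mimics the class names queried above.
SAMPLE = """
<div id="main">
  <div class="day">
    <div class="dayTitle"><a href="#">2019-08-26</a></div>
    <div class="postTitle"><a href="#">
      [置顶]Pinned post
    </a></div>
    <div class="postDesc">pinned description</div>
  </div>
  <div class="day">
    <div class="dayTitle"><a href="#">2019-08-20</a></div>
    <div class="postTitle"><a href="#">
      Ordinary post
    </a></div>
    <div class="postDesc">ordinary description</div>
  </div>
</div>
""".strip()

# Marker prefixes for pinned posts (simplified and traditional spellings).
PINNED = ("[置顶]", "[置頂]")


def text_of(parent, path):
    """Return the cleaned text of the first node matching path."""
    node = parent.find(path)
    return (node.text or "").replace("\n", "").strip()


def get_article(div_day):
    """Mirror of getArticle: return a dict, or None for a pinned post."""
    title = text_of(div_day, ".//div[@class='postTitle']/a")
    if title.startswith(PINNED):
        return None
    return {
        "day": text_of(div_day, ".//div[@class='dayTitle']/a"),
        "title": title,
        "desc": text_of(div_day, ".//div[@class='postDesc']"),
    }


root = ET.fromstring(SAMPLE)
# Keep only the entries that are not pinned, as getArticlesFromPage does.
rows = [a for a in map(get_article, root.findall(".//div[@class='day']")) if a is not None]
print(rows)
```

With this sample, only the second entry should survive the filter, which is the same behavior getArticlesFromPage relies on when getArticle returns None.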

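After running the Python code above in Power BI, the imported result is shown in the figure below.

A few Power Query transformations then turn that result into the following: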
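The article performs this reshaping in Power Query. For readers who would rather split the desc column in Python before the data reaches Power Query, the following is a rough sketch of the same idea. The format of the desc text is an assumption here (a "posted @" timestamp followed by two parenthesized counters, views first and comments second), and the sample string, DESC_PATTERN, and parse_desc are hypothetical names and values, not taken from the article.

```python
import re

# Assumed layout of the desc text: "posted @ <date> <time> ... (views) ... (comments)".
DESC_PATTERN = re.compile(
    r"posted @ (?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2})"
    r".*?\((?P<reads>\d+)\).*?\((?P<comments>\d+)\)"
)


def parse_desc(desc):
    """Split a desc string into date, time, views and comments, or return None."""
    m = DESC_PATTERN.search(desc)
    if m is None:
        return None
    return {
        "date": m.group("date"),
        "time": m.group("time"),
        "reads": int(m.group("reads")),
        "comments": int(m.group("comments")),
    }


# Hypothetical desc value, made up for illustration.
sample = "posted @ 2019-08-20 09:30 alexywt 阅读(120) 评论(3) 编辑"
parsed = parse_desc(sample)
print(parsed)
```

Returning None for text that does not match keeps a single unexpected row from aborting the whole import; such rows can then be filtered out or inspected separately.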

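Finally, I built a few charts in Power BI's Power View; the ones I configured here were chosen fairly casually.

Note: once the Python code is integrated into Power BI, whenever a new blog post is published or the view counts increase, you only need to open the Power BI file and refresh the preview to get the updated results.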
