Using GitHub Actions to automatically sync your latest cnblogs posts to your GitHub profile page

Reposted article (see the original). 2020-07-21 09:10

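Repository link: if you are comfortable working hands-on, you can read the code and adapt it directly.

Create a repository on GitHub with the same name as your account. For example, my id is realzhaijiayu, so I created a repository named realzhaijiayu; its README is rendered directly on my GitHub profile page.

To have the profile page automatically list links to your latest cnblogs posts, you can use GitHub Actions, the CI tool built into GitHub.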

Overall approach:

Use Python to scrape the post links from cnblogs
Write the GitHub Actions configuration script

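Writing the crawler

First, analyze the cnblogs page (using cnblogs.com/realzhaijiayu as an example).

This is the HTML structure of a single post; the boxed content is the information we want (the post date, title, and link).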
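As a rough sketch of the markup involved, the snippet below uses a hypothetical sample reconstructed from the CSS selectors in the script (`div.day`, `div.dayTitle a`, `a.postTitle2`); the real page contains more wrappers and attributes. It pulls out the date, title, and link with only the standard library's `html.parser`, which is a swapped-in technique to keep the example dependency-free and is not what the script in this post does. `SAMPLE_HTML`, `PostLinkParser`, and `parse_posts` are illustrative names.

```python
from html.parser import HTMLParser

# Hypothetical sample markup (an assumption, not copied from the live page).
SAMPLE_HTML = """
<div class="day">
  <div class="dayTitle"><a href="https://www.cnblogs.com/example/archive/2020/07/21.html">2020-07-21</a></div>
  <div class="postTitle">
    <a class="postTitle2" href="https://www.cnblogs.com/example/p/1.html">
      First post
    </a>
  </div>
  <div class="postTitle">
    <a class="postTitle2" href="https://www.cnblogs.com/example/p/2.html">Second post</a>
  </div>
</div>
"""


class PostLinkParser(HTMLParser):
    """Collect (date, title, href) for every <a class="postTitle2"> link."""

    def __init__(self):
        super().__init__()
        self.entries = []
        self._date = ''              # text of the most recent dayTitle link
        self._in_day_title = False   # True between <div class="dayTitle"> and its link
        self._capture = None         # 'date' or 'title' while inside a relevant <a>
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if tag == 'div' and 'dayTitle' in classes:
            self._in_day_title = True
        elif tag == 'a' and 'postTitle2' in classes:
            self._capture, self._href, self._text = 'title', attrs.get('href'), []
        elif tag == 'a' and self._in_day_title:
            self._capture, self._text = 'date', []

    def handle_data(self, data):
        if self._capture:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._capture:
            text = ''.join(self._text).strip()
            if self._capture == 'date':
                self._date = text
                self._in_day_title = False
            else:
                self.entries.append((self._date, text, self._href))
            self._capture = None


def parse_posts(html):
    """Return a list of (date, title, href) tuples found in `html`."""
    parser = PostLinkParser()
    parser.feed(html)
    return parser.entries


if __name__ == '__main__':
    for date, title, href in parse_posts(SAMPLE_HTML):
        print(date, title, href)
```

The amount of state tracking needed here is a fair indication of why a selector-based library is the more comfortable choice for this job.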

With Python's BeautifulSoup library, this information can be extracted quickly.

'''
Crawl the posts of a given cnblogs author.
'''
from bs4 import BeautifulSoup
import requests


def get_bs(author, page=1):
    '''
    Takes the author's cnblogs id and a page number (defaults to the first page).
    The recursive call that follows the "next page" link is commented out below.
    '''
    r = requests.get(f'https://www.cnblogs.com/{author}/default.html?page={page}', timeout=30)
    r.raise_for_status()  # stop here on an HTTP error instead of overwriting README.md
    soup = BeautifulSoup(r.content, 'html5lib')
    # print(f'Page {page}:')
    data_print(soup)
    # Follow the next-page link if there is one. Note that data_print opens
    # README.md in 'w' mode, so each page would overwrite the previous one.
    # if soup.select(f'a[href="https://www.cnblogs.com/{author}/default.html?page={page+1}"]'):
    #     get_bs(author, page+1)


def data_print(soup):  # the way the post links are displayed could be refined here
    '''
    Use CSS selectors to write every date and post title to README.md
    as a Markdown list.
    '''
    with open('README.md', 'w', encoding='utf-8') as f:
        for day in soup.select('div.day'):
            for date in day.select('div.dayTitle a'):  # one date per day
                for article in day.select('a.postTitle2'):  # a day may have several posts
                    print('- ', date.text, ' ', '[', article.get_text().strip(), '](', article.get('href'), ')', sep='', file=f)


if __name__ == "__main__":
    get_bs('realzhaijiayu')

The script can be extended to crawl multiple pages, but I only want the most recent posts, so the first page is enough.
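If more than one page were ever wanted, one possible arrangement is to collect the entries from all pages first and write the file once. The sketch below assumes a hypothetical `fetch_page` callable that returns a list of `(date, title, url)` tuples for a page number; `collect_entries` and `render_readme` are illustrative names, not part of the original script. The output lines follow the same `- date [title](url)` format the script prints.

```python
def collect_entries(fetch_page, max_pages=3):
    """Gather (date, title, url) tuples page by page.

    `fetch_page` is a hypothetical callable: page number -> list of entries.
    Stops at the first empty page or after `max_pages` pages.
    """
    entries = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:
            break
        entries.extend(batch)
    return entries


def render_readme(entries, limit=10):
    """Format the first `limit` entries as Markdown list items."""
    lines = [f'- {date} [{title}]({url})' for date, title, url in entries[:limit]]
    return '\n'.join(lines) + '\n'


if __name__ == '__main__':
    # Stand-in data instead of real HTTP requests.
    fake_pages = {
        1: [('2020-07-21', 'Newest post', 'https://example.com/p/3')],
        2: [('2020-07-20', 'Older post', 'https://example.com/p/2')],
    }
    entries = collect_entries(lambda page: fake_pages.get(page, []))
    print(render_readme(entries, limit=5), end='')
```

Keeping the fetching and the formatting separate also makes it easy to cap the README at a fixed number of recent posts.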

Writing the Actions configuration file

# This is a basic workflow to help you get started with Actions

name: Refresh

# Controls when the workflow runs: on every push and on an hourly schedule
on:
  push:
  schedule:
    - cron: '0 * * * *'  # minute 0 of every hour (UTC)

# A workflow run is made up of one or more jobs that can run sequentially or in parallel
jobs:
  # This workflow contains a single job called "build"
  build:
    # The type of runner that the job will run on
    runs-on: ubuntu-latest

    # Steps represent a sequence of tasks that will be executed as part of the job
    steps:
    # Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
    - uses: actions/checkout@v2
    - name: Set up Python 3.8
      uses: actions/setup-python@v2
      with:
        python-version: '3.8'
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
    # Runs a set of commands using the runners shell
    - name: run script
      run: |
        python3 refresh.py
    - name: Commit files
      run: |
        git config --global user.email "realzhaijiayu@gmail.com"
        git config --global user.name "realzhaijiayu"
        git commit -m "update" -a || exit 0
    - name: Push changes
      uses: ad-m/github-push-action@master
      with:
        github_token: ${{ secrets.GITHUB_TOKEN }}

Note that the Python script has three dependencies:

requests
bs4
html5lib

These need to be listed in requirements.txt.

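Result:

Do it yourself

You can clone the repository, modify a few things, and push it to your own repository named after your account.

The parts that need changing are:

The cnblogs id (refresh.py)
The email and user name in the git configuration (.github/workflows/actions.yml)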

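Summary

When I first started scraping the cnblogs links, I only knew how to write shell scripts, which made it rather awkward: just extracting the links took a lot of sed, and it did not feel elegant at all.

Later I saw a crawler someone else had written in Python. The code was concise and the logic very clear; even as someone who had never learned Python, I had no trouble reading it. It is this simple in Python because the BeautifulSoup library is so convenient: you do not need to slice up the HTML tags yourself, you just specify the tags and it extracts the information for you. It feels great!

After looking into it further, I found that many small tools are written in Python, so it seems Python is worth learning.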
