Qupai (曲牌) names

Data source

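I had already scraped Yuan-dynasty poetry. Qupai (tune names) are most closely associated with the Yuan dynasty, following the familiar sequence of Tang shi, Song ci, and Yuan qu, so that dataset is the source.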

Collecting the qupai names

import pandas as pd
import xlwt


def read(file):
    # Read the scraped Yuan-dynasty poems
    data = pd.read_excel(file)
    title = data.title
    # List that collects the qupai names
    qu_list = []
    for it in title:
        if it.find('·') != -1:
            # Titles look like "qupai·title"; the part before '·' is the qupai name
            qu = it.split('·')
            qu_list.append(qu[0])
    # Remove duplicates
    new_qu = list(set(qu_list))
    # Save the qupai names
    xl = xlwt.Workbook()
    # Call the workbook's add_sheet method
    sheet1 = xl.add_sheet('sheet1', cell_overwrite_ok=True)
    sheet1.write(0, 0, "qu_name")
    for i in range(0, len(new_qu)):
        sheet1.write(i + 1, 0, new_qu[i])
    # Note: xlwt always writes the legacy .xls format, whatever the extension
    xl.save("qupai_name.xlsx")


if __name__ == '__main__':
    file = 'data/yuan.xlsx'
    read(file)
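The core of the script is the title split: a title written as `qupai·title` yields the qupai name as the text before the first `·`. Below is a minimal sketch of just that step, without pandas or Excel. The helper name `extract_qupai` and the sample titles are illustrative and not part of the original script. The sketch also skips non-string values, which would otherwise raise an error on `.find`.

```python
def extract_qupai(titles):
    """Return the sorted, de-duplicated qupai names found in `titles`.

    Hypothetical helper: assumes a title looks like "qupai·title".
    """
    names = set()
    for title in titles:
        # Skip empty cells (pandas yields NaN floats for them) and plain titles
        if not isinstance(title, str) or '·' not in title:
            continue
        # Keep only the part before the first separator
        names.add(title.split('·', 1)[0].strip())
    return sorted(names)


# Illustrative titles; the last two have no qupai prefix and are ignored
sample = ['天淨沙·秋思', '山坡羊·潼關懷古', '天淨沙·春', '無題', None]
print(extract_qupai(sample))
```

Using a set from the start avoids building the full list first, which is the same de-duplication the script does with `list(set(qu_list))`.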

Results

Feihualing (飛花令)

Data source

The classical poetry site 古詩詞網 has a dedicated list of feihualing keywords, so that is our source.

Scraping the feihualing keywords

import requests
import xlwt
from bs4 import BeautifulSoup

# Request headers
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}

url = 'https://www.xungushici.com/feihualings'
r = requests.get(url, headers=headers)
content = r.content.decode('utf-8')
soup = BeautifulSoup(content, 'html.parser')
# The keywords sit in a <ul> of <li> badges, each wrapping an <a> with the keyword text
ul = soup.find('ul', class_='list-unstyled d-flex flex-row flex-wrap align-items-center w-100')
li_list = ul.find_all('li', class_='m-1 badge badge-light')
word = []
for it in li_list:
    word.append(it.a.text)

xl = xlwt.Workbook()
# Call the workbook's add_sheet method
sheet1 = xl.add_sheet('sheet1', cell_overwrite_ok=True)
sheet1.write(0, 0, "word")
for i in range(0, len(word)):
    sheet1.write(i + 1, 0, word[i])
# Note: xlwt always writes the legacy .xls format, whatever the extension
xl.save("word.xlsx")

Results

Verses matched to feihualing keywords

Approach

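Iterate over the 500,000 scraped classical poems and check whether each sentence contains any of the feihualing keywords. If it does, store the sentence, the author, the poem title, and the matching keywords.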
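The matching step can be sketched on a single poem, independent of the Excel input and output. The helper name `match_keywords` and the sample poem are illustrative. Splitting on several sentence-ending marks is an assumption added here, since the script itself splits on `。` only.

```python
import re

# Sentence-ending punctuation used for splitting.
# Assumption: the original script splits on '。' only.
SENTENCE_END = re.compile(r'[。！？；]')


def match_keywords(content, words):
    """Return (sentence, matched_keywords) pairs for sentences containing any keyword."""
    matches = []
    for sentence in SENTENCE_END.split(content.replace('\n', '')):
        # Keep the keywords in the order they appear in `words`
        hits = [w for w in words if w in sentence]
        if sentence and hits:
            matches.append((sentence, ','.join(hits)))
    return matches


poem = '床前明月光，疑是地上霜。\n舉頭望明月，低頭思故鄉。'
for sentence, keys in match_keywords(poem, ['月', '花', '霜']):
    print(sentence, '->', keys)
```

Each sentence is checked against every keyword, so the cost grows with the number of sentences times the number of keywords. That is acceptable for a keyword list of this size, but it is the reason the full run over all poems takes a while.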

BUG

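xlwt writes the legacy .xls format, which holds at most 65,536 rows per sheet, whereas openpyxl writes .xlsx, which holds about one million rows (1,048,576) per sheet. Our verse data is too large for the .xls limit, so we use openpyxl for storage.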
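If a result set could exceed even the .xlsx limit, one option is to split the rows into sheet-sized chunks before writing. The sketch below shows only the chunking arithmetic. `chunk_rows` is a hypothetical helper that the original script does not use, since it writes one output file per input file instead.

```python
XLS_MAX_ROWS = 65_536        # per-sheet row limit of the legacy .xls format
XLSX_MAX_ROWS = 1_048_576    # per-sheet row limit of the .xlsx format


def chunk_rows(rows, max_rows=XLSX_MAX_ROWS, header_rows=1):
    """Yield successive slices of `rows` that fit in one sheet alongside a header."""
    size = max_rows - header_rows
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# 150,000 dummy rows split to fit the .xls limit, one header row per sheet
rows = list(range(150_000))
print([len(chunk) for chunk in chunk_rows(rows, XLS_MAX_ROWS)])
```

Each chunk could then go to its own sheet or its own file.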

Code

import os

import openpyxl
import pandas as pd


# Read the feihualing keywords
def read_word():
    data = pd.read_excel('data2/word.xlsx')
    words = data.word
    return words


# Scan the verses of one file and save the matches
def read(file, words, write_file):
    data = pd.read_excel(file)
    title = data.title
    content = data.content
    author = data.author
    ans_sentens = []
    ans_author = []
    ans_title = []
    ans_key = []
    for i in range(len(title)):
        print("Poem #" + str(i))
        cont = content[i]
        aut = author[i]
        tit = title[i]
        # Split the poem into individual sentences
        sents = cont.replace('\n', '').split('。')
        for it in sents:
            key_list = []
            for k in words:
                if it.find(k) != -1:
                    key_list.append(k)
            if len(key_list) != 0:
                ans_sentens.append(it)
                ans_author.append(aut)
                ans_title.append(tit)
                ans_key.append(",".join(key_list))
    # Store the sentence, author, title and matching keywords
    xl = openpyxl.Workbook()
    # Insert a new sheet at position 0 with create_sheet
    sheet1 = xl.create_sheet(index=0)
    sheet1.cell(1, 1, "sentens")
    sheet1.cell(1, 2, "author")
    sheet1.cell(1, 3, "title")
    sheet1.cell(1, 4, "keys")
    for i in range(0, len(ans_key)):
        sheet1.cell(i + 2, 1, ans_sentens[i])
        sheet1.cell(i + 2, 2, ans_author[i])
        sheet1.cell(i + 2, 3, ans_title[i])
        sheet1.cell(i + 2, 4, ans_key[i])
    xl.save(write_file)
    print("Saved to " + write_file)


# List the Excel files in a folder
def get_filename(path, filetype):  # path and file extension, e.g. '.xlsx'
    name = []
    for root, dirs, files in os.walk(path):
        for i in files:
            if os.path.splitext(i)[1] == filetype:
                name.append(i)
    return name  # list of file names, extension included


if __name__ == '__main__':
    file = 'data/'
    words = read_word()
    # Renamed from `list` to avoid shadowing the built-in
    files = get_filename(file, '.xlsx')
    # Make sure the output folder exists before saving
    os.makedirs("sentences", exist_ok=True)
    for i in range(len(files)):
        new_file = file + files[i]
        print(new_file)
        sentences_file = "sentences/sentence" + str(i + 1) + ".xlsx"
        read(new_file, words, sentences_file)

Results

Tomorrow's tasks

First, learn the common Chinese word-segmentation tools, extract the relevant entities, and try a small demo.

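Then use Chinese word segmentation to split each poet's biography into segments and pull out four kinds of key information: people, times, events, and places. The goal is to turn the free text into a regular, structured data format.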
