Notes:
1. Data source: Web of Science (WoS) bibliographic records
2. Python reads the data stored in an Excel file
3. The text in TI (title) and AB (abstract) is analyzed via sentence segmentation, tokenization, stop-word removal, and lemmatization
4. Libraries offering LDA include sklearn and gensim (the latter is used here). sklearn's LatentDirichletAllocation trains with variational EM (batch or online), while gensim's LdaModel uses online variational Bayes; Gibbs-sampling MCMC implementations are found in other tools such as MALLET.
Working with the data in Excel
# Read the Excel data
from pprint import pprint
import xlrd

path = r"D:\02-1python\2020.08.11-lda\data\2010-2011\usa\us1.xlsx"  # change this path as needed
data = xlrd.open_workbook(path)
# First column: title; second column: abstract
sheet_1_by_index = data.sheet_by_index(0)
title = sheet_1_by_index.col_values(0)
abstract = sheet_1_by_index.col_values(1)
n_of_rows = sheet_1_by_index.nrows
doc_set = []  # empty list to collect one document per record
for i in range(1, n_of_rows):  # read row by row, skipping the header row
    doc_set.append(title[i] + '. ' + abstract[i])
doc_set[0]
'The impact of supply chain integration on performance: A contingency and configuration approach.This study extends the developing body of literature on supply chain integration (SCI), which is the degree to which a manufacturer strategically collaborates with its supply chain partners and collaboratively manages intra- and inter-organizational processes, in order to achieve effective and efficient flows of products and services, information, money and decisions, to provide maximum value to the customer. The previous research is inconsistent in its findings about the relationship between SCI and performance. We attribute this inconsistency to incomplete definitions of SCI, in particular, the tendency to focus on customer and supplier integration only, excluding the important central link of internal integration. We study the relationship between three dimensions of SCI, operational and business performance, from both a contingency and a configuration perspective. In applying the contingency approach, hierarchical regression was used to determine the impact of individual SCI dimensions (customer, supplier and internal integration) and their interactions on performance. In the configuration approach, cluster analysis was used to develop patterns of SCI, which were analyzed in terms of SCI strength and balance. Analysis of variance was used to examine the relationship between SCI pattern and performance. The findings of both the contingency and configuration approach indicated that SCI was related to both operational and business performance. Furthermore, the results indicated that internal and customer integration were more strongly related to improving performance than supplier integration. (C) 2009 Elsevier B.V. All rights reserved.'
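A compatibility note: xlrd 2.0 and later dropped .xlsx support (it reads only .xls), so on newer environments the same loading step can be done with pandas (which uses openpyxl for .xlsx). A minimal sketch, assuming the first row is a header, as the range(1, n_of_rows) loop above implies:

import pandas as pd

path = r"D:\02-1python\2020.08.11-lda\data\2010-2011\usa\us1.xlsx"  # same path as above
df = pd.read_excel(path)  # requires openpyxl; the first row is consumed as the header, matching the skipped row above
doc_set = (df.iloc[:, 0].astype(str) + '. ' + df.iloc[:, 1].astype(str)).tolist()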
The documents can be used straight from the list; if you also need to save a .txt copy locally, see below.
# Save as .txt under the given path
file_path = 'D:/02-1python/2020.08.11-lda/data/2010-2011/china/2695.txt'
with open(file_path, 'a') as file_handle:  # the .txt file does not have to exist beforehand; it is created automatically
    file_handle.write(str(doc_set[0:]))  # write the data
    file_handle.write('\n')  # when writing inside a loop, append a newline so new data does not run into the previous record
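As written, str(doc_set[0:]) dumps the whole list as one Python literal. If the goal is one record per line (the looping case the comment above mentions), a small sketch:

with open(file_path, 'a') as file_handle:
    for doc in doc_set:
        file_handle.write(doc + '\n')  # one document per line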
Data preprocessing
import nltk
# sentence segmentation
from nltk.tokenize import sent_tokenize
# tokenization
from nltk.tokenize import word_tokenize
# stop-word removal
from nltk.corpus import stopwords
# lemmatization
from nltk.stem import WordNetLemmatizer
# stemming
from nltk.stem.porter import PorterStemmer
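The tokenizer, stop-word list, and WordNet lemmatizer each rely on NLTK data packages that must be downloaded once per machine; a one-time setup sketch using the standard NLTK resource names:

import nltk

nltk.download('punkt')      # models behind sent_tokenize / word_tokenize
nltk.download('stopwords')  # the English stop-word list
nltk.download('wordnet')    # lexical database used by WordNetLemmatizer
nltk.download('omw-1.4')    # extra WordNet data required on newer NLTK versions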
english_stopwords = stopwords.words("english")
# custom list of punctuation tokens to strip
english_punctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '!', '@', '#', '%', '$', '*', "''"]
texts = []
# Process each document
for doc in doc_set:
    # tokenization
    text_list = nltk.word_tokenize(doc)
    # stop-word removal, pass 1: the standard English list
    text_list0 = [word for word in text_list if word not in english_stopwords]
    # stop-word removal, pass 2: custom words I chose to drop (e.g., the years covered)
    english_stopwords2 = ['c', 'also', '2009', '2010', '2011', "'s"]  # edit this custom stop-word list as needed
    text_list1 = [word for word in text_list0 if word not in english_stopwords2]
    # remove punctuation
    text_list2 = [word for word in text_list1 if word not in english_punctuations]
    # lemmatization
    text_list3 = [WordNetLemmatizer().lemmatize(word) for word in text_list2]
    # stemming
    text_list4 = [PorterStemmer().stem(word) for word in text_list3]
    # the fully processed result for each document is collected in texts[]
    texts.append(text_list4)
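One caveat worth noting: stopwords.words("english") contains only lowercase entries while word_tokenize preserves case, so capitalized stop words such as "The" slip through the filter above. A hedged refinement, not part of the original pipeline, is to lowercase before filtering:

# lowercase tokens first so that capitalized stop words are also removed
text_list = [word.lower() for word in nltk.word_tokenize(doc)]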
# Count the documents (here, the number of papers)
M = len(texts)
print('Number of documents: %d' % M)
The LDA part
# Build the document-term matrix with the gensim library
import gensim
from gensim import corpora

# build the dictionary, storing all the tokens we just processed
dictionary = corpora.Dictionary(texts)
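Optionally, the dictionary can be pruned before building the corpus; gensim's filter_extremes drops very rare and very common tokens. The thresholds below are illustrative choices, not from the original run:

# keep tokens appearing in at least 5 documents and in at most 50% of all documents
dictionary.filter_extremes(no_below=5, no_above=0.5)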
# Build the document-term matrix; the result is a bag-of-words corpus, which could
# be further re-weighted with TF-IDF (not used here; see the sketch below)
corpus = [dictionary.doc2bow(text) for text in texts]
print('\nDocument-term matrix:')
# pprint(corpus)
pprint(corpus[0:19])
# for c in corpus:
#     print(c)
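If TF-IDF weighting is wanted, gensim's TfidfModel can wrap the same bag-of-words corpus; a minimal sketch:

from gensim.models import TfidfModel

tfidf = TfidfModel(corpus)                     # fit IDF statistics on the bag-of-words corpus
corpus_tfidf = [tfidf[bow] for bow in corpus]  # re-weighted corpus, same (token_id, weight) format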
# Convert the sparse corpus to a dense document-term matrix
from gensim.matutils import corpus2dense

corpus_matrix = corpus2dense(corpus, len(dictionary))
corpus_matrix.T  # corpus2dense returns a terms-by-documents matrix, so transpose to documents-by-terms
The result looks like [0 1 3 0 2 2; ···].
# Use gensim to create the LDA model object
Lda = gensim.models.ldamodel.LdaModel

# Run and train the LDA model on the document-term matrix
num_topics = 10  # number of topics; this parameter can be tuned
ldamodel = Lda(corpus, num_topics=num_topics, id2word=dictionary, passes=100)  # tunable hyperparameters: number of topics, number of passes

doc_topic = [doc_t for doc_t in ldamodel[corpus]]
print('Document-topic matrix:\n')
# pprint(doc_topic)
pprint(doc_topic[0:19])
# for doc_topic in ldamodel.get_document_topics(corpus):
#     print(doc_topic)

print('Topic-word distributions:\n')
for topic_id in range(num_topics):
    print('Topic', topic_id)
    pprint(ldamodel.show_topic(topic_id))
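LDA training is stochastic, so repeated runs yield different topic assignments. If reproducibility matters, LdaModel accepts a random_state seed (the value 42 below is an arbitrary illustrative choice):

ldamodel = Lda(corpus, num_topics=num_topics, id2word=dictionary, passes=100, random_state=42)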
Which number of topics fits best? Coherence scores or perplexity can be used to decide.
# Coherence score
print('Coherence score:\n')
coherence_model_lda = gensim.models.CoherenceModel(model=ldamodel, texts=texts, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
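To actually pick the number of topics, both metrics can be computed across candidate values of num_topics; a sketch (the range 2-20 and passes=20 are illustrative choices, not from the original run):

for k in range(2, 21, 2):
    model_k = Lda(corpus, num_topics=k, id2word=dictionary, passes=20)
    cm = gensim.models.CoherenceModel(model=model_k, texts=texts, dictionary=dictionary, coherence='c_v')
    # higher coherence is better; log_perplexity returns a per-word likelihood bound (higher is better, i.e., lower perplexity)
    print('k=%2d  coherence=%.4f  log_perplexity=%.4f' % (k, cm.get_coherence(), model_k.log_perplexity(corpus)))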