Discrete Text Representations (Part 2): One-Hot Encoding a News Corpus

Reposted article. 2019-03-16 22:26. Tags: one-hot / keras / natural language processing

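The previous post introduced three discrete text representations: one-hot, TF-IDF, and n-gram. In this post I work through a small exercise: one-hot encoding news articles.

One-hot encoding text is fairly simple. I tried two approaches: writing it by hand, and using the deep learning framework keras. Along the way I found that although sklearn's OneHotEncoder can one-hot encode feature vectors, it is not suited to raw text.

The code and the news text file are available at https://github.com/DengYangyong/one_hot_distribution (the news file turned out to be huge; I should not have uploaded all of it).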

1. One-hot encoding with sklearn

While reading blog posts about one-hot encoding text, I came across a small example that uses sklearn, and ran the code myself.

The code and its output:

from sklearn import preprocessing
import numpy as np

enc = preprocessing.OneHotEncoder()  # create the encoder
array1 = np.asarray([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
enc.fit(array1)  # learn the set of values seen in each column
array2 = enc.transform([[0, 1, 3]]).toarray()  # encode a new sample and make the sparse result dense
print("Feature vectors of the 4 samples:\n", array1, '\n')
print('One-hot encoding of the 5th sample:', array2)

Output:

Feature vectors of the 4 samples:
 [[0 0 3]
 [1 1 0]
 [0 2 1]
 [1 0 2]] 

One-hot encoding of the 5th sample: [[1. 0. 0. 1. 0. 0. 0. 0. 1.]]

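What does this mean? Take watermelons as an example. There are 4 watermelons, each described by 3 features: {"seeds": ['present', 'absent'], "stem": ['stiff', 'curled', 'withered'], "color": ['white', 'black', 'green', 'yellow']}. The first feature has two possible values, written 0 and 1; the second has three, written 0, 1 and 2; the third has four, written 0 to 3. So array1, a 4*3 array, holds the feature vectors of the 4 watermelons: 4 is the number of samples and 3 is the number of features.

After one-hot encoding, each sample becomes a 9-dimensional vector (2+3+4), that is, the sum of the numbers of possible values over all features. A first feature of 0 is encoded as [1,0], a second feature of 1 as [0,1,0], and a third feature of 3 as [0,0,0,1]. Concatenating these in order gives [1,0,0,1,0,0,0,0,1], which is the one-hot encoding of the fifth watermelon's feature vector.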
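
To make the concatenation concrete, here is a minimal numpy sketch that rebuilds the same 9-dimensional vector without sklearn. The helper name one_hot_row and the list n_values are my own, introduced only for illustration; the sizes 2, 3 and 4 are taken from the watermelon example.

```python
import numpy as np

# Number of possible values for each of the three features (seeds, stem, color).
n_values = [2, 3, 4]

def one_hot_row(sample, n_values):
    """Build one one-hot block per feature and concatenate the blocks."""
    blocks = []
    for value, size in zip(sample, n_values):
        block = np.zeros(size)
        block[value] = 1.0
        blocks.append(block)
    return np.concatenate(blocks)

fifth = one_hot_row([0, 1, 3], n_values)
print(fifth)  # [1. 0. 0. 1. 0. 0. 0. 0. 1.]
```

The result matches the vector that enc.transform returned for the fifth sample.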

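Can this be used to one-hot encode text? No. The example above converts an existing numeric feature vector, with a fixed number of columns, into one-hot form. Text one-hot encoding starts from something like [['天', '氣', '很', '好'], ['我', '愛', '自', '然', '語', '言', '處', '理']] ("the weather is nice", "I love natural language processing"), that is, token lists of varying length. Turning such lists into vectors yields either a term-frequency matrix or, directly, the one-hot matrix. So this sklearn function is not the right tool for text one-hot encoding.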
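
A small sketch of the difference, using the two example sentences. The variable names are mine, and the vocabulary here is simply every distinct character in first-seen order, which is an assumption made for the illustration.

```python
# Two tokenized sentences of different lengths.
sentences = [['天', '氣', '很', '好'],
             ['我', '愛', '自', '然', '語', '言', '處', '理']]

# The rows are ragged, so they cannot be stacked into a samples-by-features array.
lengths = [len(s) for s in sentences]
print(lengths)  # [4, 8]

# Shared vocabulary: one column per distinct character, in first-seen order.
vocab = {}
for sentence in sentences:
    for ch in sentence:
        vocab.setdefault(ch, len(vocab))

# One fixed-length binary vector per sentence: 1 if the character occurs in it.
vectors = [[int(ch in s) for ch in vocab] for s in sentences]
print(vectors[0])  # [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```

Every sentence now has the same length (the vocabulary size), whatever its original length was.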

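2. One-hot encoding text by hand

This time the target is news text. The file cnews.train.txt contains several hundred thousand news articles in 10 categories: ['體育', '財經', '房產', '家居', '教育', '科技', '時尚', '時政', '游戲', '娛樂'] (sports, finance, real estate, home, education, technology, fashion, politics, games, entertainment). Each line of the file is one article: the first two characters are the category, followed by the news content, with "\t" between them. For example, the first line: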

體育    馬曉旭意外受傷讓國奧警惕 無奈大雨格外青睞殷家軍記者傅亞雨沈陽報道 來到沈陽，國奧隊依然沒有擺脫雨水的困擾。......

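Because the file is large, I read only 1000 articles for the one-hot encoding. The process has four steps: turn the text into char-level lists, take the 1000 most frequent characters as the vocabulary, assign an index to each character in the vocabulary, and perform the one-hot encoding.

The final result is a 1000*1000 one-hot array, with one row per article and one column per vocabulary index. Opening the saved file shows a dense field of 0s with 1s scattered here and there, which makes the sparsity of the data plain to see.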

import numpy as np
import os

text_path = os.getcwd() + os.sep + 'cnews' + os.sep
max_news = 1000      # number of articles to read
max_features = 1000  # width of the one-hot matrix (index 0 is reserved)

# Load the Chinese stop-word list used for filtering.
with open("ChineseStopWords.txt", "r", encoding="utf8") as fp_stop:
    dict_stopwords = set(line.strip() for line in fp_stop)

# Turn each article into a char-level list with stop words removed.
# Output: [['馬', '曉', '旭', '意', '外', '受', '傷', '國', '奧', '警', '惕', '', '奈',...],[],...[]]
# Whitespace characters become '' after strip() and stay in the list as an empty token.
news_list = []
with open(text_path + 'cnews.train.txt', 'r', encoding='utf8') as fp:
    for text in fp:
        if len(news_list) >= max_news:  # keep only the first 1000 articles
            break
        if text.strip():
            label, news = text.strip().split('\t', 1)  # separate the category label from the text
            # Split the text into single characters and filter out stop words.
            news_list.append([char.strip() for char in news if char not in dict_stopwords])
news_count = len(news_list)
print("Read " + str(news_count) + " news articles from cnews.train.txt.\n")

# Count how often each character occurs.
# Output: {'馬': 1379, '曉': 31, '旭': 25, '意': 855, '外': 1218, '受': 872, '傷': 1169, '國': 1046, ...}
token_count = {}
for list_ in news_list:
    for word in list_:
        token_count[word] = token_count.get(word, 0) + 1

# Sort the characters by frequency, highest first.
# dict.items() yields (character, count) tuples; the sort key is the count.
token_tuple = sorted(token_count.items(), key=lambda item: item[1], reverse=True)
# Keep the most frequent characters and turn them back into a dict. Index 0 is
# reserved, so a matrix with max_features columns has room for max_features - 1 characters.
# token_sort output: {'分': 8759, '球': 8193, '場': 7862, '斯': 7731, '賽': 7040, '中': 6537, '隊': 6169, ...}
token_sort = dict(token_tuple[:max_features - 1])

# Assign an index to each character, starting at 1. Leaving index 0 unassigned
# seems to be a deep learning convention.
# Output: {'分': 1, '球': 2, '場': 3, '斯': 4, '賽': 5, '中': 6, '隊': 7, '時': 8, '出': 9, '籃': 10, '火': 11, '次': 12, '前': 13, ...}
token_index = {char: i for i, char in enumerate(token_sort, start=1)}

# One-hot encode: result[i, j] = 1 if the character with index j occurs in article i.
# result[0] output: array([0., 1., 1., 1., 0., 1., 1., 1., 1., 1., 0., 0., 1., 1., 0., 0., 0.,..])
result = np.zeros(shape=(len(news_list), max_features))
for i, list_ in enumerate(news_list):
    for word in list_:
        if word in token_index:
            result[i, token_index[word]] = 1

# Save with numpy in integer format so the file can be loaded back as an ndarray, not as text.
np.savetxt(text_path + "cnews_one_hot_diy.txt", result, fmt="%d")

[Figure: one-hot encoding of the first article]
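
For comparison, the same four steps can be condensed with collections.Counter. This is only a sketch on a made-up toy corpus (the documents and the tiny max_features value are assumptions chosen so the result fits on screen), not the script that produced the file above. It also measures how sparse the matrix is.

```python
from collections import Counter
import numpy as np

# Toy corpus: each document is already a list of characters (hypothetical data).
docs = [list("abca"), list("bcd"), list("aab")]
max_features = 4  # matrix width; column 0 is reserved, so 3 characters are kept

# Count characters over the whole corpus and keep the most frequent ones.
counts = Counter(ch for doc in docs for ch in doc)
top = [ch for ch, _ in counts.most_common(max_features - 1)]

# Assign indices starting at 1.
token_index = {ch: i for i, ch in enumerate(top, start=1)}

# Binary document-by-character matrix; characters outside the vocabulary are skipped.
result = np.zeros((len(docs), max_features), dtype=int)
for i, doc in enumerate(docs):
    for ch in doc:
        j = token_index.get(ch)
        if j is not None:
            result[i, j] = 1

print(result)
# [[0 1 1 1]
#  [0 0 1 1]
#  [0 1 1 0]]

# Fraction of entries that are 1.
density = result.sum() / result.size
print(density)
```

On the toy data 'd' is the rarest character, so it falls outside the vocabulary and the second document only lights up the columns for 'b' and 'c'.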

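3. One-hot encoding with keras

Using keras requires installing tensorflow first. Doing one-hot encoding with a deep learning framework from the most awesome company in the universe feels like using a sledgehammer to crack a nut, so call this the deluxe version. Still, going through it is a good way to pick up some Python library usage, which is fun in itself. The process is simple, just two steps: turn the text into char-level lists, then one-hot encode.

Enough talk, here is the code!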

# Import path as of 2019; newer Keras releases may have deprecated or removed this module.
from keras.preprocessing.text import Tokenizer
from collections import Counter
import os
from itertools import chain
import numpy as np

text_path = os.getcwd() + os.sep + 'cnews' + os.sep
max_news = 1000  # number of articles to read

# Load the Chinese stop-word list used for filtering.
with open("ChineseStopWords.txt", "r", encoding="utf8") as fp_stop:
    dict_stopwords = set(line.strip() for line in fp_stop)

# Turn each article into a char-level list with stop words removed.
# Output: [['馬', '曉', '旭', '意', '外', '受', '傷', '國', '奧', '警', '惕', '', '奈',...],[],...[]]
news_list = []
with open(text_path + 'cnews.train.txt', 'r', encoding='utf8') as fp:
    for text in fp:
        if len(news_list) >= max_news:
            break
        if text.strip():
            label, news = text.strip().split('\t', 1)
            news_list.append([char.strip() for char in news if char not in dict_stopwords])
news_count = len(news_list)
print("Read " + str(news_count) + " news articles from cnews.train.txt.\n")

# Count the distinct characters in these articles.
# This is not essential for the task.
counter = Counter(chain.from_iterable(news_list))
char_count = len(counter)
print(str(news_count) + " news articles contain " + str(char_count) + " distinct characters.\n")

# One-hot encode with keras. With num_words=1000 the matrix has 1000 columns;
# column 0 is reserved, so the 999 most frequent characters are kept.
tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(news_list)
# Goes straight from [['馬', '曉', '旭', '意', ...],..[]] to the one-hot matrix, which is convenient.
one_hot_results = tokenizer.texts_to_matrix(news_list, mode='binary')
print("Vector length of each article after one-hot encoding:", one_hot_results.shape[1], '\n')

# Save the array with numpy so it can be loaded back as an array, not as a string.
np.savetxt(text_path + "cnews_one_hot.txt", one_hot_results, fmt="%d")

# Load the saved array back.
news_one_hot = np.loadtxt(text_path + "cnews_one_hot.txt", dtype=float)
print("Type of the loaded data:", type(news_one_hot))

Printed output:

Read 1000 news articles from cnews.train.txt.

1000 news articles contain 3164 distinct characters.

Vector length of each article after one-hot encoding: 1000 

Type of the loaded data: <class 'numpy.ndarray'>

[Figure: one-hot encoding of the first article]
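
The save-and-reload step is easy to check in isolation. Below is a small sketch of the np.savetxt / np.loadtxt round trip on a made-up 2*3 matrix; the file name is hypothetical and the file lives in a temporary directory.

```python
import os
import tempfile
import numpy as np

matrix = np.array([[0, 1, 1], [1, 0, 0]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "one_hot_demo.txt")  # hypothetical file name
    np.savetxt(path, matrix, fmt="%d")            # one row per line, values separated by spaces
    as_float = np.loadtxt(path)                   # loadtxt returns floats by default
    as_int = np.loadtxt(path, dtype=int)          # ask for integers explicitly

print(as_float.dtype, as_int.dtype)
print(np.array_equal(as_int, matrix))
```

Both calls return an ndarray; only the element type differs, so passing dtype=int gets back exactly the 0/1 integers that were written.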

References:

《Python深度學習》 (Deep Learning with Python)
