LSTM Text Sentiment Analysis / Sequence Classification with Keras

Reposted article. 2017-04-24 13:12. Tags: python / LSTM / data mining and machine learning


For reference, see http://spaces.ac.cn/archives/3414/

neg.xls looks like this:

pos.xls looks like this:

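import numpy as np
import pandas as pd
import jieba
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense, Activation

neg = pd.read_excel('neg.xls', header=None, index_col=None)
pos = pd.read_excel('pos.xls', header=None, index_col=None)  # training corpus loaded
pos['mark'] = 1
neg['mark'] = 0  # label the training corpus: 1 = positive, 0 = negative
pn = pd.concat([pos, neg], ignore_index=True)  # merge the two corpora
neglen = len(neg)
poslen = len(pos)  # corpus sizes
cw = lambda x: list(jieba.cut(x))  # tokenizer: split a sentence into words with jieba
pn['words'] = pn[0].apply(cw)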

comment = pd.read_excel('sum.xls')  # read the review texts
# comment = pd.read_csv('a.csv', encoding='utf-8')
comment = comment[comment['rateContent'].notnull()]  # keep only non-empty reviews
comment['words'] = comment['rateContent'].apply(cw)  # tokenize the reviews
d2v_train = pd.concat([pn['words'], comment['words']], ignore_index=True)

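w = []  # gather every token into one list
for i in d2v_train:
    w.extend(i)
dict = pd.DataFrame(pd.Series(w).value_counts())  # count how often each word occurs (note: shadows the built-in dict)
del w, d2v_train
dict['id'] = list(range(1, len(dict) + 1))  # ids start at 1; 0 is left for padding
get_sent = lambda x: list(dict['id'][x])  # word list -> id list
pn['sent'] = pn['words'].apply(get_sent)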

maxlen = 50
print("Pad sequences (samples x time)")
pn['sent'] = list(sequence.pad_sequences(pn['sent'], maxlen=maxlen))
x = np.array(list(pn['sent']))[::2]  # training set (even rows)
y = np.array(list(pn['mark']))[::2]
xt = np.array(list(pn['sent']))[1::2]  # test set (odd rows)
yt = np.array(list(pn['mark']))[1::2]
xa = np.array(list(pn['sent']))  # full set
ya = np.array(list(pn['mark']))

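print('Build model...')
model = Sequential()
model.add(Embedding(len(dict) + 1, 256))
model.add(LSTM(128))  # Keras 0.x form: LSTM(256, 128); try using a GRU instead, for fun
model.add(Dropout(0.5))
model.add(Dense(1))  # Keras 0.x form: Dense(128, 1)
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print('Fit model...')
# Note: this fits and scores on the full set (xa, ya), so the number printed
# below is accuracy on the training data. Fit on x, y and score on xt, yt
# for a held-out estimate.
model.fit(xa, ya, batch_size=32, epochs=4)  # training takes several hours
classes = (model.predict(xa) > 0.5).astype('int32').ravel()
acc = np.mean(classes == ya)
print('Test accuracy:', acc)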
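The listing defines an alternating train/test split (x, y and xt, yt) but then fits and scores on the full set xa, ya. The sketch below uses made-up probabilities instead of a real model to show how the [::2] and [1::2] slices partition the samples, and how sigmoid outputs become class labels and an accuracy. The names `probs`, `threshold`, and `accuracy` are illustrative assumptions, not part of the original code.

```python
# Toy stand-ins for the real data: 6 labelled samples and the sigmoid
# outputs a trained model might produce for them (hypothetical values).
labels = [1, 1, 1, 0, 0, 0]
probs = [0.91, 0.40, 0.77, 0.12, 0.63, 0.08]

# Alternating split, as in x = ...[::2] and xt = ...[1::2].
train_idx = list(range(len(labels)))[::2]   # even positions
test_idx = list(range(len(labels)))[1::2]   # odd positions


def accuracy(probabilities, targets, threshold=0.5):
    """Threshold sigmoid outputs and compare them with the targets."""
    predicted = [1 if p > threshold else 0 for p in probabilities]
    hits = sum(1 for p, t in zip(predicted, targets) if p == t)
    return hits / len(targets)


# Score only the held-out (odd) samples.
test_probs = [probs[i] for i in test_idx]
test_labels = [labels[i] for i in test_idx]
test_acc = accuracy(test_probs, test_labels)

print("train indices:", train_idx)
print("test indices:", test_idx)
print("held-out accuracy:", test_acc)
```

With the real model this would correspond to fitting on x, y and scoring on xt, yt.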

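You can try the following:

w = []  # gather every token into one list
for i in d2v_train:
    w.extend(i)
newList = list(set(w))
print("newlist len is")
print(len(newList))
dict = pd.DataFrame(pd.Series(w).value_counts())  # count how often each word occurs
print(type(dict))
print(len(dict))

You will find that print(len(newList)) and print(len(dict)) give the same number. In other words, the length of dict is the number of distinct words.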
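The same check can be reproduced on a toy corpus with only the standard library. This is a sketch, not the original code: `collections.Counter` stands in for `value_counts()`, and the token lists and the name `word_to_id` are made up for illustration. Ids start at 1 so that 0 stays free for padding.

```python
from collections import Counter

# Hypothetical tokenised sentences (what jieba.cut would return per review).
sentences = [
    ["很", "好", "用"],
    ["不", "好", "用"],
    ["很", "差"],
]

# Flatten all tokens into one list, as the loop over d2v_train does.
w = []
for tokens in sentences:
    w.extend(tokens)

counts = Counter(w)      # word -> number of occurrences
distinct = set(w)        # every word exactly once

# Most frequent word gets id 1, the next id 2, and so on; 0 is never assigned.
word_to_id = {
    word: i for i, (word, _) in enumerate(counts.most_common(), start=1)
}

# Convert one sentence to its id sequence, like get_sent.
encoded = [word_to_id[token] for token in sentences[0]]

print(len(distinct), len(counts))   # both equal the vocabulary size
print(encoded)
```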

The key function here is sequence.pad_sequences:

https://keras.io/preprocessing/sequence/#pad_sequences

http://www.360doc.com/content/16/0714/10/1317564_575385964.shtml

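If the maxlen parameter is given (here maxlen is 50), every sentence is cut to 50 words, and a sentence with fewer than 50 words is padded with 0. Note that with the default settings (padding='pre', truncating='pre') the zeros are added at the front and the words that get dropped are the earliest ones, so the last 50 words are kept; pass truncating='post' to keep the first 50 instead.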
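To make the padding and truncation rule concrete, here is a small pure-Python sketch of what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`). The helper `pad_pre` is a hypothetical stand-in written for illustration; it is not the Keras implementation.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic the pad_sequences defaults for a single sequence."""
    kept = seq[-maxlen:]                          # truncating='pre': drop the earliest items
    return [value] * (maxlen - len(kept)) + kept  # padding='pre': zeros go in front


short = pad_pre([5, 8, 2], maxlen=5)
long_ = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short)   # [0, 0, 5, 8, 2]
print(long_)   # [3, 4, 5, 6, 7]
```

Passing `padding='post'` or `truncating='post'` to the real function moves either behaviour to the end of the sequence.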

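First, each word is mapped to a multi-dimensional vector, as Word2Vec does; in this model the Embedding layer learns that mapping:

model.add(Embedding(len(dict) + 1, 256))

For the parameters, see http://www.360doc.com/content/16/0714/09/1317564_575385061.shtml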

http://blog.csdn.net/niuwei22007/article/details/49406355
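Conceptually, an embedding layer is a trainable lookup table with one row per word id. The sketch below uses plain Python lists and random initial values to show the shapes involved. The sizes (`vocab_size = 5`, `dim = 4`) and the name `table` are illustrative assumptions, and a real layer would update the rows during training. The table needs `vocab_size + 1` rows because word ids start at 1 and id 0 is used for padding.

```python
import random

rng = random.Random(0)
vocab_size = 5      # number of distinct words (len(dict) in the article)
dim = 4             # embedding width (256 in the article)

# One row per id 0..vocab_size, hence vocab_size + 1 rows in total.
table = [
    [rng.uniform(-0.05, 0.05) for _ in range(dim)]
    for _ in range(vocab_size + 1)
]


def embed(id_sequence):
    """Replace every word id by its row of the table."""
    return [table[i] for i in id_sequence]


padded = [0, 0, 3, 1, 5]    # a padded id sequence of length 5
vectors = embed(padded)

print(len(vectors), len(vectors[0]))   # sequence length, embedding width
```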

Then:

model.add(LSTM(128))  # Keras 0.x form: LSTM(256, 128); try using a GRU instead, for fun
model.add(Dropout(0.5))
model.add(Dense(1))  # Keras 0.x form: Dense(128, 1)
model.add(Activation('sigmoid'))
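As a rough check on the architecture, the following sketch walks the tensor shapes through the stack and counts trainable parameters with the standard formula for each layer type. It reads LSTM(256, 128) as 256-dimensional inputs with 128 output units and Dense(128, 1) as 128 inputs with 1 output, and it assumes a made-up vocabulary size of 10000; that number is not reported in the original post.

```python
vocab_size = 10000   # hypothetical len(dict)
maxlen = 50          # padded sentence length
embed_dim = 256      # Embedding output width
lstm_units = 128     # LSTM output width

# Shapes for one batch of B sentences.
shapes = [
    ("input ids", ("B", maxlen)),
    ("Embedding", ("B", maxlen, embed_dim)),
    ("LSTM (last step only)", ("B", lstm_units)),
    ("Dropout", ("B", lstm_units)),
    ("Dense + sigmoid", ("B", 1)),
]

# Trainable parameters per layer.
embedding_params = (vocab_size + 1) * embed_dim
# Four gates, each with input weights, recurrent weights and a bias vector.
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
dense_params = lstm_units * 1 + 1   # weights plus one bias

for name, shape in shapes:
    print(name, shape)
print(embedding_params, lstm_params, dense_params)
```

The embedding table dominates the parameter count, which is why the vocabulary size matters far more for model size than the LSTM width does.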

The whole pipeline corresponds to the figure below.

Result:

Now look at the example that ships with Keras, imdb_lstm:

# X_train, X_test and max_features come from the data-loading step of the example
maxlen = 100
print("Pad sequences (samples x time)")
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(LSTM(128))  # Keras 0.x form: LSTM(128, 128); try using a GRU instead, for fun
model.add(Dropout(0.5))
model.add(Dense(1))  # Keras 0.x form: Dense(128, 1)
model.add(Activation('sigmoid'))

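The same reasoning applies.

When training samples are scarce, Dropout is one trick available for preventing overfitting. In each training batch, half of the feature detectors are ignored (the values of half of the hidden units are set to 0), which noticeably reduces overfitting. This weakens the interactions between feature detectors, where interaction means that some detectors can only do their job by relying on other detectors.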
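A minimal sketch of the idea, assuming the common "inverted dropout" formulation: during training each unit is zeroed with probability `rate`, and the survivors are scaled by `1 / (1 - rate)` so the expected activation is unchanged. The function name `dropout` and the sample values are illustrative. Because the mask is drawn independently per unit, roughly half of the units, not exactly half, are dropped in any one batch.

```python
import random


def dropout(activations, rate=0.5, training=True, rng=random):
    """Inverted dropout on a list of activations."""
    if not training:
        return list(activations)      # inference: values pass through unchanged
    keep = 1.0 - rate
    out = []
    for a in activations:
        if rng.random() < keep:
            out.append(a / keep)      # surviving unit, rescaled
        else:
            out.append(0.0)           # dropped unit
    return out


hidden = [0.2, 1.5, 0.7, 0.9, 0.3, 1.1]
rng = random.Random(42)
train_out = dropout(hidden, rate=0.5, rng=rng)
test_out = dropout(hidden, training=False)

print(train_out)
print(test_out)
```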
