Links to study:
Preprocessing with pandas: https://blog.csdn.net/mpk_no1/article/details/71698725
https://www.jianshu.com/p/8d3f929c9444
1. Ideas:
1. First, read the dataset and build a dictionary mapping each word to an id, to prepare the input;
2. I wanted to get the length distribution of the texts and then truncate them, but didn't know how to write it.
The linked post considers things more thoroughly:
1. remove non-ASCII characters, 2. remove newline characters, 3. convert to lowercase.
https://blog.csdn.net/icbm/article/details/79747024 Non-ASCII characters:
[^\x00-\x7f]
Something like this, i.e. characters that are not in the ASCII encoding.
The pandas library is used for this.
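A minimal sketch of this preprocessing and word-to-id step with pandas (the file name, the 'review' column, and the 500-word truncation length are assumptions, not the linked post's exact code):

import pandas as pd
from collections import Counter

# Assumed file/column names for the Kaggle IMDB data (tab-separated)
data = pd.read_csv('./data/labeledTrainData.tsv', sep='\t')

# 1. remove non-ASCII characters, 2. remove newlines, 3. lowercase
data['review'] = (data['review']
                  .str.replace(r'[^\x00-\x7f]', ' ', regex=True)
                  .str.replace('\n', ' ')
                  .str.lower())

# Length distribution of the texts, used to pick a truncation length
lengths = data['review'].str.split().str.len()
print(lengths.describe())

# Build the word -> id dictionary; id 0 is reserved for padding
counter = Counter(w for text in data['review'] for w in text.split())
word2id = {w: i + 1 for i, (w, _) in enumerate(counter.most_common())}

# Convert each review to a fixed-length id sequence (truncate / pad to 500)
def to_ids(text, sequence_length=500):
    ids = [word2id.get(w, 0) for w in text.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))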
2. Using an RNN plus one MLP layer:
import torch
import torch.nn as nn
# `device` is assumed to be defined elsewhere in the script

class RNN(nn.Module):
    def __init__(self, num_classes, input_size, hidden_size, num_layers, sequence_length, embedding_size):
        super(RNN, self).__init__()
        self.num_classes = num_classes
        self.num_layers = num_layers
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.sequence_length = sequence_length  # 1000
        self.embedding_size = embedding_size
        self.embedding = nn.Embedding(input_size, embedding_size)  # the emb_size used here is 200-dimensional
        # num_layers must be passed here as well, otherwise it defaults to 1 and no longer matches init_hidden
        self.rnn = nn.RNN(input_size=embedding_size, hidden_size=hidden_size,
                          num_layers=num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size * 2, num_classes)  # one more fully connected layer than before

    def forward(self, x):
        # Hidden state: (num_layers * num_directions, batch, hidden_size), even with batch_first=True
        h_0 = self.init_hidden(x.size(0))
        # The embedding output is already (batch, seq_len, embedding_size), so no reshape is needed
        embeddings = self.embedding(x)
        # Propagate the input through the RNN
        out, _ = self.rnn(embeddings, h_0)
        # With batch_first=True the output is (batch, seq_len, hidden_size)
        out = out.permute([1, 0, 2])  # permute is not in-place, so the result must be assigned back
        # Concatenate the first and last time steps and feed them to the fully connected layer
        out = self.fc(torch.cat((out[0], out[-1]), -1))
        return out.view(-1, self.num_classes)

    def init_hidden(self, size):
        return torch.zeros(self.num_layers, size, self.hidden_size).to(device)
The train loss and test loss both stay high the whole time:
epoch:0, train loss:0.7623, train accuracy:0.51, test loss:0.8200, test accuracy:0.52, time:32.62
epoch:1, train loss:0.7542, train accuracy:0.53, test loss:0.7367, test accuracy:0.52, time:31.89
epoch:2, train loss:0.7422, train accuracy:0.53, test loss:0.7173, test accuracy:0.51, time:32.06
epoch:3, train loss:0.7572, train accuracy:0.53, test loss:0.7470, test accuracy:0.53, time:31.55
epoch:4, train loss:0.7444, train accuracy:0.53, test loss:0.7474, test accuracy:0.51, time:31.59
3. Try adding fixed embeddings: the 100-dimensional GloVe vectors.
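A sketch of how the fixed 100-dimensional GloVe vectors could be loaded into the embedding layer; the file path and the word2id dictionary from the preprocessing sketch above are assumptions:

import numpy as np
import torch

embedding_size = 100
# Build a (vocab_size, 100) matrix aligned with word2id; words missing from GloVe keep a small random vector
weights = np.random.uniform(-0.05, 0.05, (len(word2id) + 1, embedding_size)).astype('float32')
with open('./data/glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word, vector = values[0], values[1:]
        if word in word2id:
            weights[word2id[word]] = np.asarray(vector, dtype='float32')

# Copy the matrix into the model's embedding layer and freeze it ("fixed" embedding)
model.embedding.weight.data.copy_(torch.from_numpy(weights))
model.embedding.weight.requires_grad = False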
A small issue:
Here add_graph requires an example input, so I generated the input data with torch.zeros:
writer = SummaryWriter('runs/IMDB_RNN_500/')
# simu_input = torch.zeros([batch_size, sequence_length, embedding_size])
# BUG: Expected tensor for argument #1 'indices' to have scalar type Long; but got
#      torch.FloatTensor instead (while checking arguments for embedding)
writer.add_graph(model, simu_input)
import torch
a = torch.zeros([1, 2, 3])
print(type(a))
print(a.dtype)
# Output:
# <class 'torch.Tensor'>
# torch.float32
# The default is float, while the embedding layer needs integer indices
a = torch.zeros([1, 2, 3], dtype=torch.int)
print(type(a))
print(a.dtype)
# Result:
# <class 'torch.Tensor'>
# torch.int32
# Setting the dtype like this works (though the error message above actually asks for Long, i.e. dtype=torch.long)
dtypes in torch: https://ptorch.com/news/187.html
In the end I chose to iterate over the loader like this to get a sample batch, ...
dataiter = iter(train_loader)
sentences, labels = dataiter.__next__()
writer.add_graph(model, sentences.to(device))

# RuntimeWarning: Iterating over a tensor might cause the trace to be incorrect.
# Passing a tensor of different shape won't change the number of iterations executed
# (and might lead to errors or silently give incorrect results).
# Expected hidden size (1, tensor(32), 100), got (tensor(2), tensor(32), tensor(100))
# Error occurs, No graph saved

The hidden-size error appears to come from init_hidden building h_0 with num_layers=2 while nn.RNN was constructed with the default num_layers=1, so the two have to match.
Saved the train and test loss curves for the RNN:
Anyway, the loss stays high the whole time.
4. Replace the RNN with a bidirectional LSTM/GRU
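The exact bidirectional model code isn't pasted here; a minimal sketch of the swap, assuming the same embedding and hidden sizes as the RNN class above (bidirectional=True doubles the output size, so the fully connected layer takes hidden_size * 2):

import torch
import torch.nn as nn

class BiLSTM(nn.Module):
    def __init__(self, num_classes, vocab_size, embedding_size, hidden_size, num_layers):
        super(BiLSTM, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.lstm = nn.LSTM(input_size=embedding_size, hidden_size=hidden_size,
                            num_layers=num_layers, batch_first=True,
                            bidirectional=True)
        self.fc = nn.Linear(hidden_size * 2, num_classes)  # forward + backward directions

    def forward(self, x):
        embeddings = self.embedding(x)   # (batch, seq_len, embedding_size)
        out, _ = self.lstm(embeddings)   # h_0 / c_0 default to zeros
        # The last time step already concatenates the two directions: (batch, hidden_size * 2)
        return self.fc(out[:, -1, :])

Replacing nn.LSTM with nn.GRU in the same place gives the GRU variant tried further below.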
The bidirectional LSTM works really well! Much better than the unidirectional RNN.
epoch:0, train accuracy:0.72, test accuracy:0.79, time:30.01
epoch:1, train accuracy:0.83, test accuracy:0.82, time:30.31
epoch:2, train accuracy:0.85, test accuracy:0.84, time:30.29
epoch:3, train accuracy:0.87, test accuracy:0.82, time:29.88
epoch:4, train accuracy:0.89, test accuracy:0.83, time:30.89
Accuracy is higher across the board and the loss is also decreasing steadily. I wanted to put the two curves into one plot, but the x-axes are different, so that doesn't work.
Trying a GRU:
epoch:0, train accuracy:0.77, test accuracy:0.76, time:30.42
epoch:1, train accuracy:0.84, test accuracy:0.82, time:30.37
epoch:2, train accuracy:0.86, test accuracy:0.81, time:30.59
epoch:3, train accuracy:0.87, test accuracy:0.82, time:30.74
epoch:4, train accuracy:0.87, test accuracy:0.83, time:29.96
It also works quite well.
The LSTM is still a bit better, though.
5. Sentiment classification with a CNN
The linked post gives a CNN for text, so I just used it; its model:
I experimented with the layer shapes:
import torch
import torch.nn as nn

m = nn.Conv2d(1, 1, (3, 100))   # (in_channels, out_channels, (kernel_size1, kernel_size2))
pool = nn.MaxPool2d((498, 1))   # the pooling (kernel_size1, kernel_size2); nothing surprising here
inp = torch.randn(32, 1, 500, 100)
a = m(inp)
b = pool(a)
# Results:
# >>> a.size()
# torch.Size([32, 1, 498, 1])
# >>> b.size()
# torch.Size([32, 1, 1, 1])
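The post's model itself isn't pasted above, so as a rough sketch only: a minimal text CNN consistent with the shape experiment (500-word inputs, 100-dimensional embeddings, a single (3, 100) filter) could look like the following; a real text CNN usually uses many filters and several kernel widths.

import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size, embedding_size=100, sequence_length=500, num_classes=2):
        super(TextCNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        # Convolve over 3-word windows spanning the full embedding dimension
        self.conv = nn.Conv2d(1, 1, (3, embedding_size))
        # Max-pool over the remaining 498 positions, as in the shape experiment above
        self.pool = nn.MaxPool2d((sequence_length - 3 + 1, 1))
        self.fc = nn.Linear(1, num_classes)

    def forward(self, x):
        emb = self.embedding(x).unsqueeze(1)       # (batch, 1, seq_len, embedding_size)
        out = torch.relu(self.conv(emb))           # (batch, 1, 498, 1)
        out = self.pool(out).view(x.size(0), -1)   # (batch, 1)
        return self.fc(out)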
Results:
epoch:0, train accuracy:0.72, test accuracy:0.77, time:3.59
epoch:1, train accuracy:0.77, test accuracy:0.78, time:3.14
epoch:2, train accuracy:0.79, test accuracy:0.78, time:3.19
epoch:3, train accuracy:0.79, test accuracy:0.80, time:3.20
epoch:4, train accuracy:0.79, test accuracy:0.79, time:3.17
The results are OK, and it really is very fast.
7. Save the model, then load the model to make predictions,
or just write a predict function that reads the test file, makes predictions, stores the results to a file, and then uploads it to Kaggle.
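For the save/load part, a minimal sketch with the standard state_dict API (the file name is an assumption):

import torch

# Save after training
torch.save(model.state_dict(), './data/imdb_model.pt')

# Load for prediction (the model must first be constructed with the same arguments)
model.load_state_dict(torch.load('./data/imdb_model.pt'))
model.eval()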
1. Here prediction is done one sentence at a time; could a batch_size be used to predict 32 or 64 sentences at once?
2. The sentences are not padded with zeros up to 500 here; that may work for a single sentence, but what about a batch of data? Would padding be needed?
3. Or can this prediction only be done sentence by sentence? (A batched sketch follows this list.)
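A sketch of batched prediction with zero-padding to 500, which is what questions 1-3 above are asking about. It assumes a trained model, the word2id dictionary, and a pandas DataFrame testData with 'id' and 'review' columns; the helper name encode_review is made up for this sketch:

import torch

def encode_review(text, word2id, sequence_length=500):
    # Truncate to 500 words and pad with id 0, so every row in the batch has the same length
    ids = [word2id.get(w, 0) for w in text.lower().split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

def predict_batches(model, testData, word2id, device, batch_size=64):
    model.eval()
    preds = []
    with torch.no_grad():
        for start in range(0, len(testData), batch_size):
            chunk = testData['review'].iloc[start:start + batch_size]
            batch = torch.tensor([encode_review(t, word2id) for t in chunk],
                                 dtype=torch.long).to(device)
            labels = model(batch).max(1)[1]      # predicted class for each review
            preds.extend(labels.cpu().tolist())  # plain ints rather than tensors
    return preds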
The consequence of writing it out directly like this is:
pred.append((testData['id'][i], label.data.max(1)[1].cpu()))  # only moved to the CPU for storage; its type is still a tensor
# Write to the file
with open('./data/summit.tsv', 'w') as f:
    for data in pred:
        # str(data[1]) therefore writes e.g. "tensor(1)" instead of "1", and no newline separates the rows
        f.write(str(data[0]) + '\t' + str(data[1]))
https://www.kaggle.com/c/word2vec-nlp-tutorial
After submitting, the score is 0.82. The submission has to be converted to CSV format. So, for this competition, how can the score be improved? I have a few ideas to try:
1. First, do more preprocessing of the text, following that blog post: use regular expressions to strip things out and reduce OOV words.
2. Fine-tune the word embeddings.
3. Try other classification models; for example, the text CNN is faster, but would it also be better?
4. Tune the hyperparameters.
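Since the submission has to be a CSV, a minimal sketch with pandas; the 'id' / 'sentiment' column names follow the competition's sample submission, and preds is assumed to hold one label per test review:

import pandas as pd

# testData['id'] and preds are assumed to be aligned, one entry per test review
submission = pd.DataFrame({'id': testData['id'], 'sentiment': preds})
submission.to_csv('./data/submission.csv', index=False)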