Links to study:
Preprocessing with pandas: https://blog.csdn.net/mpk_no1/article/details/71698725
https://www.jianshu.com/p/8d3f929c9444
1. Ideas:
1. First, read the dataset and build a dictionary mapping each word to an id, to prepare the input;
2. I wanted to get the length distribution of the texts and then truncate them, but didn't know how to write it.
The linked post considers things more thoroughly:
1. remove non-ASCII characters, 2. remove newline characters, 3. convert to lowercase.
https://blog.csdn.net/icbm/article/details/79747024 Non-ASCII characters:
[^\x00-\x7f]
Something like this, i.e. characters that are not in the ASCII encoding.
The pandas library is used for this.
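A minimal sketch of this preprocessing and word-to-id step with pandas (the file name, the 'review' column, and the 500-word truncation length are assumptions, not the linked post's exact code):

import pandas as pd
from collections import Counter

# Assumed file/column names for the Kaggle IMDB data (tab-separated)
data = pd.read_csv('./data/labeledTrainData.tsv', sep='\t')

# 1. remove non-ASCII characters, 2. remove newlines, 3. lowercase
data['review'] = (data['review']
                  .str.replace(r'[^\x00-\x7f]', ' ', regex=True)
                  .str.replace('\n', ' ')
                  .str.lower())

# Length distribution of the texts, used to pick a truncation length
lengths = data['review'].str.split().str.len()
print(lengths.describe())

# Build the word -> id dictionary; id 0 is reserved for padding
counter = Counter(w for text in data['review'] for w in text.split())
word2id = {w: i + 1 for i, (w, _) in enumerate(counter.most_common())}

# Convert each review to a fixed-length id sequence (truncate / pad to 500)
def to_ids(text, sequence_length=500):
    ids = [word2id.get(w, 0) for w in text.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))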
2. Using an RNN plus one MLP layer:
import torch
import torch.nn as nn
# `device` is assumed to be defined elsewhere in the script

class RNN(nn.Module):
    def __init__(self, num_classes, input_size, hidden_size, num_layers, sequence_length, embedding_size):
        super(RNN, self).__init__()
        self.num_classes = num_classes
        self.num_layers = num_layers
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.sequence_length = sequence_length  # 1000
        self.embedding_size = embedding_size
        self.embedding = nn.Embedding(input_size, embedding_size)  # the emb_size used here is 200-dimensional
        # num_layers must be passed here as well, otherwise it defaults to 1 and no longer matches init_hidden
        self.rnn = nn.RNN(input_size=embedding_size, hidden_size=hidden_size,
                          num_layers=num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size * 2, num_classes)  # one more fully connected layer than before

    def forward(self, x):
        # Hidden state: (num_layers * num_directions, batch, hidden_size), even with batch_first=True
        h_0 = self.init_hidden(x.size(0))
        # The embedding output is already (batch, seq_len, embedding_size), so no reshape is needed
        embeddings = self.embedding(x)
        # Propagate the input through the RNN
        out, _ = self.rnn(embeddings, h_0)
        # With batch_first=True the output is (batch, seq_len, hidden_size)
        out = out.permute([1, 0, 2])  # permute is not in-place, so the result must be assigned back
        # Concatenate the first and last time steps and feed them to the fully connected layer
        out = self.fc(torch.cat((out[0], out[-1]), -1))
        return out.view(-1, self.num_classes)

    def init_hidden(self, size):
        return torch.zeros(self.num_layers, size, self.hidden_size).to(device)
The train loss and test loss both stay high the whole time:
epoch:0, train loss:0.7623, train accuracy:0.51, test loss:0.8200, test accuracy:0.52, time:32.62
epoch:1, train loss:0.7542, train accuracy:0.53, test loss:0.7367, test accuracy:0.52, time:31.89
epoch:2, train loss:0.7422, train accuracy:0.53, test loss:0.7173, test accuracy:0.51, time:32.06
epoch:3, train loss:0.7572, train accuracy:0.53, test loss:0.7470, test accuracy:0.53, time:31.55
epoch:4, train loss:0.7444, train accuracy:0.53, test loss:0.7474, test accuracy:0.51, time:31.59
3. Try adding fixed embeddings: the 100-dimensional GloVe vectors.
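A sketch of how the fixed 100-dimensional GloVe vectors could be loaded into the embedding layer; the file path and the word2id dictionary from the preprocessing sketch above are assumptions:

import numpy as np
import torch

embedding_size = 100
# Build a (vocab_size, 100) matrix aligned with word2id; words missing from GloVe keep a small random vector
weights = np.random.uniform(-0.05, 0.05, (len(word2id) + 1, embedding_size)).astype('float32')
with open('./data/glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word, vector = values[0], values[1:]
        if word in word2id:
            weights[word2id[word]] = np.asarray(vector, dtype='float32')

# Copy the matrix into the model's embedding layer and freeze it ("fixed" embedding)
model.embedding.weight.data.copy_(torch.from_numpy(weights))
model.embedding.weight.requires_grad = False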
A small issue:
Here add_graph requires an example input, so I generated the input data with torch.zeros:
writer = SummaryWriter('runs/IMDB_RNN_500/')
# simu_input = torch.zeros([batch_size, sequence_length, embedding_size])
# BUG: Expected tensor for argument #1 'indices' to have scalar type Long; but got
#      torch.FloatTensor instead (while checking arguments for embedding)
writer.add_graph(model, simu_input)
import torch
a = torch.zeros([1, 2, 3])
print(type(a))
print(a.dtype)
# Output:
# <class 'torch.Tensor'>
# torch.float32
# The default is float, while the embedding layer needs integer indices
a = torch.zeros([1, 2, 3], dtype=torch.int)
print(type(a))
print(a.dtype)
# Result:
# <class 'torch.Tensor'>
# torch.int32
# Setting the dtype like this works (though the error message above actually asks for Long, i.e. dtype=torch.long)
dtypes in torch: https://ptorch.com/news/187.html
In the end I chose to iterate over the loader like this to get a sample batch, ...
dataiter = iter(train_loader)
sentences, labels = dataiter.__next__()
writer.add_graph(model, sentences.to(device))

# RuntimeWarning: Iterating over a tensor might cause the trace to be incorrect.
# Passing a tensor of different shape won't change the number of iterations executed
# (and might lead to errors or silently give incorrect results).
# Expected hidden size (1, tensor(32), 100), got (tensor(2), tensor(32), tensor(100))
# Error occurs, No graph saved

The hidden-size error appears to come from init_hidden building h_0 with num_layers=2 while nn.RNN was constructed with the default num_layers=1, so the two have to match.
Saved the train and test loss curves for the RNN:
Anyway, the loss stays high the whole time.
4. Replace the RNN with a bidirectional LSTM/GRU
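The exact bidirectional model code isn't pasted here; a minimal sketch of the swap, assuming the same embedding and hidden sizes as the RNN class above (bidirectional=True doubles the output size, so the fully connected layer takes hidden_size * 2):

import torch
import torch.nn as nn

class BiLSTM(nn.Module):
    def __init__(self, num_classes, vocab_size, embedding_size, hidden_size, num_layers):
        super(BiLSTM, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.lstm = nn.LSTM(input_size=embedding_size, hidden_size=hidden_size,
                            num_layers=num_layers, batch_first=True,
                            bidirectional=True)
        self.fc = nn.Linear(hidden_size * 2, num_classes)  # forward + backward directions

    def forward(self, x):
        embeddings = self.embedding(x)   # (batch, seq_len, embedding_size)
        out, _ = self.lstm(embeddings)   # h_0 / c_0 default to zeros
        # The last time step already concatenates the two directions: (batch, hidden_size * 2)
        return self.fc(out[:, -1, :])

Replacing nn.LSTM with nn.GRU in the same place gives the GRU variant tried further below.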
The bidirectional LSTM works really well! Much better than the unidirectional RNN.
epoch:0, train accuracy:0.72, test accuracy:0.79, time:30.01
epoch:1, train accuracy:0.83, test accuracy:0.82, time:30.31
epoch:2, train accuracy:0.85, test accuracy:0.84, time:30.29
epoch:3, train accuracy:0.87, test accuracy:0.82, time:29.88
epoch:4, train accuracy:0.89, test accuracy:0.83, time:30.89
Accuracy is higher across the board and the loss is also decreasing steadily. I wanted to put the two curves into one plot, but the x-axes are different, so that doesn't work.
Trying a GRU:
epoch:0, train accuracy:0.77, test accuracy:0.76, time:30.42
epoch:1, train accuracy:0.84, test accuracy:0.82, time:30.37
epoch:2, train accuracy:0.86, test accuracy:0.81, time:30.59
epoch:3, train accuracy:0.87, test accuracy:0.82, time:30.74
epoch:4, train accuracy:0.87, test accuracy:0.83, time:29.96
It also works quite well.
The LSTM is still a bit better, though.
5. Sentiment classification with a CNN
The linked post gives a CNN for text, so I just used it; its model:
I experimented with the layer shapes:
import torch
import torch.nn as nn

m = nn.Conv2d(1, 1, (3, 100))   # (in_channels, out_channels, (kernel_size1, kernel_size2))
pool = nn.MaxPool2d((498, 1))   # the pooling (kernel_size1, kernel_size2); nothing surprising here
inp = torch.randn(32, 1, 500, 100)
a = m(inp)
b = pool(a)
# Results:
# >>> a.size()
# torch.Size([32, 1, 498, 1])
# >>> b.size()
# torch.Size([32, 1, 1, 1])
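The post's model itself isn't pasted above, so as a rough sketch only: a minimal text CNN consistent with the shape experiment (500-word inputs, 100-dimensional embeddings, a single (3, 100) filter) could look like the following; a real text CNN usually uses many filters and several kernel widths.

import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size, embedding_size=100, sequence_length=500, num_classes=2):
        super(TextCNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        # Convolve over 3-word windows spanning the full embedding dimension
        self.conv = nn.Conv2d(1, 1, (3, embedding_size))
        # Max-pool over the remaining 498 positions, as in the shape experiment above
        self.pool = nn.MaxPool2d((sequence_length - 3 + 1, 1))
        self.fc = nn.Linear(1, num_classes)

    def forward(self, x):
        emb = self.embedding(x).unsqueeze(1)       # (batch, 1, seq_len, embedding_size)
        out = torch.relu(self.conv(emb))           # (batch, 1, 498, 1)
        out = self.pool(out).view(x.size(0), -1)   # (batch, 1)
        return self.fc(out)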
Results:
epoch:0, train accuracy:0.72, test accuracy:0.77, time:3.59
epoch:1, train accuracy:0.77, test accuracy:0.78, time:3.14
epoch:2, train accuracy:0.79, test accuracy:0.78, time:3.19
epoch:3, train accuracy:0.79, test accuracy:0.80, time:3.20
epoch:4, train accuracy:0.79, test accuracy:0.79, time:3.17
The results are OK, and it really is very fast.
7. Save the model, then load the model to make predictions,
or just write a predict function that reads the test file, makes predictions, stores the results to a file, and then uploads it to Kaggle.
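For the save/load part, a minimal sketch with the standard state_dict API (the file name is an assumption):

import torch

# Save after training
torch.save(model.state_dict(), './data/imdb_model.pt')

# Load for prediction (the model must first be constructed with the same arguments)
model.load_state_dict(torch.load('./data/imdb_model.pt'))
model.eval()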
1. Here prediction is done one sentence at a time; could a batch_size be used to predict 32 or 64 sentences at once?
2. The sentences are not padded with zeros up to 500 here; that may work for a single sentence, but what about a batch of data? Would padding be needed?
3. Or can this prediction only be done sentence by sentence? (A batched sketch follows this list.)
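A sketch of batched prediction with zero-padding to 500, which is what questions 1-3 above are asking about. It assumes a trained model, the word2id dictionary, and a pandas DataFrame testData with 'id' and 'review' columns; the helper name encode_review is made up for this sketch:

import torch

def encode_review(text, word2id, sequence_length=500):
    # Truncate to 500 words and pad with id 0, so every row in the batch has the same length
    ids = [word2id.get(w, 0) for w in text.lower().split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

def predict_batches(model, testData, word2id, device, batch_size=64):
    model.eval()
    preds = []
    with torch.no_grad():
        for start in range(0, len(testData), batch_size):
            chunk = testData['review'].iloc[start:start + batch_size]
            batch = torch.tensor([encode_review(t, word2id) for t in chunk],
                                 dtype=torch.long).to(device)
            labels = model(batch).max(1)[1]      # predicted class for each review
            preds.extend(labels.cpu().tolist())  # plain ints rather than tensors
    return preds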
The consequence of writing it out directly like this is:
pred.append((testData['id'][i], label.data.max(1)[1].cpu()))  # only moved to the CPU for storage; its type is still a tensor
# Write to the file
with open('./data/summit.tsv', 'w') as f:
    for data in pred:
        # str(data[1]) therefore writes e.g. "tensor(1)" instead of "1", and no newline separates the rows
        f.write(str(data[0]) + '\t' + str(data[1]))
https://www.kaggle.com/c/word2vec-nlp-tutorial
After submitting, the score is 0.82. The submission has to be converted to CSV format. So, for this competition, how can the score be improved? I have a few ideas to try:
1. First, do more preprocessing of the text, following that blog post: use regular expressions to strip things out and reduce OOV words.
2. Fine-tune the word embeddings.
3. Try other classification models; for example, the text CNN is faster, but would it also be better?
4. Tune the hyperparameters.
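Since the submission has to be a CSV, a minimal sketch with pandas; the 'id' / 'sentiment' column names follow the competition's sample submission, and preds is assumed to hold one label per test review:

import pandas as pd

# testData['id'] and preds are assumed to be aligned, one entry per test review
submission = pd.DataFrame({'id': testData['id'], 'sentiment': preds})
submission.to_csv('./data/submission.csv', index=False)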