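Text Sentiment Classification: Experiment Notes
This experiment is HW4 of Professor Hung-yi Lee's 2020 Machine Learning course at National Taiwan University. [Assignment description] [Official reference implementation] [My implementation]
Data Overview
The data for this experiment consists of tweets from Twitter, each labeled as positive or negative: 0 --> negative, 1 --> positive.
The data comprises:
Labeled training data: 200,000 samples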

    1 +++$+++ are wtf ... awww thanks !
    1 +++$+++ leavingg to wait for kaysie to arrive myspacin itt for now ilmmthek .!
    0 +++$+++ i wish i could go and see duffy when she comes to mamaia romania .
    1 +++$+++ i know eep ! i can ' t wait for one more day ....
    0 +++$+++ so scared and feeling sick . fuck ! hope someone at hr help ... wish it would be wendita or karen .
    0 +++$+++ my b day was thurs . i wanted 2 do 5 this weekend for my b day but i guess close enough next weekend . going alone
    1 +++$+++ e3 is in the trending topics only just noticed ive been tweeting on my iphone until now
    1 +++$+++ where did you get him from i know someone who would love that !
    0 +++$+++ dam just got buzzed by another huge fly ! this time it landed on my head ... not impressed
    1 +++$+++ tomorrowwwwwwwww !!! you ' ll love tomorrow ' s news !
    0 +++$+++ gonna try 2 sleep . damn garageband next to me won ' t let me tho
    0 +++$+++ wish weekend .. but not really also .. cuz next monday is exam and i haven ' t studied at all yet hate exam .. grr
    1 +++$+++ check this vid out .... you ' ll piss yourself laughin
    0 +++$+++ damn you gavin !!!!!! i want my computer back !!!!
    1 +++$+++ it ' s great that you feel better , fresh air is nice im sure it will help too
    0 +++$+++ got a bloody wheel clamp yesterday ï ¿ ½150 for 15 mins parking
    0 +++$+++ homework and summer school . we ' ll go soon though !
    0 +++$+++ no it ' s not right ..... it is so wrong ... i would never have expected it
    1 +++$+++ says sa mga mag gf bf na nag aaway make piece not war
    1 +++$+++ only has under 200 words left to write on her assignment
    0 +++$+++ son graduated 5th grade today hes so grown !
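As a minimal sketch of how one labeled record could be read, assuming each raw line has the form `label +++$+++ text` and that tokens are separated by spaces (the function name `parse_labeled_line` is hypothetical, not taken from the assignment code):

```python
SEPARATOR = " +++$+++ "  # delimiter between the label and the tweet text


def parse_labeled_line(line):
    """Split one labeled line into (label, tokens)."""
    # Split only once, in case the tweet text itself contains the separator.
    label, text = line.rstrip("\n").split(SEPARATOR, 1)
    return int(label), text.split()


label, tokens = parse_labeled_line("1 +++$+++ are wtf ... awww thanks !")
print(label, tokens)
```

The label comes back as an int (0 for negative, 1 for positive) and the tweet as a list of tokens, which is the shape the later word-to-index step expects.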
Unlabeled training data: 1,200,000 samples, used for semi-supervised learning

    mkhang mlbo . dami niang followers ee . di q rin naman sia masisisi . desperate n kng desperate , pero dpt tlga replyn nia q = d
    don ' t you hate it when you hang on to a seemingly interesting movie to see the ending only to find out that the ending sucks ?
    ok so never went to the movies because friend wasn ' t feeling well but next weekend . back to work today , wasn ' t too bad .
    can ' t wait to see diversity ' s performance !
    i love britney spears haha joey this is what u do go party with eric or do things haha
    wish i could call in but i can ' t do blogtalk from work
    1 more day !
    nursing celeste with a tummy ache .
    hates being this burnt !! ouch
    just couldn ' t sleep last night . working 7a 3p , than dinner with megan . happy bday jl !
    i love slaves ! by david raccah , linkedin , rotfl
    is being super organised and making up orders to post first thing tomorrow !
    laying in the bed . it feels soooooo good . what a long day
    finally , at the airport . currently chilling out at the citibank lounge . maaaan , the wi fi here doesn ' t work ! lameeee !
    back and still feeling shattered . still no cockney ... i ' m ashamed to say .
    so do i
    don ' t ask me difficult questions , i know how to spell , but not ponder the bigger picture !
    hey guys ! i am a big fan too just like my twin lol .. have a good day ! and wishin ya the best of luck ! xd
    ay dios mio ! 2 weeks left of college !!! ah can ' t wait !!
    oh , we must be related ! i ' ve heard that line before !
    i know , i don ' t know if kayley knows . he ' ll probably be resting again tomorrow , i hope not he ' ll be better .
    good luck
    the app never works for me
    ew , im not that clever , im just lucky what bother you at the class ? the lessons ?
    whoah crap , that was a mistake ... do not put the three letters together in a tweet im el im .. just got overwhelmed with follow bots .
    problem with feedburner again . showing no . of feed readers less than actual ones .
    im having problems don ' t worry
    listening too mgmt time to pretend
Testing data: 200,000 samples (100,000 public, 100,000 private)

    id,text
    0,my dog ate our dinner . no , seriously ... he ate it .
    1,omg last day sooon n of primary noooooo x im gona be swimming out of school wif the amount of tears am gona cry
    2,stupid boys .. they ' re so .. stupid !
    3,hi ! do u know if the nurburgring is open for tourists today ? we want to go , but there is an event today
    4,having lunch in the office , and thinking of how to resolve this discount form issue
    5,shopping was fun
    6,wondering where all the nice weather has gone .
    7,morning ! yeeessssssss new mimi in aug
    8,umm ... maybe that ' s how the british spell it ?
    9,yes it ' s 3 : 50 am . yes i ' m still awake . yes i can ' t sleep . yes i ' ll regret it tomorrow . haha i love you mr saturday
    10,cute heart shaped portal cube . my baby is playing games , im reading fan fictions !
    11,had a song on mtv movie awards !!!!!
    12,thanks nite
    13,did not start her religion isu i will fail
    14,that sounds wonderful !! i shall have to try it one day soon !
    15,i love ya mariah , i love listening to your songs , your such an inspiration for alot of people out there !!!
    16,there is sooo much love on here that i could faint ! lol . go celtics !! i miss my b ball team . i ' m proud of you donnie !
    17,just found out i ' m gonna be let out early tomorrow , cos we ' re getting the results . omg if i fail science ...
    18,that was a good thing to wake up to your right we will , and thats why god made us friends !!! ily
    19,and old cam ' pic of tene and i . goodtimes . heehe . i want my cake now , mum
    20,ooh my god ! i know the feeling i cannot stand getting into london from harold wood
    21,nothing ! just kept us there for 20 minutes until they realized a walkie talkie is just a little toy and not a spy tool
    22,6flags today teexxxt i need to shower but i ' m being lazy . i really don ' t feel that good
    23,apparently , these are from filming , not the aftermath of the skanky hoebag fans . celebrity sites twisted the truth
    24,hey fairuz ili ! nice to see some friends here
    25,also cancelled my nikon 50mm lens order needed to buy some struts and tires for my car ...
    26,headed to dallas tomorrow ... need some sleep but i am not tired yet !!
    27,i just found out that i won a shirt from pretty effin sweet , eh ? i wonder what i ' ll get
    28,sad i didnt get tickets 2 nin ja in albuquerque and it sold out
    29,has had the most enjoyable day she ' s had for a lonng time
    30,this should do the trick
    31,o i have 21 tests i do 10 subjects lucky ... n o right ... kl is it hard ??
    32,lol ! i thought so ! have fun in vegas .
    33,sarah vowell ? if your dad likes humor with his history
    34,i like corpus
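One detail of the test file is worth illustrating: the tweet text can itself contain commas, so only the first comma on each line separates the id from the text. A minimal sketch, assuming a header line `id,text` followed by one record per line (the function names are hypothetical):

```python
def parse_test_line(line):
    """Split one test line into (id, tokens); only the first comma is a delimiter."""
    idx, text = line.rstrip("\n").split(",", 1)
    return int(idx), text.split()


def load_test_lines(lines):
    """Skip the `id,text` header and parse every remaining line."""
    return [parse_test_line(line) for line in lines[1:]]


sample = [
    "id,text",
    "0,my dog ate our dinner . no , seriously ... he ate it .",
    "5,shopping was fun",
]
records = load_test_lines(sample)
print(records[0])
```

Splitting on every comma instead would cut the first tweet off after "no", which is why `split(",", 1)` is used here.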
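Implementation Steps
1. Data Preprocessing
1.1) Read the data, including train_label_data, train_no_label_data, and test_data. Feed it into a word2vec model (gensim) and train to obtain w2v_all.model
1.2) Read the labeled training data train_label --> input, and convert the sentences in input into word-embedding form --> train_x:
- From input, build the embedding matrix and the dictionary mapping each word to its idx. Remember to add <PAD> and <UNK>
- Convert every word of every sentence in input to its idx, and store the result in train_x, one sentence per entry, as the model input. A word that has not been seen before is mapped to <UNK>
- Using the hyperparameter sen_len, truncate or pad each sentence (padding fills each missing position with <PAD>) so that all sentences have the same length
1.3) Convert label from str to int --> y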
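The preprocessing steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the author's code: `toy_vectors` stands in for the vectors a trained word2vec model would supply, the helper names are hypothetical, and the zero vectors used for `<PAD>` and `<UNK>` are one possible choice (random initialisation is also common).

```python
PAD, UNK = "<PAD>", "<UNK>"


def build_vocab(word_vectors, embedding_dim):
    """Build word2idx and an embedding matrix, appending <PAD> and <UNK> rows."""
    word2idx, matrix = {}, []
    for word, vector in word_vectors.items():
        word2idx[word] = len(word2idx)
        matrix.append(list(vector))
    for special in (PAD, UNK):
        word2idx[special] = len(word2idx)
        # Assumption: zero vectors for the two special tokens.
        matrix.append([0.0] * embedding_dim)
    return word2idx, matrix


def sentence_to_idx(tokens, word2idx, sen_len):
    """Map tokens to indices, then truncate or pad to exactly sen_len."""
    ids = [word2idx.get(tok, word2idx[UNK]) for tok in tokens]
    ids = ids[:sen_len]                              # truncate long sentences
    ids += [word2idx[PAD]] * (sen_len - len(ids))    # pad short sentences
    return ids


# Toy stand-in for trained word2vec vectors (hypothetical values).
toy_vectors = {"i": [0.1, 0.2], "love": [0.3, 0.4], "you": [0.5, 0.6]}
word2idx, embedding_matrix = build_vocab(toy_vectors, embedding_dim=2)

sentences = ["i love you", "i hate you so much"]
train_x = [sentence_to_idx(s.split(), word2idx, sen_len=4) for s in sentences]
y = [int(label) for label in ["1", "0"]]  # step 1.3: str labels to int
print(train_x, y)
```

The first sentence is shorter than sen_len and gets one `<PAD>`; the second is longer, so it is cut to four tokens, with the unseen words mapped to `<UNK>`.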
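2. Preparing the Data
2.1) Split train_x and y into a training set and a validation set: X_train, X_val, y_train, y_val
2.2) Build a Dataset and a DataLoader for train and val, which makes it convenient to shuffle, feed batches to the model, and so on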
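To make the split and batching concrete, here is a framework-free sketch of what these two steps do. It is a stand-in for illustration only: in the actual PyTorch workflow the batching and shuffling would be handled by Dataset and DataLoader, and the names `train_val_split`, `iter_batches`, and the 10% validation ratio are assumptions.

```python
import random


def train_val_split(xs, ys, val_ratio=0.1, seed=0):
    """Shuffle indices once, then carve off the first val_ratio share for validation."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    n_val = int(len(xs) * val_ratio)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    X_train = [xs[i] for i in train_idx]
    X_val = [xs[i] for i in val_idx]
    y_train = [ys[i] for i in train_idx]
    y_val = [ys[i] for i in val_idx]
    return X_train, X_val, y_train, y_val


def iter_batches(xs, ys, batch_size, shuffle=False, seed=0):
    """Yield (inputs, labels) batches, optionally in shuffled order."""
    order = list(range(len(xs)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        chunk = order[start:start + batch_size]
        yield [xs[i] for i in chunk], [ys[i] for i in chunk]


# Toy data: 20 "sentences" of two indices each, with alternating labels.
train_x = [[i, i + 1] for i in range(20)]
y = [i % 2 for i in range(20)]
X_train, X_val, y_train, y_val = train_val_split(train_x, y, val_ratio=0.1)
batches = list(iter_batches(X_train, y_train, batch_size=8, shuffle=True))
print(len(X_train), len(X_val), [len(b[0]) for b in batches])
```

Inputs and labels are always indexed together, so each sentence keeps its own label through both the split and the shuffle.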
3. Preparing the RNN Model
3.1) Create an LSTM_Net model

    # model.py
    import torch
    from torch import nn


    class LSTM_Net(nn.Module):
        # `embedding` here is the pretrained embedding matrix (vocab_size x embedding_dim)
        def __init__(self, embedding, embedding_dim, hidden_dim, num_layers, dropout=0.5, fix_embedding=True):
            super(LSTM_Net, self).__init__()
            # embedding layer, initialised from the pretrained matrix
            self.embedding = torch.nn.Embedding(embedding.size(0), embedding.size(1))
            self.embedding.weight = torch.nn.Parameter(embedding)
            # freeze the embedding if fix_embedding is True; otherwise it is trained with the model
            self.embedding.weight.requires_grad = not fix_embedding
            self.embedding_dim = embedding.size(1)
            self.hidden_dim = hidden_dim
            self.num_layers = num_layers
            self.dropout = dropout
            self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, batch_first=True)
            self.classifier = nn.Sequential(nn.Dropout(dropout),
                                            nn.Linear(hidden_dim, 1),
                                            nn.Sigmoid())

        def forward(self, inputs):
            inputs = self.embedding(inputs)
            x, _ = self.lstm(inputs, None)
            # x has shape (batch, seq_len, hidden_size)
            # take the top LSTM layer's output at the last time step
            x = x[:, -1, :]
            x = self.classifier(x)
            return x
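4. Training the Model
4.1) Train in model.train() mode and validate in model.eval() mode. The procedure is similar to the earlier image CNN assignment.
4.2) Keep track of best_acc during training, so that once all epochs have finished, the saved model is the one with the best validation accuracy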
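The bookkeeping behind best_acc can be sketched without any framework. This assumes the model outputs sigmoid probabilities that are thresholded at 0.5 to obtain a class (the 0.5 cut-off and both function names are assumptions), and it only records which epoch's checkpoint would be kept rather than saving anything.

```python
def binary_accuracy(probs, labels, threshold=0.5):
    """Fraction of samples whose thresholded sigmoid output matches the label."""
    preds = [1 if p >= threshold else 0 for p in probs]
    correct = sum(int(pred == label) for pred, label in zip(preds, labels))
    return correct / len(labels)


def track_best(val_accs):
    """Return (best_epoch, best_acc): the epoch whose checkpoint would be kept."""
    best_epoch, best_acc = -1, 0.0
    for epoch, acc in enumerate(val_accs):
        if acc > best_acc:
            # In a real training loop, this is where the model would be saved.
            best_epoch, best_acc = epoch, acc
    return best_epoch, best_acc


acc = binary_accuracy([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 0])
best_epoch, best_acc = track_best([0.70, 0.78, 0.75])
print(acc, best_epoch, best_acc)
```

Because the comparison happens after every epoch, a later epoch that overfits and scores lower does not overwrite the earlier, better checkpoint.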
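5. Predicting on the Test Data
5.1) Read the test data, and remember to apply the same embedding preprocessing
5.2) Feed the processed test data to the model to get the predictions, and save them to a csv file.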
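A minimal sketch of the final step, turning model outputs into a csv file. It assumes the outputs are sigmoid probabilities in test-set order, thresholded at 0.5, and that the file uses the columns `id` and `label`; the column names and the function name are assumptions, so check them against the assignment's submission format.

```python
import csv
import io


def write_predictions(probs, out_file, threshold=0.5):
    """Write one `id,label` row per prediction, with ids following test-set order."""
    writer = csv.writer(out_file)
    writer.writerow(["id", "label"])
    for idx, p in enumerate(probs):
        writer.writerow([idx, 1 if p >= threshold else 0])


# An in-memory buffer keeps the example self-contained; a real run would
# pass a file opened with open(path, "w", newline="").
buffer = io.StringIO()
write_predictions([0.91, 0.12, 0.55], buffer)
print(buffer.getvalue())
```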
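Addendum: Semi-Supervised Learning
Semi-supervised learning makes use of the unlabeled data. A relatively easy-to-implement method is used here: self-training.
Self-Training: use the trained model to predict on the unlabeled data, turn those predictions into labels for that data, and add the newly labeled data to the training set. You can adjust the threshold, or sample multiple times, to keep only the data the model is more confident about.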
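The selection step of self-training can be sketched as follows. This is an illustration under assumptions: the model is taken to output a probability of the positive class, a sample is kept only when that probability is at least `threshold` or at most `1 - threshold`, and the value 0.8 and the function name are hypothetical.

```python
def select_pseudo_labels(sentences, probs, threshold=0.8):
    """Keep only confidently predicted unlabeled samples and assign pseudo-labels."""
    selected_x, selected_y = [], []
    for sentence, p in zip(sentences, probs):
        if p >= threshold:            # confidently positive
            selected_x.append(sentence)
            selected_y.append(1)
        elif p <= 1 - threshold:      # confidently negative
            selected_x.append(sentence)
            selected_y.append(0)
        # anything in between is left out of this round
    return selected_x, selected_y


unlabeled = ["good luck", "so do i", "the app never works for me", "1 more day !"]
predicted = [0.95, 0.50, 0.10, 0.70]  # hypothetical model outputs
pseudo_x, pseudo_y = select_pseudo_labels(unlabeled, predicted, threshold=0.8)
print(pseudo_x, pseudo_y)
```

Raising the threshold adds fewer but cleaner pseudo-labels; lowering it adds more data at the cost of more label noise. The selected pairs are then appended to the labeled training set before retraining.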