[Newcomer Competition] Alibaba Cloud Malware Detection -- Practice Log 11.24 - word2vec model + xgboost


Training word vectors with word2vec

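Word2vec is used to train word vectors in an unsupervised way. The input is the training data together with the test data, and the output is one vector per word; the vocabulary contains roughly three hundred words in total.

Sum: the word vectors of all words in a row are then added together, which gives a single vector representation for that row.

The row vector can also be obtained in other ways, for example by taking the mean, the mode, or the maximum of the word vectors.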
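The pooling options above can be compared on a toy example. The following is a minimal sketch, assuming a hypothetical `toy_vectors` lookup with 3-dimensional vectors in place of the trained model; the helper `pool_row` and the token names are illustrative only and are not part of the original code.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for a trained word2vec model.
toy_vectors = {
    "LdrLoadDll": np.array([1.0, 0.0, 2.0]),
    "RegOpenKeyExW": np.array([0.0, 4.0, 2.0]),
}

def pool_row(text, vectors, dim, how="sum"):
    """Combine the vectors of all known words in a space-separated row."""
    found = [vectors[w] for w in text.split(' ') if w in vectors]
    if not found:
        # No known word in the row: fall back to a zero vector.
        return np.zeros(dim)
    stacked = np.vstack(found)
    if how == "sum":
        return stacked.sum(axis=0)
    if how == "mean":
        return stacked.mean(axis=0)
    if how == "max":
        return stacked.max(axis=0)
    raise ValueError("unknown pooling method: " + how)

# "UnknownCall" is not in the lookup and is skipped.
row = "LdrLoadDll RegOpenKeyExW LdrLoadDll UnknownCall"
sum_vec = pool_row(row, toy_vectors, 3, "sum")
mean_vec = pool_row(row, toy_vectors, 3, "mean")
max_vec = pool_row(row, toy_vectors, 3, "max")
```

Sum pooling grows with the number of words in a row, while mean pooling does not depend on row length, which may matter when rows differ greatly in length.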

The code is as follows:

import time
import csv
import pickle
import numpy as np
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models.word2vec import Word2Vec
import warnings

warnings.filterwarnings('ignore')  # suppress warnings
with open("security_train.csv.pkl", "rb") as f:
    labels = pickle.load(f)
    files = pickle.load(f)

with open("security_test.csv.pkl", "rb") as f:
    file_names = pickle.load(f)
    outfiles = pickle.load(f)

Function for training the word-vector model:

def train_w2v_model(files, size, model, flag):
  # Feed the files to the model in batches of `size` to limit memory use
  for batch in range(int(len(files)/size) + 1):
    print("batch:", batch)
    # Slicing clips at len(files), so the final partial batch is included
    sentences = [f.split(' ') for f in files[batch*size:(batch+1)*size]]
    if not sentences:  # empty when len(files) is a multiple of size
      continue

    # Build the vocabulary on the very first batch, extend it afterwards
    if batch == 0 and flag:
      model.build_vocab(sentences)
    else:
      model.build_vocab(sentences, update=True)

    model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)

  print("done.")
  return model
# Train the word vectors
model = Word2Vec()
model = train_w2v_model(files, 1000, model, True)
model = train_w2v_model(outfiles, 1000, model, False)
model.save('./temp/w2cmodel_train_test')
# model = Word2Vec.load('./temp/w2cmodel0')
print(model)
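The batch boundaries used when feeding files to the model can be checked in isolation. Below is a minimal sketch with a hypothetical helper `iter_batches` (not part of the original code) that yields consecutive slices of at most `size` items, including the final partial batch.

```python
def iter_batches(items, size):
    """Yield consecutive slices of at most `size` items; empty slices are never yielded."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Toy corpus of 5 space-separated "files", split into batches of 2.
toy_files = ["a b", "c d", "e f", "g h", "i j"]
batches = [[f.split(' ') for f in batch] for batch in iter_batches(toy_files, 2)]
batch_sizes = [len(b) for b in batches]
```

Stepping `range` by `size` avoids computing the number of batches by hand, so no special case is needed for the last batch.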

Function for summing the word vectors of each row:

def train_sum_vec(files, model, size=100):
  rtvec = []
  for i in range(len(files)):
    if i % 100 == 0: 
      print(i)
    text = files[i].split(' ')
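    # Sum the word vectors of every word in the row
    vec = np.zeros(size).reshape((1, size))
    for word in text:
      try:
        # Look words up through model.wv; words missing from the vocabulary raise KeyError
        vec += model.wv[word].reshape((1, size))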
      except KeyError:
        continue
    rtvec.append(vec)
  
  train_vec = np.concatenate(rtvec)
  return train_vec

Computing the word vectors of the training data:

# Compute the row vectors as an ndarray
train_vec = train_sum_vec(files, model)
# Save the Word2Vec model and the row vectors
model.save('w2v_model.pkl')
np.save('X_train_test_vec.npy', train_vec)
print('done.')

Computing the word vectors of the test data:

test_vec = train_sum_vec(outfiles, model)
np.save('y_test_vec.npy', test_vec)
print('done.')

xgboost training:

meta_train = np.zeros(shape=(len(files), 8))
meta_test = np.zeros(shape=(len(outfiles), 8))

k = 10
skf = StratifiedKFold(n_splits=k, random_state=42, shuffle=True)
labels = np.asarray(labels)  # the index arrays from skf.split need an ndarray, not a list
X_vector = np.load('X_train_test_vec.npy')
y_vector = np.load('y_test_vec.npy')
for i, (tr_ind, te_ind) in enumerate(skf.split(X_vector, labels)):
    X_train, X_train_label = X_vector[tr_ind], labels[tr_ind]
    X_val, X_val_label = X_vector[te_ind], labels[te_ind]

    print('FOLD: {}'.format(str(i)))
    print(len(tr_ind), len(te_ind))
    
    dtrain = xgb.DMatrix(X_train, label=X_train_label)
    dtest = xgb.DMatrix(X_val, X_val_label)
    dout = xgb.DMatrix(y_vector)
    
    param = {'max_depth': 6, 'eta': 0.1, 'eval_metric': 'mlogloss', 'silent': 1, 'objective': 'multi:softprob',
             'num_class': 8, 'subsample': 0.8, 'colsample_bytree': 0.85}

    evallist = [(dtrain, 'train'), (dtest, 'val')]  # evaluation sets; early stopping watches the last one
    num_round = 300  # maximum number of boosting rounds
    bst = xgb.train(param, dtrain, num_round, evallist, early_stopping_rounds=50)

    # dtr = xgb.DMatrix(train_features)
    pred_val = bst.predict(dtest)
    pred_test = bst.predict(dout)
    meta_train[te_ind] = pred_val
    meta_test += pred_test
    
meta_test /= k  # average the test predictions over the k folds
with open("word2vec_result_{}.pkl".format(
        str(time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime()))),
        'wb') as f:
    pickle.dump(meta_train, f)
    pickle.dump(meta_test, f)
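The loop above builds out-of-fold predictions: each training row is predicted by a model that did not see it during fitting, while the test predictions are averaged over the folds. The following is a minimal sketch of that bookkeeping only. It assumes a hypothetical stand-in model `fit_predict_prior` (it just predicts the class frequencies of the training fold) in place of xgboost, and a simple interleaved fold assignment in place of StratifiedKFold.

```python
import numpy as np

def fit_predict_prior(y_train, n_rows, n_class):
    """Stand-in model (assumption): predict the training fold's class frequencies for every row."""
    prior = np.bincount(y_train, minlength=n_class) / len(y_train)
    return np.tile(prior, (n_rows, 1))

n_class, k = 2, 3
toy_labels = np.array([0, 1, 0, 1, 0, 1])
n_test = 4

oof_train = np.zeros((len(toy_labels), n_class))  # plays the role of meta_train
avg_test = np.zeros((n_test, n_class))            # plays the role of meta_test

for fold in range(k):
    # Rows fold, fold + k, ... form the validation fold; the rest is used for fitting.
    val_ind = np.arange(fold, len(toy_labels), k)
    tr_ind = np.setdiff1d(np.arange(len(toy_labels)), val_ind)
    oof_train[val_ind] = fit_predict_prior(toy_labels[tr_ind], len(val_ind), n_class)
    avg_test += fit_predict_prior(toy_labels[tr_ind], n_test, n_class)

avg_test /= k  # average over the number of folds
```

Because every row belongs to exactly one validation fold, `oof_train` ends up fully populated and can serve as input features for a second-level model.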
result = meta_test
out = []

for i in range(len(file_names)):
    tmp = []
    a = result[i].tolist()
    tmp.append(file_names[i])
    tmp.extend(a)
    out.append(tmp)
    
with open("word2vec_10k_{}.csv".format(
        str(time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime()))),
        "w",
        newline='') as csvfile:
    writer = csv.writer(csvfile)

    # Write the header row first
    writer.writerow(["file_id", "prob0", "prob1", "prob2", "prob3", "prob4", "prob5", "prob6", "prob7"])
    # Use writerows to write all data rows at once
    writer.writerows(out)
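Before uploading, it can be worth checking that a submission file has the expected header and that the probabilities in each row sum to roughly 1. Below is a minimal sketch that works on an in-memory CSV; the helper `check_submission`, the tolerance, and the sample rows are hypothetical, and the header matches the one written above.

```python
import csv
import io

EXPECTED_HEADER = ["file_id"] + ["prob{}".format(i) for i in range(8)]

def check_submission(csvfile, tol=1e-3):
    """Return the number of data rows; raise ValueError on a malformed file."""
    reader = csv.reader(csvfile)
    header = next(reader)
    if header != EXPECTED_HEADER:
        raise ValueError("unexpected header: {}".format(header))
    n_rows = 0
    for row in reader:
        probs = [float(p) for p in row[1:]]
        if len(probs) != 8 or abs(sum(probs) - 1.0) > tol:
            raise ValueError("bad probabilities for file_id {}".format(row[0]))
        n_rows += 1
    return n_rows

# Hypothetical two-row submission held in memory instead of on disk.
sample = io.StringIO()
writer = csv.writer(sample)
writer.writerow(EXPECTED_HEADER)
writer.writerows([[1, 0.3, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
                  [2, 1.0, 0, 0, 0, 0, 0, 0, 0]])
sample.seek(0)
n_checked = check_submission(sample)
```

For a file on disk, the same function can be called on a handle opened with `open(path, newline='')`.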

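Submitted online, this approach (summed word vectors) scored 0.725923.

Using the mean of the word vectors instead, the online score was 0.751533.

After data augmentation, the score was 0.711533.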

