Using torchtext
Text preprocessing pipeline:
- File loading
- Tokenization
- Vocab
- Numericalize/Indexify: map tokens to indices
- Word vectors
- Batching
The torchtext processing pipeline
- torchtext.data.Field defines the processing pipeline for one field of a sample;
- torchtext.data.Dataset loads the corpus
  - inside a Dataset, torchtext wraps the corpus into individual torchtext.data.Example objects;
  - Field.preprocess is called whenever a torchtext.data.Example is created
- Field.build_vocab builds the vocabulary that maps strings to indices; this covers string token -> index (stoi), index -> string token (itos), and string token -> word vector;
- torchtext.data.Iterator batches the processed data;
  - groups the Dataset examples into batches;
  - pads, so that every Example in a batch has the same length;
  - this is where string tokens are converted to indices;
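A minimal round-trip sketch of these steps on a toy field (names and data are illustrative):

from torchtext import data

TEXT = data.Field(tokenize=lambda s: s.split())
tokens = TEXT.preprocess("the quick brown fox")        # Field.preprocess: string -> token list
TEXT.build_vocab([tokens])                             # build the stoi/itos tables
idx = TEXT.vocab.stoi["fox"]                           # string token -> index
tok = TEXT.vocab.itos[idx]                             # index -> string token ("fox")
padded = TEXT.pad([tokens, TEXT.preprocess("a fox")])  # pad a batch to equal length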
Example
- First, create the Fields
from torchtext import data, datasets

# The BOS/EOS/PAD strings are assumptions (the original leaves them undefined)
BOS, EOS, PAD = '<s>', '</s>', '<blank>'
SRC = data.Field(tokenize=tokenize_en, pad_token=PAD)
TGT = data.Field(tokenize=tokenize_en, init_token=BOS, eos_token=EOS, pad_token=PAD)
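tokenize_en is not defined in the original; a common choice is a thin wrapper around spaCy:

import spacy

spacy_en = spacy.load("en_core_web_sm")

def tokenize_en(text):
    # split an English sentence into a list of token strings
    return [tok.text for tok in spacy_en.tokenizer(text)]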
- Load the corpus through a Dataset
MAX_LEN = 100  # assumed length cutoff; the original leaves it undefined
train, val, test = datasets.TranslationDataset.splits(
    path="./dataset/",
    train="train",
    validation="valid",
    test="test",
    exts=(".src", ".tgt"),
    fields=(SRC, TGT),
    filter_pred=lambda x: len(vars(x)['src']) < MAX_LEN and len(vars(x)['trg']) < MAX_LEN
)
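Given the path, split names, and exts above, TranslationDataset reads line-aligned plain-text files:

# ./dataset/train.src   ./dataset/train.tgt
# ./dataset/valid.src   ./dataset/valid.tgt
# ./dataset/test.src    ./dataset/test.tgt
# One sentence per line; line i of a .src file pairs with line i of the .tgt file.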
- Build the vocabularies
MIN_FREQ = 2  # assumed frequency cutoff; the original leaves it undefined
SRC.build_vocab(train.src, min_freq=MIN_FREQ)
TGT.build_vocab(train.trg, min_freq=MIN_FREQ)
# Pretrained word vectors can be loaded here, e.g.:
# SRC.build_vocab(train, vectors="glove.6B.100d")
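When pretrained vectors are loaded, words missing from the pretrained vocabulary get zero vectors by default; unk_init can randomize them instead. A sketch (the "glove.6B.100d" alias downloads GloVe on first use):

import torch

SRC.build_vocab(train, min_freq=MIN_FREQ,
                vectors="glove.6B.100d",
                unk_init=torch.Tensor.normal_)  # random-normal init for OOV words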
- Batch the data (MyIterator and batch_size_fn are user-defined helpers; a dynamic-batching sketch is given under the questions below)
train_iter = MyIterator(train, batch_size=BATCH_SIZE, device=0, repeat=False,
                        sort_key=lambda x: (len(x.src), len(x.trg)),
                        batch_size_fn=batch_size_fn, train=True)
# The built-in Iterator works as well:
# train_iter, val_iter, test_iter = data.Iterator.splits(
#     (train, val, test), sort_key=lambda x: len(x.src),
#     batch_sizes=(32, 256, 256), device=-1)
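A quick usage sketch: each batch exposes one attribute per field, already padded and numericalized:

for batch in train_iter:
    src, trg = batch.src, batch.trg  # LongTensors, shape (seq_len, batch_size)
    break                            # just peek at the first batch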
- Use the vocabulary and pretrained vectors in a model
vocab = SRC.vocab
self.embed = nn.Embedding(len(vocab), emb_dim)   # inside a model's __init__
self.embed.weight.data.copy_(vocab.vectors)      # requires build_vocab(..., vectors=...)
Questions
- What is the difference between dynamic and static batching in MyIterator? (see the sketch below)
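With static batching, batch_size counts sentences, so a batch of long sentences costs far more memory than a batch of short ones. With dynamic batching, batch_size is a token budget and batch_size_fn reports the padded token count of the batch built so far; a sketch in the style of the Annotated Transformer (the +2 accounts for the BOS/EOS tokens added to the target):

max_src_in_batch, max_tgt_in_batch = 0, 0

def batch_size_fn(new, count, sofar):
    # return the padded token count if example `new` joins the current batch
    global max_src_in_batch, max_tgt_in_batch
    if count == 1:  # first example of a fresh batch: reset the running maxima
        max_src_in_batch, max_tgt_in_batch = 0, 0
    max_src_in_batch = max(max_src_in_batch, len(new.src))
    max_tgt_in_batch = max(max_tgt_in_batch, len(new.trg) + 2)
    return max(count * max_src_in_batch, count * max_tgt_in_batch)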
- How to share one vocabulary between source and target: build it over both sides, then alias it
SRC.build_vocab(train.src, train.trg, min_freq=MIN_FREQ)
TGT.vocab = SRC.vocab
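Both fields now index tokens through the same table, so a single embedding matrix can serve both encoder and decoder. A quick sanity check:

assert TGT.vocab is SRC.vocab                      # same Vocab object
assert TGT.vocab.stoi[PAD] == SRC.vocab.stoi[PAD]  # identical token -> index mapping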