Torchtext Tutorial: Text Data Processing


Torchtext

A text data preprocessing tool.

Doc | Code

Field

Defines how the data is processed, converting raw data into Tensors.

Using Field

from torchtext import data

# a simple whitespace tokenizer
tokenize = lambda x: x.split()

TEXT = data.Field(sequential=True, tokenize=tokenize, lower=True, fix_length=200)
LABEL = data.Field(sequential=False, use_vocab=False)

Field Parameters

Parameter           Description
sequential          Whether the data represents sequential text; if False, no tokenization is applied. Default: True.
use_vocab           Whether to use a Vocab object; if False, the data in this field must already be numerical. Default: True.
init_token          A token that will be prepended to every example using this field, or None for no initial token. Default: None.
eos_token           A token that will be appended to every example using this field, or None for no end-of-sentence token. Default: None.
fix_length          Pad/truncate every sequence to this fixed length, e.g. 100. Default: None (flexible lengths).
dtype               The torch.dtype class that represents a batch of examples of this kind of data. Default: torch.long.
preprocessing       The Pipeline that will be applied to examples using this field after tokenizing but before numericalizing. Many Datasets replace this attribute with a custom preprocessor. Default: None.
postprocessing      A Pipeline that will be applied to examples using this field after numericalizing but before the numbers are turned into a Tensor. The pipeline function takes the batch as a list and the field's Vocab. Default: None.
lower               Whether to lowercase the text. Default: False.
tokenize            The function used to tokenize raw strings, e.g. tokenize = lambda x: x.split(). Default: string.split.
tokenizer_language  The language of the tokenizer to be constructed. Various languages are currently supported only in SpaCy.
include_lengths     Whether to return a tuple of a padded minibatch and a list containing the length of each example, or just a padded minibatch. Default: False.
batch_first         Whether to produce tensors with the batch dimension first. Default: False.
pad_token           The string token used as padding. Default: "<pad>".
unk_token           The string token used to represent OOV words. Default: "<unk>".
pad_first           Do the padding of the sequence at the beginning. Default: False.
truncate_first      Do the truncating of the sequence at the beginning. Default: False.
stop_words          Tokens to discard during the preprocessing step. Default: None.
is_target           Whether this field is a target variable; affects iteration over batches. Default: False.
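
To make a few of these parameters concrete, here is a minimal sketch (the example sentences are made up for illustration) showing how init_token, eos_token, include_lengths, and lower affect preprocessing and padding:

from torchtext import data

TEXT = data.Field(
    sequential=True,
    tokenize=lambda x: x.split(),
    init_token='<sos>',    # prepended to every example
    eos_token='<eos>',     # appended to every example
    include_lengths=True,  # pad() also returns the true length of each example
    lower=True,
)

sentences = ["Hello World", "a slightly longer example sentence"]
# preprocess applies tokenize and lower
processed = [TEXT.preprocess(s) for s in sentences]
# pad wraps each example in <sos> ... <eos> and pads to the longest length
padded, lengths = TEXT.pad(processed)
print(padded)   # [['<sos>', 'hello', 'world', '<eos>', '<pad>', ...], ...]
print(lengths)  # non-pad token counts, including <sos>/<eos>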

Dataset

Uses Fields to define how the data is organized, producing a dataset.

Using Dataset

A custom Dataset class:

from torchtext import data
import pandas as pd
import numpy as np
import random
from tqdm import tqdm

class MyDataset(data.Dataset):
    def __init__(self, csv_path, text_field, label_field, test=False, aug=False, **kwargs):

        csv_data = pd.read_csv(csv_path)

        # how each column of the raw data is processed;
        # columns mapped to None are ignored
        fields = [("id", None), ("text", text_field), ("label", label_field)]

        examples = []
        if test:
            # for the test set, do not load labels
            for text in tqdm(csv_data['text']):
                examples.append(data.Example.fromlist([None, text, None], fields))
        else:
            for text, label in tqdm(zip(csv_data['text'], csv_data['label'])):
                # data augmentation
                if aug:
                    rate = random.random()
                    if rate > 0.5:
                        text = self.dropout(text)
                    else:
                        text = self.shuffle(text)
                examples.append(data.Example.fromlist([None, text, label], fields))

        # the code above is preprocessing; calling the parent constructor
        # produces a standard Dataset
        # super(MyDataset, self).__init__(examples, fields, **kwargs)
        super(MyDataset, self).__init__(examples, fields)

    def shuffle(self, text):
        # randomly permute the token order
        text = np.random.permutation(text.strip().split())
        return ' '.join(text)

    def dropout(self, text, p=0.5):
        # randomly blank out a fraction p of the tokens
        text = text.strip().split()
        len_ = len(text)
        indexes = np.random.choice(len_, int(len_ * p))
        for i in indexes:
            text[i] = ''
        return ' '.join(text)
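
Assuming CSV files with id, text, and label columns (as the fields list above implies), the class can be used roughly like this ('train.csv' and 'test.csv' are placeholder paths):

tokenize = lambda x: x.split()
TEXT = data.Field(sequential=True, tokenize=tokenize, lower=True, fix_length=200)
LABEL = data.Field(sequential=False, use_vocab=False)

train = MyDataset('train.csv', text_field=TEXT, label_field=LABEL, test=False, aug=True)
test = MyDataset('test.csv', text_field=TEXT, label_field=None, test=True)

print(len(train))      # number of examples
print(vars(train[0]))  # e.g. {'text': ['some', 'tokens', ...], 'label': 0}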

Iterator

Iterators: Iterator / BucketIterator

Iterator

Builds batches while keeping the samples in their original order.

BucketIterator

Automatically groups samples of similar length into batches, minimizing the amount of padding required.

from torchtext import data

def data_iter(train_path, valid_path, test_path, TEXT, LABEL):
    train = MyDataset(train_path, text_field=TEXT, label_field=LABEL, test=False, aug=True)
    valid = MyDataset(valid_path, text_field=TEXT, label_field=LABEL, test=False, aug=True)
    test = MyDataset(test_path, text_field=TEXT, label_field=None, test=True)
    # build the vocabulary from the training set
    # TEXT = data.Field(sequential=True, tokenize=tokenize, lower=True, fix_length=200)
    TEXT.build_vocab(train)
    # None unless vectors= was passed to build_vocab; see the Word Embedding section
    weight_matrix = TEXT.vocab.vectors
    # building an iterator for the training set only:
    # train_iter = data.BucketIterator(dataset=train, batch_size=8, shuffle=True, sort_within_batch=False, repeat=False)

    # building iterators for the training and validation sets at the same time
    train_iter, val_iter = data.BucketIterator.splits(
            (train, valid),
            batch_sizes=(8, 8),
            # if using a GPU, replace -1 with the GPU device id
            device=-1,
            # the key used to sort examples when bucketing
            sort_key=lambda x: len(x.text),
            sort_within_batch=False,
            repeat=False
    )
    test_iter = data.Iterator(test, batch_size=8, device=-1, sort=False, sort_within_batch=False, repeat=False)
    return train_iter, val_iter, test_iter, weight_matrix
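
Once the iterators are built, each batch exposes the fields by name as attributes. A minimal sketch of consuming them (the file paths are placeholders):

train_iter, val_iter, test_iter, weight_matrix = data_iter(
    'train.csv', 'valid.csv', 'test.csv', TEXT, LABEL)

for batch in train_iter:
    # with batch_first=False (the default), batch.text has shape (fix_length, batch_size)
    text, label = batch.text, batch.label
    # ... forward pass, loss, backward ...
    break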

Word Embedding

When using a neural network framework such as PyTorch or TensorFlow for NLP tasks, word vectors can be handled through the corresponding Embedding layer. Using pre-trained word vectors generally yields better performance. Below is how to use pre-trained word vectors in torchtext and pass them on to a neural network model for training.

Pre-trained word vectors supported by torchtext out of the box

The corresponding pre-trained vector files are automatically downloaded into the .vector_cache directory under the current folder; .vector_cache is the default directory for word vector files and cache files.

from torchtext.vocab import GloVe
from torchtext import data

TEXT = data.Field(sequential=True)
# the following two ways of specifying the pre-trained vectors are equivalent
# TEXT.build_vocab(train, vectors="glove.6B.300d")
TEXT.build_vocab(train, vectors=GloVe(name='6B', dim=300))
# in this case glove.6B.zip is downloaded by default and unpacked into
# glove.6B.50d.txt, glove.6B.100d.txt, glove.6B.200d.txt and glove.6B.300d.txt

External pre-trained word vectors

Specify the pre-trained vector file via the name parameter and the directory containing it via the cache parameter.

from torchtext.vocab import Vectors

cache = '.vector_cache'
vectors = Vectors(name='myvector/glove/glove.6B.200d.txt', cache=cache)
TEXT.build_vocab(train, vectors=vectors)
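
By default, words that appear in the training data but not in the pre-trained file are assigned zero vectors. If random initialization is preferred, the vocabulary's unk_init hook can be used; a sketch, assuming the same vector file as above:

import torch
from torchtext.vocab import Vectors

vectors = Vectors(name='myvector/glove/glove.6B.200d.txt', cache='.vector_cache')
# unk_init is called on the vector of every word missing from the file;
# here they are drawn from a normal distribution instead of left as zeros
TEXT.build_vocab(train, vectors=vectors, unk_init=torch.Tensor.normal_)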

Setting the Embedding layer weights in the model

import torch.nn as nn

# the Embedding layer created by PyTorch; input_dim should be the vocabulary
# size (len(TEXT.vocab)) and hidden_dim the word vector dimension
embedding = nn.Embedding(input_dim, hidden_dim)
# the pre-trained weights live in the vocab's vectors attribute
weight_matrix = TEXT.vocab.vectors
# initialize the embedding matrix with the pre-trained weights
embedding.weight.data.copy_(weight_matrix)
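
Alternatively, PyTorch provides nn.Embedding.from_pretrained, which creates the layer and copies the weights in one step:

import torch.nn as nn

# freeze=True (the default) makes the embeddings non-trainable,
# equivalent to setting embedding.weight.requires_grad = False
embedding = nn.Embedding.from_pretrained(TEXT.vocab.vectors, freeze=False)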

