Deep Learning with CNTK (Part 2): Training an RNN-based Language Model


The previous post, Deep Learning with CNTK (Part 1): Getting Started, introduced building a simple feed-forward neural network with CNTK. Assuming you already know the basics of using CNTK, we will now build something slightly more involved, and one of the hottest models in natural language mining: a language model based on a recurrent neural network.

A recurrent neural network (RNN), drawn graphically, is a network whose hidden layer connects back to itself (this is of course only one kind of RNN):

Unlike an ordinary neural network, an RNN does not assume that samples are independent of each other. For example, to predict which character follows "上", the characters that appeared before it matter a lot: if "工作" (work) appeared earlier, the text is probably saying "上班" (go to work); if "家鄉" (hometown) appeared earlier, it is probably "上海" (Shanghai). An RNN can learn such temporal patterns well. Put simply, an RNN treats the hidden-layer values from the previous time step as another set of features and feeds them in as part of the next time step's input.
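To make this concrete, here is a minimal numpy sketch of a single recurrent step (the names W_feat2hid and W_hid2hid are mine, chosen to mirror the WFeat2Hid and WHid2Hid parameters in the NDL definition later in this post):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(x_t, h_prev, W_feat2hid, W_hid2hid):
    # the previous hidden state h_prev is fed in alongside the current input x_t
    return sigmoid(W_feat2hid @ x_t + W_hid2hid @ h_prev)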

The language model we build here is: given a word, predict which word is likely to appear next.

The input to this RNN is dim-dimensional, where dim equals the vocabulary size. The input vector is 1 only at the component corresponding to the current word and 0 everywhere else, i.e. [0,0,0,...,0,1,0,...,0]. The output is also a dim-dimensional vector, giving the probability of each word appearing next.
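For illustration only, here is one way to build such a one-hot input vector (a sketch, not CNTK code):

import numpy as np

def one_hot(word_index, vocab_size=3000):
    # 1 at the component for this word, 0 everywhere else
    x = np.zeros(vocab_size)
    x[word_index] = 1.0
    return x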

Building an RNN model in CNTK differs from building an ordinary neural network in two main ways:

(1) Input format. The input is text split into sentences, and the words within a sentence are ordered, so the input must be specified in the LMSequenceReader format. This format is quite cumbersome (one more gripe: I don't fully understand it myself, so I won't explain it in detail; you can work it out from the format, and there is a small made-up sample right after this list).

(2) Model: a recurrent model is needed, which mainly means using the Delay() function.
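As far as I can tell, the training file is plain tokenized text with one sentence of space-separated words per line; I have not verified whether the </s> sentence-boundary token referenced by beginSequence/endSequence in the config below has to appear in the file itself. A made-up sample (not my actual review data) might look like:

the room was clean and the staff were friendly
breakfast was good but a little expensive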

 

Working code is given below (I was tripped up by the official tutorial for quite a while again; the code is adapted from CNTK-2016-02-08-Windows-64bit-CPU-Only\cntk\Examples\Text\PennTreebank\Config):

# Parameters can be overwritten on the command line
# for example: cntk configFile=myConfigFile RootDir=../.. 
# For running from Visual Studio add
# currentDirectory=$(SolutionDir)/<path to corresponding data folder> 
RootDir = ".."

ConfigDir = "$RootDir$/Config"
DataDir = "$RootDir$/Data"
OutputDir = "$RootDir$/Output"
ModelDir = "$OutputDir$/Models"

# deviceId=-1 for CPU, >=0 for GPU devices, "auto" chooses the best GPU, or CPU if no usable GPU is available
deviceId = "-1"

command = writeWordAndClassInfo:train
#command = write

precision = "float"
traceLevel = 1
modelPath = "$ModelDir$/rnn.dnn"

# the following line writes log output to a file; comment it out to log to the console instead
stderr=$OutputDir$/rnnOutput

type = double
numCPUThreads = 4

confVocabSize = 3000
confClassSize = 50

#trainFile = "ptb.train.txt"
trainFile = "review_tokens_split_first5w_lines.txt"
#validFile = "ptb.valid.txt"
testFile = "review_tokens_split_first10_lines.txt"

writeWordAndClassInfo = [
    action = "writeWordAndClass"
    inputFile = "$DataDir$/$trainFile$"
    outputVocabFile = "$ModelDir$/vocab.txt"
    outputWord2Cls = "$ModelDir$/word2cls.txt"
    outputCls2Index = "$ModelDir$/cls2idx.txt"
    vocabSize = "$confVocabSize$"
    nbrClass = "$confClassSize$"
    cutoff = 1
    printValues = true
]

#######################################
#  TRAINING CONFIG                    #
#######################################

train = [
    action = "train"
    minibatchSize = 10
    traceLevel = 1
    epochSize = 0
    recurrentLayer = 1
    defaultHiddenActivity = 0.1
    useValidation = true
    rnnType = "CLASSLM"

     # NDL network definition for the RNN LM (the rnnlm.ndl file is listed below);
     # to use the built-in recurrent model instead, replace this with a SimpleNetworkBuilder section
     NDLNetworkBuilder=[
        networkDescription="D:\tools\Deep Learning\CNTK-2016-02-08-Windows-64bit-CPU-Only\cntk\Examples\Text\PennTreebank\AdditionalFiles\RNNLM\rnnlm.ndl"
     ]
  

    SGD = [
        learningRatesPerSample = 0.1
        momentumPerMB = 0
        gradientClippingWithTruncation = true
        clippingThresholdPerSample = 15.0
        maxEpochs = 6
        unroll = false
        numMBsToShowResult = 100
        gradUpdateType = "none"
        loadBestModel = true

        # settings for Auto Adjust Learning Rate
        AutoAdjust = [
            autoAdjustLR = "adjustAfterEpoch"
            reduceLearnRateIfImproveLessThan = 0.001
            continueReduce = false
            increaseLearnRateIfImproveMoreThan = 1000000000
            learnRateDecreaseFactor = 0.5
            learnRateIncreaseFactor = 1.382
            numMiniBatch4LRSearch = 100
            numPrevLearnRates = 5
            numBestSearchEpoch = 1
        ]

        dropoutRate = 0.0
    ]

    reader = [
        readerType = "LMSequenceReader"
        randomize = "none"
        nbruttsineachrecurrentiter = 16

        # word class info
        wordclass = "$ModelDir$/vocab.txt"

        # if writerType is set, we will cache to a binary file
        # if the binary file exists, we will use it instead of parsing this file
        # writerType=BinaryReader

        # write definition
        wfile = "$OutputDir$/sequenceSentence.bin"
        
        # wsize - initial size of the file in MB
        # if calculated size would be bigger, that is used instead
        wsize = 256

        # wrecords - number of records we should allocate space for in the file
        # files cannot be expanded, so this should be large enough. If known modify this element in config before creating file
        wrecords = 1000
        
        # windowSize - number of records we should include in BinaryWriter window
        windowSize = "$confVocabSize$"

        file = "$DataDir$/$trainFile$"

        # additional features sections
        # for now store as expanded category data (including label in)
        features = [
            # sentence has no features, so need to set dimension to zero
            dim = 0
            # write definition
            sectionType = "data"
        ]
      
        # sequence break table, list indexes into sequence records, so we know when a sequence starts/stops
        sequence = [
            dim = 1
            wrecords = 2
            # write definition
            sectionType = "data"
        ]
        
        #labels sections
        labelIn = [
            dim = 1
            labelType = "Category"
            beginSequence = "</s>"
            endSequence = "</s>"

            # vocabulary size
            labelDim = "$confVocabSize$"
            labelMappingFile = "$OutputDir$/sentenceLabels.txt"
            
            # Write definition
            # sizeof(unsigned) which is the label index type
            elementSize = 4
            sectionType = "labels"
            mapping = [
                # redefine number of records for this section, since we don't need to save it for each data record
                wrecords = 11                
                # variable size so use an average string size
                elementSize = 10
                sectionType = "labelMapping"
            ]
            
            category = [
                dim = 11
                # elementSize = sizeof(ElemType) is default
                sectionType = "categoryLabels"
            ]
        ]
        
        # labels sections
        labels = [
            dim = 1
            labelType = "NextWord"
            beginSequence = "O"
            endSequence = "O"

            # vocabulary size
            labelDim = "$confVocabSize$"
            labelMappingFile = "$OutputDir$/sentenceLabels.out.txt"
            
            # Write definition 
            # sizeof(unsigned) which is the label index type
            elementSize = 4
            sectionType = "labels"
            mapping = [
                # redefine number of records for this section, since we don't need to save it for each data record
                wrecords = 3
                # variable size so use an average string size
                elementSize = 10
                sectionType = "labelMapping"
            ]
            
            category = [
                dim = 3
                # elementSize = sizeof(ElemType) is default
                sectionType = "categoryLabels"
            ]
        ]
    ] 
]



write = [
    action = "write"

    outputPath = "$OutputDir$/Write"
    #outputPath = "-"                    # "-" will write to stdout; useful for debugging
    outputNodeNames = "Out,WFeat2Hid,WHid2Hid,WHid2Word" # when processing one sentence per minibatch, this is the sentence posterior
    #format = [
        #sequencePrologue = "log P(W)="    # (using this to demonstrate some formatting strings)
        #type = "real"
    #]

    minibatchSize = 1              # choose this to be big enough for the longest sentence
    traceLevel = 1
    epochSize = 0

    reader = [
        # reader to use
        readerType = "LMSequenceReader"
        randomize = "none"              # BUGBUG: This is ignored.
        nbruttsineachrecurrentiter = 1  # one sentence per minibatch
        cacheBlockSize = 1              # workaround to disable randomization

        # word class info
        wordclass = "$ModelDir$/vocab.txt"

        # if writerType is set, we will cache to a binary file
        # if the binary file exists, we will use it instead of parsing this file
        # writerType = "BinaryReader"

        # write definition
        wfile = "$OutputDir$/sequenceSentence.bin"
        # wsize - initial size of the file in MB
        # if calculated size would be bigger, that is used instead
        wsize = 256

        # wrecords - number of records we should allocate space for in the file
        # files cannot be expanded, so this should be large enough. If known modify this element in config before creating file
        wrecords = 1000
        
        # windowSize - number of records we should include in BinaryWriter window
        windowSize = "$confVocabSize$"

        file = "$DataDir$/$testFile$"

        # additional features sections
        # for now store as expanded category data (including label in)
        features = [
            # sentence has no features, so need to set dimension to zero
            dim = 0
            # write definition
            sectionType = "data"
        ]
        
        #labels sections
        labelIn = [
            dim = 1

            # vocabulary size
            labelDim = "$confVocabSize$"
            labelMappingFile = "$OutputDir$/sentenceLabels.txt"
            
            labelType = "Category"
            beginSequence = "</s>"
            endSequence = "</s>"

            # Write definition
            # sizeof(unsigned) which is the label index type
            elementSize = 4
            sectionType = "labels"
            
            mapping = [
                # redefine number of records for this section, since we don't need to save it for each data record
                wrecords = 11
                # variable size so use an average string size
                elementSize = 10
                sectionType = "labelMapping"
            ]
            
            category = [
                dim = 11
                # elementSize = sizeof(ElemType) is default
                sectionType = "categoryLabels"
            ]
        ]
        
        #labels sections
        labels = [
            dim = 1
            labelType = "NextWord"
            beginSequence = "O"
            endSequence = "O"

            # vocabulary size
            labelDim = "$confVocabSize$"

            labelMappingFile = "$OutputDir$/sentenceLabels.out.txt"
            # Write definition
            # sizeof(unsigned) which is the label index type
            elementSize = 4
            sectionType = "labels"
            
            mapping = [
                # redefine number of records for this section, since we don't need to save it for each data record
                wrecords = 3
                # variable size so use an average string size
                elementSize = 10
                sectionType = "labelMapping"
            ]
            
            category = [
                dim = 3
                # elementSize = sizeof(ElemType) is default
                sectionType = "categoryLabels"
            ]
        ]
    ]
]    

rnnlm.ndl:

run=ndlCreateNetwork

ndlCreateNetwork=[
    # vocabulary size
    featDim=3000
    # vocabulary size
    labelDim=3000
    # hidden layer size
    hiddenDim=200
    # number of classes
    nbrClass=50
    
    initScale=6
    
    features=SparseInput(featDim, tag="feature")
    
    # labels for class-based cross entropy are dense and contain 4 values per sample
    labels=Input(4, tag="label")

    # define network
    WFeat2Hid=Parameter(hiddenDim, featDim, init="uniform", initValueScale=initScale)
    WHid2Hid=Parameter(hiddenDim, hiddenDim, init="uniform", initValueScale=initScale)

    # WHid2Word is special: its shape is hiddenDim x labelDim
    WHid2Word=Parameter(hiddenDim, labelDim, init="uniform", initValueScale=initScale)
    WHid2Class=Parameter(nbrClass, hiddenDim, init="uniform", initValueScale=initScale)
   
    PastHid = Delay(hiddenDim, HidAfterSig, delayTime=1, needGradient=true)    
    HidFromFeat = Times(WFeat2Hid, features)
    HidFromRecur = Times(WHid2Hid, PastHid)
    HidBeforeSig = Plus(HidFromFeat, HidFromRecur)
    HidAfterSig = Sigmoid(HidBeforeSig)
    
    Out = TransposeTimes(WHid2Word, HidAfterSig)  #word part
    
    ClassProbBeforeSoftmax=Times(WHid2Class, HidAfterSig)
    
    cr = ClassBasedCrossEntropyWithSoftmax(labels, HidAfterSig, WHid2Word, ClassProbBeforeSoftmax, tag="criterion")
    EvalNodes=(cr)
    OutputNodes=(cr)
]

 

Looking at this code, CNTK makes you spend a large part of your effort on the data reader.

writeWordAndClassInfo simply collects statistics over the vocabulary and clusters the words. A class-based RNN is used here mainly to speed up computation: the words are first partitioned into disjoint classes. The file this module outputs has four columns: word index, frequency, the word itself, and its class.
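For example, the first few lines of the generated vocab.txt might look like the following (the words, counts, and class IDs here are made up purely for illustration):

0	5834	</s>	0
1	4210	the	0
2	3177	good	1
3	2051	service	2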
Train, naturally, trains the model; with a large amount of text, training is still quite slow.
Write is the output module; note this line in particular:
outputNodeNames = "Out,WFeat2Hid,WHid2Hid,WHid2Word"

I suspect what most people care about is how to get the hidden-layer values after running a sentence through the trained RNN. My approach is to save the trained RNN's parameters; then, whether you work in Java or Python, you can rebuild the RNN from those parameters and do whatever you want with it.
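For instance, here is a minimal Python sketch of that idea. It assumes the write action above has dumped WFeat2Hid and WHid2Hid to text files (the file names and the plain-text loading are my assumptions, not something CNTK guarantees), and it replays the recurrence to collect the hidden layer at every time step:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def load_matrix(path):
    # assumption: the exported matrix is whitespace-separated plain text,
    # one row per line; adjust to whatever your CNTK build actually writes
    return np.loadtxt(path)

def hidden_states(word_indices, W_feat2hid, W_hid2hid, vocab_size=3000):
    # replay the trained RNN over one sentence and collect the hidden layer at each step
    hidden_dim = W_feat2hid.shape[0]
    h = np.full(hidden_dim, 0.1)      # defaultHiddenActivity from the config above
    states = []
    for idx in word_indices:
        x = np.zeros(vocab_size)
        x[idx] = 1.0                  # one-hot vector for the current word
        h = sigmoid(W_feat2hid @ x + W_hid2hid @ h)
        states.append(h)
    return states

# hypothetical usage:
# W_feat2hid = load_matrix("Output/Write.WFeat2Hid")
# W_hid2hid  = load_matrix("Output/Write.WHid2Hid")
# states = hidden_states([12, 7, 253], W_feat2hid, W_hid2hid)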

In Train I used my own model definition via NDLNetworkBuilder. You can also use the generic recurrent model, in which case you only need to set a few simple parameters, for example:

SimpleNetworkBuilder=[
        trainingCriterion=classcrossentropywithsoftmax
        evalCriterion=classcrossentropywithsoftmax
        nodeType=Sigmoid
        initValueScale=6.0
        layerSizes=10000:200:10000
        addPrior=false
        addDropoutNodes=false
        applyMeanVarNorm=false
        uniformInit=true;

        # these are for the class information for class-based language modeling
        vocabSize=10000
        nbrClass=50
    ]

I defined my own network here mainly because I want to turn it into an LSTM structure later.

 

Original post; please do not repost without permission.

 

