The previous post, 用 CNTK 搞深度學習 (一) 入門 (Deep Learning with CNTK, Part 1: Getting Started), walked through building a simple feed-forward neural network with CNTK, so this post assumes you already know the basics of using it. Here we build something a bit more complex, and one of the hottest models in natural language processing: a language model based on a recurrent neural network.
A recurrent neural network (RNN) is, drawn as a graph, a network whose hidden layer feeds back into itself (this is of course only one kind of RNN):

Unlike an ordinary neural network, an RNN does not assume the samples are independent of one another. For example, to predict which character follows 上, the characters seen before it matter a great deal: if 工作 (work) appeared earlier, the next word is probably 上班 (go to work); if 家鄉 (hometown) appeared earlier, it is probably 上海 (Shanghai). An RNN is good at learning such sequential features. Put simply, an RNN treats the hidden-layer values from the previous time step as extra features and feeds them in as part of the input at the next time step.
The language model we build here is: given a word, predict which word is likely to appear next.
The input of this RNN is dim-dimensional, where dim is the vocabulary size. The input vector is 1 only in the component that represents the current word and 0 everywhere else, i.e. [0,0,0,...,0,1,0,...,0]. The output is also a dim-dimensional vector, giving the probability of each word appearing next.
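A minimal numpy sketch may make the recurrence concrete. The dimensions match the config further below (vocabulary 3000, hidden layer 200), but the weight matrices here are random placeholders of my own, not anything CNTK produces:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

dim, hidden = 3000, 200   # vocabulary size and hidden-layer size, as in the config below
W_xh = np.random.uniform(-0.1, 0.1, (hidden, dim))     # input  -> hidden
W_hh = np.random.uniform(-0.1, 0.1, (hidden, hidden))  # hidden -> hidden (the recurrence)
W_hy = np.random.uniform(-0.1, 0.1, (dim, hidden))     # hidden -> output

def step(word_index, h_prev):
    """One time step: one-hot input and previous hidden state in, next-word distribution out."""
    x = np.zeros(dim)
    x[word_index] = 1.0                                    # [0,...,0,1,0,...,0]
    h = 1.0 / (1.0 + np.exp(-(W_xh @ x + W_hh @ h_prev)))  # sigmoid hidden layer
    y = softmax(W_hy @ h)                                  # probability of every word in the vocabulary
    return y, h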
Building an RNN model in CNTK differs from building an ordinary neural network in two main ways:
(1) Input format. The input is text split into sentences, and the words within a sentence are ordered, so the input has to be described in the LMSequenceReader format. This format is rather painful (and, to grumble once more, I do not fully understand it either, so I will not explain it in detail; you can work it out from the example below).
(2) The model: a recurrent model is needed, which mainly comes down to using the Delay() function (see the PastHid = Delay(...) line in rnnlm.ndl below).
A working config follows (after being burned by the official tutorial yet again, the config below is adapted from CNTK-2016-02-08-Windows-64bit-CPU-Only\cntk\Examples\Text\PennTreebank\Config):
# Parameters can be overwritten on the command line
# for example: cntk configFile=myConfigFile RootDir=../..
# For running from Visual Studio add
# currentDirectory=$(SolutionDir)/<path to corresponding data folder>
RootDir = ".."

ConfigDir = "$RootDir$/Config"
DataDir = "$RootDir$/Data"
OutputDir = "$RootDir$/Output"
ModelDir = "$OutputDir$/Models"

# deviceId=-1 for CPU, >=0 for GPU devices, "auto" chooses the best GPU, or CPU if no usable GPU is available
deviceId = "-1"

command = writeWordAndClassInfo:train
#command = write

precision = "float"
traceLevel = 1
modelPath = "$ModelDir$/rnn.dnn"

# uncomment the following line to write logs to a file
stderr=$OutputDir$/rnnOutput
type = double
numCPUThreads = 4

confVocabSize = 3000
confClassSize = 50

#trainFile = "ptb.train.txt"
trainFile = "review_tokens_split_first5w_lines.txt"
#validFile = "ptb.valid.txt"
testFile = "review_tokens_split_first10_lines.txt"

writeWordAndClassInfo = [
    action = "writeWordAndClass"
    inputFile = "$DataDir$/$trainFile$"
    outputVocabFile = "$ModelDir$/vocab.txt"
    outputWord2Cls = "$ModelDir$/word2cls.txt"
    outputCls2Index = "$ModelDir$/cls2idx.txt"
    vocabSize = "$confVocabSize$"
    nbrClass = "$confClassSize$"
    cutoff = 1
    printValues = true
]

#######################################
#  TRAINING CONFIG                    #
#######################################

train = [
    action = "train"
    minibatchSize = 10
    traceLevel = 1
    epochSize = 0
    recurrentLayer = 1
    defaultHiddenActivity = 0.1
    useValidation = true
    rnnType = "CLASSLM"

    # uncomment below and comment SimpleNetworkBuilder section to use NDL to train RNN LM
    NDLNetworkBuilder = [
        networkDescription = "D:\tools\Deep Learning\CNTK-2016-02-08-Windows-64bit-CPU-Only\cntk\Examples\Text\PennTreebank\AdditionalFiles\RNNLM\rnnlm.ndl"
    ]

    SGD = [
        learningRatesPerSample = 0.1
        momentumPerMB = 0
        gradientClippingWithTruncation = true
        clippingThresholdPerSample = 15.0
        maxEpochs = 6
        unroll = false
        numMBsToShowResult = 100
        gradUpdateType = "none"
        loadBestModel = true

        # settings for Auto Adjust Learning Rate
        AutoAdjust = [
            autoAdjustLR = "adjustAfterEpoch"
            reduceLearnRateIfImproveLessThan = 0.001
            continueReduce = false
            increaseLearnRateIfImproveMoreThan = 1000000000
            learnRateDecreaseFactor = 0.5
            learnRateIncreaseFactor = 1.382
            numMiniBatch4LRSearch = 100
            numPrevLearnRates = 5
            numBestSearchEpoch = 1
        ]

        dropoutRate = 0.0
    ]

    reader = [
        readerType = "LMSequenceReader"
        randomize = "none"
        nbruttsineachrecurrentiter = 16

        # word class info
        wordclass = "$ModelDir$/vocab.txt"

        # if writerType is set, we will cache to a binary file
        # if the binary file exists, we will use it instead of parsing this file
        # writerType=BinaryReader

        # write definition
        wfile = "$OutputDir$/sequenceSentence.bin"
        # wsize - inital size of the file in MB
        # if calculated size would be bigger, that is used instead
        wsize = 256

        # wrecords - number of records we should allocate space for in the file
        # files cannot be expanded, so this should be large enough.
        # If known modify this element in config before creating file
        wrecords = 1000
        # windowSize - number of records we should include in BinaryWriter window
        windowSize = "$confVocabSize$"

        file = "$DataDir$/$trainFile$"

        # additional features sections
        # for now store as expanded category data (including label in)
        features = [
            # sentence has no features, so need to set dimension to zero
            dim = 0
            # write definition
            sectionType = "data"
        ]

        # sequence break table, list indexes into sequence records, so we know when a sequence starts/stops
        sequence = [
            dim = 1
            wrecords = 2
            # write definition
            sectionType = "data"
        ]

        # labels sections
        labelIn = [
            dim = 1
            labelType = "Category"
            beginSequence = "</s>"
            endSequence = "</s>"

            # vocabulary size
            labelDim = "$confVocabSize$"
            labelMappingFile = "$OutputDir$/sentenceLabels.txt"

            # Write definition
            # sizeof(unsigned) which is the label index type
            elementSize = 4
            sectionType = "labels"

            mapping = [
                # redefine number of records for this section, since we don't need to save it for each data record
                wrecords = 11
                # variable size so use an average string size
                elementSize = 10
                sectionType = "labelMapping"
            ]

            category = [
                dim = 11
                # elementSize = sizeof(ElemType) is default
                sectionType = "categoryLabels"
            ]
        ]

        # labels sections
        labels = [
            dim = 1
            labelType = "NextWord"
            beginSequence = "O"
            endSequence = "O"

            # vocabulary size
            labelDim = "$confVocabSize$"
            labelMappingFile = "$OutputDir$/sentenceLabels.out.txt"

            # Write definition
            # sizeof(unsigned) which is the label index type
            elementSize = 4
            sectionType = "labels"

            mapping = [
                # redefine number of records for this section, since we don't need to save it for each data record
                wrecords = 3
                # variable size so use an average string size
                elementSize = 10
                sectionType = "labelMapping"
            ]

            category = [
                dim = 3
                # elementSize = sizeof(ElemType) is default
                sectionType = "categoryLabels"
            ]
        ]
    ]
]

write = [
    action = "write"
    outputPath = "$OutputDir$/Write"
    #outputPath = "-"                  # "-" will write to stdout; useful for debugging
    outputNodeNames = "Out,WFeat2Hid,WHid2Hid,WHid2Word"
    # when processing one sentence per minibatch, this is the sentence posterior
    #format = [
    #    sequencePrologue = "log P(W)="    # (using this to demonstrate some formatting strings)
    #    type = "real"
    #]
    minibatchSize = 1   # choose this to be big enough for the longest sentence
                        # need to be small since models are updated for each minibatch
    traceLevel = 1
    epochSize = 0

    reader = [
        # reader to use
        readerType = "LMSequenceReader"
        randomize = "none"               # BUGBUG: This is ignored.
        nbruttsineachrecurrentiter = 1   # one sentence per minibatch
        cacheBlockSize = 1               # workaround to disable randomization

        # word class info
        wordclass = "$ModelDir$/vocab.txt"

        # if writerType is set, we will cache to a binary file
        # if the binary file exists, we will use it instead of parsing this file
        # writerType = "BinaryReader"

        # write definition
        wfile = "$OutputDir$/sequenceSentence.bin"
        # wsize - inital size of the file in MB
        # if calculated size would be bigger, that is used instead
        wsize = 256

        # wrecords - number of records we should allocate space for in the file
        # files cannot be expanded, so this should be large enough.
        # If known modify this element in config before creating file
        wrecords = 1000
        # windowSize - number of records we should include in BinaryWriter window
        windowSize = "$confVocabSize$"

        file = "$DataDir$/$testFile$"

        # additional features sections
        # for now store as expanded category data (including label in)
        features = [
            # sentence has no features, so need to set dimension to zero
            dim = 0
            # write definition
            sectionType = "data"
        ]

        # labels sections
        labelIn = [
            dim = 1

            # vocabulary size
            labelDim = "$confVocabSize$"
            labelMappingFile = "$OutputDir$/sentenceLabels.txt"
            labelType = "Category"
            beginSequence = "</s>"
            endSequence = "</s>"

            # Write definition
            # sizeof(unsigned) which is the label index type
            elementSize = 4
            sectionType = "labels"

            mapping = [
                # redefine number of records for this section, since we don't need to save it for each data record
                wrecords = 11
                # variable size so use an average string size
                elementSize = 10
                sectionType = "labelMapping"
            ]

            category = [
                dim = 11
                # elementSize = sizeof(ElemType) is default
                sectionType = "categoryLabels"
            ]
        ]

        # labels sections
        labels = [
            dim = 1
            labelType = "NextWord"
            beginSequence = "O"
            endSequence = "O"

            # vocabulary size
            labelDim = "$confVocabSize$"
            labelMappingFile = "$OutputDir$/sentenceLabels.out.txt"

            # Write definition
            # sizeof(unsigned) which is the label index type
            elementSize = 4
            sectionType = "labels"

            mapping = [
                # redefine number of records for this section, since we don't need to save it for each data record
                wrecords = 3
                # variable size so use an average string size
                elementSize = 10
                sectionType = "labelMapping"
            ]

            category = [
                dim = 3
                # elementSize = sizeof(ElemType) is default
                sectionType = "categoryLabels"
            ]
        ]
    ]
]
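Assuming the config above is saved as, say, rnn.cntk (any file name works), it is launched exactly as the comment at the top of the file suggests, and the sub-commands listed in the command= line run in order:

cntk configFile=rnn.cntk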
rnnlm.ndl:
run=ndlCreateNetwork

ndlCreateNetwork=[
    # vocabulary size
    featDim=3000
    # vocabulary size
    labelDim=3000
    # hidden layer size
    hiddenDim=200
    # number of classes
    nbrClass=50
    initScale=6

    features=SparseInput(featDim, tag="feature")
    # labels in classbasedCrossEntropy is dense and contain 4 values for each sample
    labels=Input(4, tag="label")

    # define network
    WFeat2Hid=Parameter(hiddenDim, featDim, init="uniform", initValueScale=initScale)
    WHid2Hid=Parameter(hiddenDim, hiddenDim, init="uniform", initValueScale=initScale)
    # WHid2Word is special that it is hiddenSize X labelSize
    WHid2Word=Parameter(hiddenDim, labelDim, init="uniform", initValueScale=initScale)
    WHid2Class=Parameter(nbrClass, hiddenDim, init="uniform", initValueScale=initScale)

    PastHid = Delay(hiddenDim, HidAfterSig, delayTime=1, needGradient=true)
    HidFromFeat = Times(WFeat2Hid, features)
    HidFromRecur = Times(WHid2Hid, PastHid)
    HidBeforeSig = Plus(HidFromFeat, HidFromRecur)
    HidAfterSig = Sigmoid(HidBeforeSig)

    Out = TransposeTimes(WHid2Word, HidAfterSig)   # word part
    ClassProbBeforeSoftmax = Times(WHid2Class, HidAfterSig)

    cr = ClassBasedCrossEntropyWithSoftmax(labels, HidAfterSig, WHid2Word, ClassProbBeforeSoftmax, tag="criterion")
    EvalNodes = (cr)
    OutputNodes = (cr)
]
As the config shows, CNTK makes you spend a large share of your effort on the data reader.
writeWordAndClassInfo simply collects statistics over the vocabulary and clusters the words. A class-based RNN is used here mainly to speed up the computation: the words are first partitioned into a number of disjoint classes. The file this step outputs has four columns: word index, frequency, the word itself, and its class id.
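The speed-up comes from factorizing the output softmax: the network first predicts the class of the next word and then the word within that class, roughly P(w | h) = P(class(w) | h) * P(w | class(w), h), which is what ClassBasedCrossEntropyWithSoftmax in the NDL above implements. A rough numpy illustration of the idea (the variable names and the parsing of vocab.txt below are mine, not CNTK's):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# vocab.txt: one row per word with columns  index  frequency  word  class
word2class, class_members = {}, {}
for line in open("vocab.txt", encoding="utf-8"):
    idx, _freq, _word, cls = line.split()
    word2class[int(idx)] = int(cls)
    class_members.setdefault(int(cls), []).append(int(idx))

def next_word_prob(h, w, W_hy, W_hc):
    """P(w | h) = P(class(w) | h) * P(w | class(w), h); h is the hidden-layer vector."""
    c = word2class[w]
    members = class_members[c]
    p_class = softmax(W_hc @ h)                  # distribution over the 50 classes
    p_in_class = softmax(W_hy[members] @ h)      # normalized only within the class of w
    return p_class[c] * p_in_class[members.index(w)]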
Train, naturally, trains the model; with a large amount of text it is still quite slow.
Write is the output module; note this line in particular: outputNodeNames = "Out,WFeat2Hid,WHid2Hid,WHid2Word"
I suspect what most people really want is the hidden-layer values you get after running the trained RNN over a sentence. My approach is to save the trained RNN's parameters; with those in hand, whether you work in Java or Python, you can rebuild the RNN yourself and then do whatever you like with it.
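For instance, once the matrices listed in outputNodeNames (WFeat2Hid, WHid2Hid, WHid2Word) have been exported by the write command and parsed into numpy arrays (the text files it writes need a bit of parsing of your own), a sketch like the following replays the network over a sentence and collects the hidden layer at each position:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_states(word_indexes, WFeat2Hid, WHid2Hid, default_activity=0.1):
    """Replay the trained RNN over one sentence and collect the hidden layer at every word.

    word_indexes - the sentence, already mapped to vocabulary indexes via vocab.txt
    WFeat2Hid    - hiddenDim x vocabSize matrix exported from CNTK
    WHid2Hid     - hiddenDim x hiddenDim matrix exported from CNTK
    """
    h = np.full(WHid2Hid.shape[0], default_activity)  # matches defaultHiddenActivity in the config
    states = []
    for w in word_indexes:
        # the input is one-hot, so Times(WFeat2Hid, x) is just column w of WFeat2Hid
        h = sigmoid(WFeat2Hid[:, w] + WHid2Hid @ h)
        states.append(h)
    return states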
In the Train section I used my own network definition via NDLNetworkBuilder. You can instead use the built-in recurrent model, in which case you only need to set a few parameters, for example:
SimpleNetworkBuilder = [
    trainingCriterion = classcrossentropywithsoftmax
    evalCriterion = classcrossentropywithsoftmax
    nodeType = Sigmoid
    initValueScale = 6.0
    layerSizes = 10000:200:10000
    addPrior = false
    addDropoutNodes = false
    applyMeanVarNorm = false
    uniformInit = true

    # these are for the class information for class-based language modeling
    vocabSize = 10000
    nbrClass = 50
]
I define the network myself mainly because I want to turn it into an LSTM structure later.
Original post; please do not repost without permission.
