Tacotron2
Prerequisites
A time-to-frequency transform gives us the spectrum (the "side view" in the usual 3-D picture of Fourier analysis), but this spectrum does not carry all of the information in the time-domain signal: it only records the amplitude of each sinusoidal component and says nothing about phase. In the basic sinusoid \(A\sin(\omega t+\theta)\), amplitude, frequency and phase are all indispensable, and the phase determines where the wave sits in time. Frequency-domain analysis therefore needs a phase spectrum in addition to the magnitude spectrum.
- Time-domain representation: time vs. amplitude
- Magnitude (frequency) spectrum: frequency vs. amplitude
- Phase spectrum: frequency vs. phase
See: 傅里葉分析之掐死教程(完整版) (updated 2014-06-06)
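As a minimal, self-contained illustration of the magnitude/phase split (not part of Tacotron2 itself; the file name and STFT parameters below are arbitrary placeholders), the two pieces can be pulled apart from a short-time Fourier transform with numpy and librosa:

import numpy as np
import librosa

# Load any mono waveform; the path is a placeholder.
y, sr = librosa.load('example.wav', sr=22050)

# Complex STFT: every time-frequency bin carries an amplitude and a phase.
D = librosa.stft(y, n_fft=1024, hop_length=256)
magnitude = np.abs(D)    # the magnitude spectrogram
phase = np.angle(D)      # the phase spectrogram, discarded by mel features

# Faithful reconstruction needs both pieces; dropping `phase` loses information.
y_rec = librosa.istft(magnitude * np.exp(1j * phase), hop_length=256)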
Traditional speech synthesis:
- Unit selection and concatenation: stitches together small pre-recorded waveform fragments; the artifacts at unit boundaries are obvious
- Statistical parametric synthesis: directly generates smooth trajectories of speech features and hands them to a vocoder; the resulting speech sounds muffled and unnatural
Tacotron2 consists of two parts:
- a seq2seq feature prediction network that maps character embeddings to mel spectrograms
- a modified WaveNet that synthesizes time-domain waveforms from the mel spectrograms
A mel spectrogram is obtained by applying a nonlinear transform to the frequency axis of the STFT (linear) spectrogram, compressing the frequency scale according to properties of human hearing: low-frequency detail is critical for speech intelligibility, while high-frequency detail can be de-emphasized. Tacotron2 uses this low-level acoustic representation, the mel spectrogram, to bridge the two components because:
- mel spectrograms are easy to compute from the time-domain waveform
- mel spectrograms are phase-invariant within each frame, so they are easy to train with a mean squared error (MSE) loss
The mel spectrogram discards phase information; an algorithm such as Griffin-Lim estimates the discarded phase and then applies an inverse short-time Fourier transform to turn the spectrogram back into a time-domain waveform. In other words, the mel spectrogram is lossy, and the vocoder, whether the classic Griffin-Lim algorithm or the more recent WaveNet, has to "fill in" what is missing. Tacotron2's feature prediction network outputs mel spectrograms to lighten its burden, and a separate WaveNet vocoder converts the mel spectrograms into waveforms.
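A rough sketch of that pipeline with librosa (the parameter values are placeholders, and librosa.feature.inverse.mel_to_audio, which runs Griffin-Lim internally, is only available in newer librosa releases):

import librosa

y, sr = librosa.load('example.wav', sr=22050)  # placeholder input

# Mel spectrogram: cheap to compute from the waveform, but phase is discarded.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)

# Griffin-Lim-style inversion: iteratively estimates the missing phase,
# then applies an inverse STFT to recover a (lossy) waveform.
y_approx = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024,
                                                hop_length=256)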
Spectrogram prediction network
The encoder converts the character sequence into hidden representations, which the decoder consumes to predict the spectrogram. An attention network is built on top of the encoder output: for each decoder output step it summarizes the full encoded sequence into a fixed-length context vector. The MSE is minimized both before and after the post-net to speed up convergence, and a stop token is used at inference time to end generation dynamically.
WaveNet vocoder
The two components are trained separately: WaveNet makes its predictions conditioned on the output of Tacotron2's feature prediction network. An alternative is to train WaveNet on mel spectrograms extracted from ground-truth audio. Note: if you train on predicted features, synthesize from predicted features; if you train on ground-truth features, synthesize from ground-truth features. Mixing them, such as training on predicted features but synthesizing from ground-truth features, gives the worst quality.
The paper also notes an interesting trade-off between the number of mel channels and synthesis quality. After decoding, a post-processing network incorporates context to improve quality; although WaveNet itself contains convolutions, comparisons show that adding the post-net still yields better audio. Compared with the unmodified WaveNet (30 convolutional layers, a 256 ms receptive field), the modified WaveNet has only 12 convolutional layers and a 10.5 ms receptive field, yet it still synthesizes high-quality speech. A large receptive field is therefore not essential for audio quality; however, removing all dilated convolutions, which shrinks the receptive field by two orders of magnitude relative to the baseline, degrades quality substantially. So while the model does not need a large receptive field, an adequate amount of context is necessary.
Implementation code
Using Tacotron-2_Rayhane-mamah@github as an example, this section walks through the implementation of Tacotron2's feature prediction network.
The main file of the feature prediction network is Tacotron-2/tacotron/models/tacotron.py.
- Character Embedding (Input Text and Char Embedding in the architecture diagram): maps the input text to real-valued vectors
[batch_size, sequence_length] -> [batch_size, sequence_length, embedding_size]
e.g.:
\[ [[2,4],\ [3]] \to [[[0.3,0.1,0.5,0.9],\ [0.5,0.1,1.9,0.3]],\ [[1.3,0.4,5.1,0.8]]] \]

embedding_table = tf.get_variable(
    'inputs_embedding', [len(symbols), hp.embedding_dim], dtype=tf.float32)
embedded_inputs = tf.nn.embedding_lookup(embedding_table, inputs)
- embedding_table: [len(symbols), embedding_size]. A trainable TensorFlow variable; len(symbols) is the number of distinct characters
- inputs: [batch_size, sequence_length]. sequence_length is the number of time steps in the input sequence; the matrix entries are character IDs
- embedded_inputs: [batch_size, sequence_length, embedding_size]
- Encoder (3 Conv Layers and Bidirectional LSTM in the architecture diagram): the encoder
[batch_size, sequence_length, embedding_size] -> [batch_size, encoder_steps, encoder_lstm_units]
encoder_cell = TacotronEncoderCell(
    EncoderConvolutions(is_training, hparams=hp, scope='encoder_convolutions'),
    EncoderRNN(is_training, size=hp.encoder_lstm_units,
        zoneout=hp.tacotron_zoneout_rate, scope='encoder_LSTM'))
encoder_outputs = encoder_cell(embedded_inputs, input_lengths)
- encoder_outputs: [batch_size, encoder_steps, encoder_lstm_units]
Here TacotronEncoderCell, like BasicRNNCell, GRUCell and BasicLSTMCell, is a custom RNNCell that inherits from `from tensorflow.contrib.rnn import RNNCell`; see: tensorflow中RNNcell源碼分析以及自定義RNNCell的方法.
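For orientation, a custom RNNCell only has to expose state_size, output_size and a call method. The sketch below is a hypothetical minimal cell, not the repo's TacotronEncoderCell (whose call simply chains the convolution stack and the bidirectional LSTM):

import tensorflow as tf
from tensorflow.contrib.rnn import RNNCell

class MinimalCell(RNNCell):
    def __init__(self, num_units):
        super(MinimalCell, self).__init__()
        self._num_units = num_units

    @property
    def state_size(self):
        return self._num_units

    @property
    def output_size(self):
        return self._num_units

    def __call__(self, inputs, state, scope=None):
        # A real cell would do something useful here; this one just mixes
        # the current input with the previous state through a dense layer.
        output = tf.layers.dense(tf.concat([inputs, state], axis=-1),
            self._num_units, activation=tf.tanh)
        return output, output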
- The first argument of TacotronEncoderCell, EncoderConvolutions, corresponds to the 3 Conv Layers:
with tf.variable_scope(self.scope):
    x = inputs
    for i in range(self.enc_conv_num_layers):
        x = conv1d(x, self.kernel_size, self.channels, self.activation,
            self.is_training, self.drop_rate, 'conv_layer_{}_'.format(i + 1) + self.scope)
    return x
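The conv1d helper used above is defined elsewhere in the repo (tacotron/models/modules.py). As a hedged sketch, assuming it follows the usual conv, batch norm, activation, dropout pattern rather than being the repo's exact code:

def conv1d(inputs, kernel_size, channels, activation, is_training, drop_rate, scope):
    with tf.variable_scope(scope):
        conv = tf.layers.conv1d(inputs, filters=channels,
            kernel_size=kernel_size, padding='same')
        batched = tf.layers.batch_normalization(conv, training=is_training)
        activated = activation(batched)
        return tf.layers.dropout(activated, rate=drop_rate,
            training=is_training, name='dropout_{}'.format(scope))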
- The second argument of TacotronEncoderCell, EncoderRNN, corresponds to the Bidirectional LSTM:
with tf.variable_scope(self.scope):
    outputs, (fw_state, bw_state) = tf.nn.bidirectional_dynamic_rnn(
        self._fw_cell, self._bw_cell, inputs,
        sequence_length=input_lengths,
        dtype=tf.float32, swap_memory=True)
    # Concat and return forward + backward outputs
    return tf.concat(outputs, axis=2)
Here self._fw_cell and self._bw_cell are both instances of another custom RNNCell, ZoneoutLSTMCell, a relatively recent LSTM variant; see: Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations.
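The idea behind zoneout, in a minimal sketch (the repo's ZoneoutLSTMCell wraps an ordinary LSTMCell and applies this to both the cell state and the hidden state; the helper name here is hypothetical): at each step, a randomly chosen subset of units keeps its previous value instead of being updated.

def zoneout(previous_value, new_value, rate, is_training):
    if is_training:
        # With probability `rate`, a unit keeps its previous value.
        keep_previous = tf.cast(
            tf.random_uniform(tf.shape(new_value)) < rate, tf.float32)
        return keep_previous * previous_value + (1.0 - keep_previous) * new_value
    # At inference, use the expected value of the stochastic update.
    return rate * previous_value + (1.0 - rate) * new_value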
- Decoder (2 Layer Pre-Net, Location Sensitive Attention, 2 LSTM Layers and Linear Projection in the architecture diagram): the decoder
[batch_size, encoder_steps, encoder_lstm_units] -> [batch_size, decoder_steps, num_mels×r]
# Attention Decoder Prenet
prenet = Prenet(is_training, layers_sizes=hp.prenet_layers,
    drop_rate=hp.tacotron_dropout_rate, scope='decoder_prenet')
# Attention Mechanism
attention_mechanism = LocationSensitiveAttention(hp.attention_dim, encoder_outputs,
    hparams=hp, mask_encoder=hp.mask_encoder,
    memory_sequence_length=input_lengths, smoothing=hp.smoothing,
    cumulate_weights=hp.cumulative_weights)
# Decoder LSTM Cells
decoder_lstm = DecoderRNN(is_training, layers=hp.decoder_layers,
    size=hp.decoder_lstm_units, zoneout=hp.tacotron_zoneout_rate,
    scope='decoder_lstm')
- Prenet (2 Layer Pre-Net): Dense + Dropout
x = inputs
with tf.variable_scope(self.scope):
    for i, size in enumerate(self.layers_sizes):
        dense = tf.layers.dense(x, units=size, activation=self.activation,
            name='dense_{}'.format(i + 1))
        # The paper discussed introducing diversity in generation at inference time
        # by using a dropout of 0.5 only in prenet layers (in both training and inference).
        x = tf.layers.dropout(dense, rate=self.drop_rate, training=True,
            name='dropout_{}'.format(i + 1) + self.scope)
return x
- LocationSensitiveAttention (Location Sensitive Attention): a subclass of BahdanauAttention (from tensorflow.contrib.seq2seq.python.ops.attention_wrapper import BahdanauAttention)
Location-sensitive attention is a slightly modified hybrid attention mechanism:
- energy
\[ e_{ij}=v_a^T \tanh(W s_i + V h_j + U f_{i,j} + b) \]
where \(v_a^T\), \(W\), \(V\), \(U\) and \(b\) are trainable parameters, \(s_i\) is the RNN hidden state at the current decoding step, \(h_j\) is the encoder hidden state, and \(f_{i,j}\) are location features obtained by convolving the cumulative previous alignments.
# processed query: [batch_size, query_depth] -> [batch_size, attention_dim]
W_query = self.query_layer(query) if self.query_layer else query
# -> [batch_size, 1, attention_dim]
W_query = tf.expand_dims(W_query, axis=1)
# processed location features shape [batch_size, max_time, attention_dim]
# [batch_size, max_time] -> [batch_size, max_time, 1]
expanded_alignments = tf.expand_dims(previous_alignments, axis=2)
# location features [batch_size, max_time, filters]
f = self.location_convolution(expanded_alignments)
# Projected location features [batch_size, max_time, attention_dim]
W_fil = self.location_layer(f)
v_a = tf.get_variable('attention_variable', shape=[num_units], dtype=dtype,
    initializer=tf.contrib.layers.xavier_initializer())
b_a = tf.get_variable('attention_bias', shape=[num_units], dtype=dtype,
    initializer=tf.zeros_initializer())
# W_keys are the processed encoder outputs (self.keys); the result is the energy e_{ij}
energy = tf.reduce_sum(v_a * tf.tanh(W_keys + W_query + W_fil + b_a), axis=[2])
where \(f_{i,j} \to\) W_fil, via self.location_convolution and self.location_layer:
self.location_convolution = tf.layers.Conv1D(filters=hparams.attention_filters,
    kernel_size=hparams.attention_kernel, padding='same', use_bias=True,
    bias_initializer=tf.zeros_initializer(), name='location_features_convolution')
self.location_layer = tf.layers.Dense(units=num_units, use_bias=False,
    dtype=tf.float32, name='location_features_layer')
- alignments | attention weights
\[ \alpha_{ij}=\frac{\exp(e_{ij})}{\sum_{k=1}^{T_x}\exp(e_{ik})} \]

alignments = self._probability_fn(energy, previous_alignments)
- context vector
\[ c_{i}=\sum_{j=1}^{T_x}\alpha_{ij}h_j \]

context = math_ops.matmul(expanded_alignments, attention_mechanism.values)
context = array_ops.squeeze(context, axis=[1])
- FrameProjection & StopProjection (Linear Projection): Dense layers
- FrameProjection:
with tf.variable_scope(self.scope):
    # If activation==None, this returns a simple Linear projection
    # else the projection will be passed through an activation function
    # output = tf.layers.dense(inputs, units=self.shape, activation=self.activation,
    #     name='projection_{}'.format(self.scope))
    output = self.dense(inputs)
    return output
- StopProjection:
with tf.variable_scope(self.scope):
    output = tf.layers.dense(inputs, units=self.shape, activation=None,
        name='projection_{}'.format(self.scope))
    # During training, don't use activation as it is integrated inside the sigmoid_cross_entropy loss function
    if self.is_training:
        return output
    return self.activation(output)
In the Decoder implementation, the instantiated prenet, attention_mechanism, decoder_lstm, frame_projection and stop_projection are passed into TacotronDecoderCell:
decoder_cell = TacotronDecoderCell(
prenet,
attention_mechanism,
decoder_lstm,
frame_projection,
stop_projection)
Here TacotronDecoderCell again inherits from `from tensorflow.contrib.rnn import RNNCell`; one decoding step does the following:
#Information bottleneck (essential for learning attention)
prenet_output = self._prenet(inputs)
#Concat context vector and prenet output to form LSTM cells input (input feeding)
LSTM_input = tf.concat([prenet_output, state.attention], axis=-1)
#Unidirectional LSTM layers
LSTM_output, next_cell_state = self._cell(LSTM_input, state.cell_state)
#Compute the attention (context) vector and alignments using
#the new decoder cell hidden state as query vector
#and cumulative alignments to extract location features
#The choice of the new cell hidden state (s_{i}) of the last
#decoder RNN Cell is based on Luong et Al. (2015):
#https://arxiv.org/pdf/1508.04025.pdf
previous_alignments = state.alignments
previous_alignment_history = state.alignment_history
context_vector, alignments, cumulated_alignments = _compute_attention(self._attention_mechanism,
LSTM_output,
previous_alignments,
attention_layer=None)
#Concat LSTM outputs and context vector to form projections inputs
projections_input = tf.concat([LSTM_output, context_vector], axis=-1)
#Compute predicted frames and predicted <stop_token>
cell_outputs = self._frame_projection(projections_input)
stop_tokens = self._stop_projection(projections_input)
#Save alignment history
alignment_history = previous_alignment_history.write(state.time, alignments)
#Prepare next decoder state
next_state = TacotronDecoderCellState(
time=state.time + 1,
cell_state=next_cell_state,
attention=context_vector,
alignments=cumulated_alignments,
alignment_history=alignment_history)
return (cell_outputs, stop_tokens), next_state
Then the helper is defined, the decoder state is initialized, and decoding begins:
#Define the helper for our decoder
if is_training or is_evaluating or gta:
self.helper = TacoTrainingHelper(batch_size, mel_targets, stop_token_targets, hp, gta, is_evaluating, global_step)
else:
self.helper = TacoTestHelper(batch_size, hp)
#initial decoder state
decoder_init_state = decoder_cell.zero_state(batch_size=batch_size, dtype=tf.float32)
#Only use max iterations at synthesis time
max_iters = hp.max_iters if not (is_training or is_evaluating) else None
#Decode
(frames_prediction, stop_token_prediction, _), final_decoder_state, _ = dynamic_decode(
CustomDecoder(decoder_cell, self.helper, decoder_init_state),
impute_finished=False,
maximum_iterations=max_iters,
swap_memory=hp.tacotron_swap_with_cpu)
# Reshape outputs to be one output per entry
#==> [batch_size, non_reduced_decoder_steps (decoder_steps * r), num_mels]
decoder_output = tf.reshape(frames_prediction, [batch_size, -1, hp.num_mels])
stop_token_prediction = tf.reshape(stop_token_prediction, [batch_size, -1])
For a concise explanation of defining Seq2Seq models with these APIs, see: Tensorflow新版Seq2Seq接口使用 and Dynamic Decoding-Tensorflow.
- Postnet (5 Conv Layer Post-Net in the architecture diagram): the post-processing network
[batch_size, decoder_steps×r, num_mels] -> [batch_size, decoder_steps×r, postnet_channels]
with tf.variable_scope(self.scope):
    x = inputs
    for i in range(self.postnet_num_layers - 1):
        x = conv1d(x, self.kernel_size, self.channels, self.activation,
            self.is_training, self.drop_rate, 'conv_layer_{}_'.format(i + 1) + self.scope)
    x = conv1d(x, self.kernel_size, self.channels, lambda _: _,
        self.is_training, self.drop_rate, 'conv_layer_{}_'.format(5) + self.scope)
return x
Then comes the residual connection:
[batch_size, decoder_steps×r, postnet_channels] -> [batch_size, decoder_steps×r, num_mels]
residual_projection = Dense(hp.num_mels, scope='postnet_projection')
projected_residual = residual_projection(residual)
mel_outputs = decoder_output + projected_residual
The code also provides an option to switch on a post-processing CBHG:
if post_condition:
    # Add post-processing CBHG:
    post_outputs = post_cbhg(mel_outputs, hp.num_mels, is_training)  # [N, T_out, 256]
    linear_outputs = tf.layers.dense(post_outputs, hp.num_freq)
- Loss computation
self.loss = self.before_loss + self.after_loss + self.stop_token_loss + self.regularization_loss + self.linear_loss
- self.before_loss & self.after_loss
The MaskedMSE computed before the Post-Net, and the MaskedMSE computed after the Post-Net and the residual connection. MaskedMSE is defined as follows:
def MaskedMSE(targets, outputs, targets_lengths, hparams, mask=None):
    '''Computes a masked Mean Squared Error
    '''
    # [batch_size, time_dimension, 1]
    # example:
    # sequence_mask([1, 3, 2], 5) = [[[1., 0., 0., 0., 0.]],
    #                                [[1., 1., 1., 0., 0.]],
    #                                [[1., 1., 0., 0., 0.]]]
    # Note the maxlen argument that ensures mask shape is compatible with r>1
    # This will by default mask the extra paddings caused by r>1
    if mask is None:
        mask = sequence_mask(targets_lengths, hparams.outputs_per_step, True)
    # [batch_size, time_dimension, channel_dimension(mels)]
    ones = tf.ones(shape=[tf.shape(mask)[0], tf.shape(mask)[1], tf.shape(targets)[-1]],
        dtype=tf.float32)
    mask_ = mask * ones
    with tf.control_dependencies([tf.assert_equal(tf.shape(targets), tf.shape(mask_))]):
        return tf.losses.mean_squared_error(labels=targets, predictions=outputs, weights=mask_)
That is, when computing the MSE between the prediction and the ground-truth sequence, the error at positions beyond the ground-truth length is zeroed out.
- self.stop_token_loss
The cross entropy between the ground-truth stop-token probabilities and the predicted ones. The stop token is predicted by a simple logistic regression; once the predicted probability exceeds 0.5, sequence generation stops. Similar to MaskedMSE, MaskedSigmoidCrossEntropy is defined as follows:
def MaskedSigmoidCrossEntropy(targets, outputs, targets_lengths, hparams, mask=None):
    '''Computes a masked SigmoidCrossEntropy with logits
    '''
    # [batch_size, time_dimension]
    # example:
    # sequence_mask([1, 3, 2], 5) = [[1., 0., 0., 0., 0.],
    #                                [1., 1., 1., 0., 0.],
    #                                [1., 1., 0., 0., 0.]]
    # Note the maxlen argument that ensures mask shape is compatible with r>1
    # This will by default mask the extra paddings caused by r>1
    if mask is None:
        mask = sequence_mask(targets_lengths, hparams.outputs_per_step, False)
    with tf.control_dependencies([tf.assert_equal(tf.shape(targets), tf.shape(mask))]):
        # Use a weighted sigmoid cross entropy to measure the <stop_token> loss. Set hparams.cross_entropy_pos_weight to 1
        # will have the same effect as vanilla tf.nn.sigmoid_cross_entropy_with_logits.
        losses = tf.nn.weighted_cross_entropy_with_logits(targets=targets, logits=outputs,
            pos_weight=hparams.cross_entropy_pos_weight)
    with tf.control_dependencies([tf.assert_equal(tf.shape(mask), tf.shape(losses))]):
        masked_loss = losses * mask
    return tf.reduce_sum(masked_loss) / tf.count_nonzero(masked_loss, dtype=tf.float32)
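At inference time the stop token drives termination. Conceptually (a hedged paraphrase of what the test helper does, not its exact code; at inference StopProjection has already applied the sigmoid, so the prediction is a probability):

# Stop once every sequence in the batch has a stop probability above 0.5;
# dynamic_decode then ends the generation loop.
finished = tf.reduce_all(tf.cast(tf.round(stop_token_prediction), tf.bool))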
- self.regularization_loss
The L2 regularization loss over all trainable variables (biases excluded):
# Get all trainable variables
all_vars = tf.trainable_variables()
regularization = tf.add_n([tf.nn.l2_loss(v) for v in all_vars
    if not ('bias' in v.name or 'Bias' in v.name)]) * reg_weight
- self.linear_loss
A loss that only exists when the feature prediction network also predicts linear spectrograms. The loss between the ground truth and the prediction is weighted up for the part of the linear spectrogram below 2000 Hz, because low frequencies are especially important for speech intelligibility:
def MaskedLinearLoss(targets, outputs, targets_lengths, hparams, mask=None):
    '''Computes a masked MAE loss with priority to low frequencies
    '''
    # [batch_size, time_dimension, 1]
    # example:
    # sequence_mask([1, 3, 2], 5) = [[[1., 0., 0., 0., 0.]],
    #                                [[1., 1., 1., 0., 0.]],
    #                                [[1., 1., 0., 0., 0.]]]
    # Note the maxlen argument that ensures mask shape is compatible with r>1
    # This will by default mask the extra paddings caused by r>1
    if mask is None:
        mask = sequence_mask(targets_lengths, hparams.outputs_per_step, True)
    # [batch_size, time_dimension, channel_dimension(freq)]
    ones = tf.ones(shape=[tf.shape(mask)[0], tf.shape(mask)[1], tf.shape(targets)[-1]],
        dtype=tf.float32)
    mask_ = mask * ones
    l1 = tf.abs(targets - outputs)
    n_priority_freq = int(2000 / (hparams.sample_rate * 0.5) * hparams.num_freq)
    with tf.control_dependencies([tf.assert_equal(tf.shape(targets), tf.shape(mask_))]):
        masked_l1 = l1 * mask_
        masked_l1_low = masked_l1[:, :, 0:n_priority_freq]
    mean_l1 = tf.reduce_sum(masked_l1) / tf.reduce_sum(mask_)
    mean_l1_low = tf.reduce_sum(masked_l1_low) / tf.reduce_sum(mask_)
    return 0.5 * mean_l1 + 0.5 * mean_l1_low
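As a quick worked example with hypothetical hparams values (sample_rate = 22050, num_freq = 513): n_priority_freq = int(2000 / 11025 × 513) = 93, so only the lowest 93 frequency bins receive the extra 0.5-weighted term.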
Reading the loss computation for the hp.mask_decoder == False case, i.e. without masking the decoder outputs, makes it easier to understand how the overall network loss is computed.
- Optimization
Training uses AdamOptimizer. The default learning-rate schedule is worth noting:
- < 50k steps: lr = 1e-3
- [50k, 310k] steps: lr decays exponentially from 1e-3 to 1e-5 via tf.train.exponential_decay
- > 310k steps: lr = 1e-5
# Compute natural exponential decay
lr = tf.train.exponential_decay(learning_rate=init_lr,
    global_step=global_step - hp.tacotron_start_decay,  # lr = 1e-3 at step 50k
    decay_steps=self.decay_steps,
    decay_rate=self.decay_rate,  # lr = 1e-5 around step 310k
    name='lr_exponential_decay')
# clip learning rate by max and min values (initial and final values)
return tf.minimum(tf.maximum(lr, hp.tacotron_final_learning_rate), init_lr)
Exponential learning-rate decay:
decayed_learning_rate = learning_rate * decay_rate^(global_step / decay_steps)
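As a worked example with hypothetical values (the actual decay_steps and decay_rate come from hparams): with decay_steps = 260000 and decay_rate = 0.01, the rate starts at 1e-3 when the shifted global step is 0 (i.e. step 50k) and reaches 1e-3 × 0.01^(260000/260000) = 1e-5 at step 310k; outside that range the tf.minimum/tf.maximum clipping pins it to the two endpoint values.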
ClariNet
ClariNet's improvements focus on WaveNet. Paper: ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech
Contributions of the paper:
- Simplifies parallel WaveNet's KL objective with a single Gaussian, improving the distillation algorithm so that it is simpler and more stable
- Connects Tacotron (the feature prediction network) and WaveNet through a Bridge-net, making the system fully end-to-end
Prerequisites
- KL divergence
\[\begin{align*} KL(P\|Q)&=\int_{-\infty}^{+\infty}p(x)\log\frac{p(x)}{q(x)}dx\\ &=\int_{-\infty}^{+\infty}p(x)\log p(x)dx-\int_{-\infty}^{+\infty}p(x)\log q(x)dx\\ &=-H(P)+H(P,Q) \end{align*} \]

Here \(-H(P)\) is a fixed quantity and \(H(P,Q)\) is the cross entropy, so minimizing \(KL(P\|Q)\) is equivalent to minimizing the cross entropy. A divergence measures discrepancy: the smaller it is, the more similar the two distributions are.
The KL divergence is non-negative, and it is an asymmetric measure, i.e. \(KL(P\|Q)\neq KL(Q\|P)\).
Suppose we want to fit a complex distribution \(p\) with a simple distribution \(q\). Optimizing \(KL(q\|p)\) drives \(q(x)\) to 0 wherever \(p(x)\) is 0, since otherwise \(\frac{q(x)}{p(x)}\) blows up (the two right-hand panels of the figure above); optimizing \(KL(p\|q)\) instead tries to avoid fitting regions where \(p(x)\) is nonzero with \(q(x)=0\) (the left-most panel). Consequently, the approximation obtained with \(KL(q\|p)\) tends to be narrow, because it prefers \(q(x)\) to be 0 in many places (the two right-hand panels), whereas the approximation obtained with \(KL(p\|q)\) tends to be wide, because it tries to cover every region where \(p(x)\) has mass (the left-most panel).
Since \(KL(q\|p)\) at least locks onto one of the modes, whereas the fit produced by \(KL(p\|q)\) may place its highest density somewhere meaningless, \(KL(q\|p)\) is usually the better choice when approximating a complex \(p\) with a simple \(q\).
- Variational inference
The core idea of variational inference is to approximate a complicated, hard-to-compute distribution with one of simple form. For example, within the space of exponential-family distributions we can pick the member that most resembles the target distribution, which makes computation far more convenient. The "variational" part refers to searching a function space for a function that satisfies certain conditions or constraints.
See: PRML讀書會第十章 Approximate Inference (approximate inference, variational inference, KL divergence, mean field).
The original WaveNet can synthesize waveforms directly from essentially arbitrary conditioning input. Its structure is purely convolutional, with no recurrent layers, so training parallelizes at the sample level and is very efficient; inference, however, is autoregressive, with every sample depending on the previous samples, and is therefore very slow.
Two networks are built here, a "student" and a "teacher". The student network usually has a simpler structure than the teacher, but it is expected to absorb the essence of the teacher despite that simplicity; this process of passing knowledge between networks is called "distillation". In general, distillation makes the student's output distribution approach the teacher's as closely as possible, typically by minimizing the KL divergence between the two output distributions.
The original WaveNet's output layer is a classifier producing discrete sample values from 0 to 255, which is ill-suited to teacher-student transfer, so the parallel WaveNet authors changed the output to a continuous mixture of logistics (MoL) distribution, while the student network's output distribution is a logistic inverse autoregressive flow; although the student is also described as "auto-regressive", its synthesis is in fact parallel. Because parallel WaveNet's distillation involves many logistic components, the KL divergence between the student's and teacher's output distributions is hard to compute and has to be approximated with Monte Carlo methods, so the distillation process is prone to numerical instability.
ClariNet simplifies the teacher's output distribution to a single Gaussian, and shows experimentally that this does not hurt synthesis quality. The student's output distribution becomes a Gaussian inverse autoregressive flow (Gaussian IAF). The benefit of this simplification is that the KL divergence between the two output distributions has an explicit closed-form expression, so distillation can be carried out with ordinary gradient descent.
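For reference, the standard closed-form KL divergence between two univariate Gaussians \(q=\mathcal{N}(\mu_q,\sigma_q^2)\) and \(p=\mathcal{N}(\mu_p,\sigma_p^2)\) is exactly the kind of explicit expression this enables:
\[ KL(q\|p)=\log\frac{\sigma_p}{\sigma_q}+\frac{\sigma_q^2+(\mu_q-\mu_p)^2}{2\sigma_p^2}-\frac{1}{2} \]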
The paper uses a modified KL divergence to measure the discrepancy between the student's and the teacher's output distributions: