Official blog post: WaveNet: A Generative Model for Raw Audio
Paper link: paper
Abstract
WaveNet is a probabilistic, autoregressive generative model: the predictive distribution for each audio sample is conditioned on all previous samples. Applied to TTS it achieves state-of-the-art performance, with listeners rating it as more natural-sounding than the best parametric and concatenative systems. The same model can also generate music, and shows promise as a discriminative model for phoneme recognition.
Introduction
Inspired by neural autoregressive generative models of images [1][2], this work generates wideband raw audio waveforms; the main challenge is that audio is sampled at very high rates, 16,000 samples per second or more.
Contributions
1. We show that WaveNets can generate raw speech signals with subjective naturalness never before reported in the field of text-to-speech (TTS), as assessed by human raters.
2. In order to deal with long-range temporal dependencies needed for raw audio generation, we develop new architectures based on dilated causal convolutions, which exhibit very large receptive fields.
3. We show that when conditioned on a speaker identity, a single model can be used to generate different voices.
4. The same architecture shows strong results when tested on a small speech recognition dataset, and is promising when used to generate other audio modalities such as music.
WaveNet
The probabilistic model factorises the joint probability of a waveform $\mathbf{x} = \{x_1, \ldots, x_T\}$ as a product of conditional probabilities:

$$p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})$$

Each audio sample $x_t$ is therefore conditioned on the samples at all previous timesteps, and the conditional probability distribution is modelled by a stack of convolutional layers. The model outputs a categorical distribution over the next value $x_t$ with a softmax layer and is optimised to maximise the log-likelihood of the data w.r.t. the parameters. Because log-likelihoods are tractable, hyperparameters can be tuned on a validation set and it is easy to measure whether the model is overfitting or underfitting.
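Because of this factorisation, generation is inherently sequential: each new sample is drawn from the categorical distribution predicted from all samples so far. Below is a minimal sampling sketch in PyTorch (the framework and the `model` interface are illustrative assumptions, not the paper's API):

```python
import torch

@torch.no_grad()
def sample(model, n_samples, n_classes=256):
    # Hypothetical interface: `model` maps (batch, t) integer codes to
    # (batch, n_classes, t) logits, one categorical distribution per timestep.
    x = torch.zeros(1, 1, dtype=torch.long)            # seed with one "silence" code
    for _ in range(n_samples):
        logits = model(x)[:, :, -1]                    # p(x_t | x_1, ..., x_{t-1})
        probs = torch.softmax(logits, dim=-1)
        nxt = torch.multinomial(probs, num_samples=1)  # draw the next sample
        x = torch.cat([x, nxt], dim=1)                 # append and condition on it
    return x
```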
- DILATED CAUSAL CONVOLUTIONS
 

The main ingredient of WaveNet is the causal convolution. Because models with causal convolutions have no recurrent connections, they are typically faster to train than RNNs, especially when applied to very long sequences. One of the problems of causal convolutions is that they require many layers, or large filters, to increase the receptive field. For example, in Fig. 2 the receptive field is only 5 (= #layers + filter length - 1).
A dilated convolution (also called à trous, or convolution with holes) is a convolution where the filter is applied over an area larger than its length by skipping input values with a certain step. Stacked dilated convolutions enable networks to have very large receptive fields with just a few layers, while preserving the input resolution throughout the network as well as computational efficiency.
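To make this concrete, here is a minimal sketch of a dilated causal convolution, assuming PyTorch (the paper specifies no framework, and `CausalDilatedConv1d` is a hypothetical helper name): causality comes from padding only on the left, and the receptive field grows exponentially as the dilations double.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv1d(nn.Module):
    """Causal 1-D convolution: the output at time t only sees inputs <= t.
    Causality is enforced by left-padding with (kernel_size - 1) * dilation zeros."""
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                          # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.left_pad, 0)))

# Dilations 1, 2, 4, ..., 512: the receptive field roughly doubles per layer,
# reaching 1024 samples after 10 layers while the time resolution is preserved.
stack = nn.Sequential(*[CausalDilatedConv1d(32, dilation=2 ** i) for i in range(10)])
out = stack(torch.randn(1, 32, 16000))             # shape stays (1, 32, 16000)
```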

- SOFTMAX DISTRIBUTIONS
 
A softmax distribution tends to work better for modelling the conditional distributions over individual audio samples. Because raw audio is typically stored as a sequence of 16-bit integer values (one per timestep), a softmax layer would need to output 65,536 probabilities per timestep to model all possible values. Instead, the $\mu$-law companding transform proposed in [3] is applied to the data, and the result is quantised to 256 possible values:

$$f(x_t) = \operatorname{sign}(x_t)\,\frac{\ln(1 + \mu\lvert x_t\rvert)}{\ln(1 + \mu)}$$

where $-1 < x_t < 1$ and $\mu = 255$. Audio reconstructed after this non-linear quantisation sounds very similar to the original.
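A small NumPy sketch of the $\mu$-law companding and its inverse (the exact rounding used for quantisation is an assumption; the paper only gives the companding formula):

```python
import numpy as np

MU = 255  # 8-bit mu-law, as in ITU-T G.711 [3]

def mu_law_encode(x, mu=MU):
    """Compand a waveform in (-1, 1) and quantise it to integers in [0, mu]."""
    companded = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.floor((companded + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(codes, mu=MU):
    """Map the integer codes back to an approximate waveform in (-1, 1)."""
    companded = 2 * codes.astype(np.float64) / mu - 1
    return np.sign(companded) * ((1 + mu) ** np.abs(companded) - 1) / mu

wave = 0.9 * np.sin(np.linspace(0, 8 * np.pi, 16000))  # dummy 1-second signal
codes = mu_law_encode(wave)    # 256 classes: the softmax targets during training
recon = mu_law_decode(codes)   # sounds very close to the original
```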
- GATED ACTIVATION UNITS
 
WaveNet uses the same gated activation unit as [1]:

$$\mathbf{z} = \tanh\left(W_{f,k} * \mathbf{x}\right) \odot \sigma\left(W_{g,k} * \mathbf{x}\right)$$

where $*$ denotes a convolution operator, $\odot$ an element-wise multiplication, $\sigma(\cdot)$ the sigmoid function, $k$ the layer index, and $f$ and $g$ the filter and gate, respectively.
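A sketch of the gated unit, again assuming PyTorch; computing the filter and gate branches as one convolution with twice the channels is a common implementation choice, not something the paper prescribes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedUnit(nn.Module):
    """z = tanh(W_f * x) * sigmoid(W_g * x), implemented as one causal dilated
    convolution with 2*channels outputs, split into filter and gate halves."""
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size, dilation=dilation)

    def forward(self, x):                          # x: (batch, channels, time)
        out = self.conv(F.pad(x, (self.left_pad, 0)))
        filt, gate = out.chunk(2, dim=1)           # W_{f,k} and W_{g,k} branches
        return torch.tanh(filt) * torch.sigmoid(gate)
```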
- RESIDUAL AND SKIP CONNECTIONS
 
Both residual and skip connections are used throughout the network to speed up convergence and to enable the training of much deeper models.
  
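A sketch of one residual block, assuming PyTorch (it folds in the gated unit from the previous section; channel widths and the 1x1 projection sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """One WaveNet layer: a gated causal dilated convolution whose output feeds,
    through 1x1 convolutions, (a) a residual connection added back to the input,
    which keeps gradients flowing through deep stacks, and (b) a skip connection
    collected across all layers."""
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.dilated = nn.Conv1d(channels, 2 * channels, kernel_size, dilation=dilation)
        self.res_1x1 = nn.Conv1d(channels, channels, 1)
        self.skip_1x1 = nn.Conv1d(channels, channels, 1)

    def forward(self, x):                          # x: (batch, channels, time)
        filt, gate = self.dilated(F.pad(x, (self.left_pad, 0))).chunk(2, dim=1)
        z = torch.tanh(filt) * torch.sigmoid(gate)
        return x + self.res_1x1(z), self.skip_1x1(z)  # (to next layer, to skip sum)

# The skip outputs of all layers are summed, passed through further nonlinearities
# and 1x1 layers, and finally through the softmax that predicts the next sample.
```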
- CONDITIONAL WAVENETS
 

Given an additional input $\mathbf{h}$, WaveNet can model the conditional distribution $p(\mathbf{x} \mid \mathbf{h})$ of the audio. For example, in a multi-speaker setting we can choose the speaker by feeding the speaker identity to the model as an extra input. Similarly, for TTS we need to feed information about the text as an extra input.
Global conditioning is characterised by a single latent representation $\mathbf{h}$ that influences the output distribution across all timesteps:

$$\mathbf{z} = \tanh\left(W_{f,k} * \mathbf{x} + V_{f,k}^{T}\mathbf{h}\right) \odot \sigma\left(W_{g,k} * \mathbf{x} + V_{g,k}^{T}\mathbf{h}\right)$$

where $V_{f,k}^{T}\mathbf{h}$ is broadcast over the time dimension.
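A sketch of global conditioning, assuming PyTorch and a learned embedding for $\mathbf{h}$: the projection $V^{T}\mathbf{h}$ is a single vector per utterance, added at every timestep inside both nonlinearities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GloballyConditionedGate(nn.Module):
    """Gated unit with a global condition h (e.g. a speaker embedding):
    V^T h is added to both the filter and the gate, broadcast over all timesteps."""
    def __init__(self, channels, h_dim, kernel_size=2, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size, dilation=dilation)
        self.cond = nn.Linear(h_dim, 2 * channels)   # V_{f,k} and V_{g,k} stacked

    def forward(self, x, h):                  # x: (batch, C, T), h: (batch, h_dim)
        out = self.conv(F.pad(x, (self.left_pad, 0)))
        out = out + self.cond(h).unsqueeze(-1)       # same bias at every timestep
        filt, gate = out.chunk(2, dim=1)
        return torch.tanh(filt) * torch.sigmoid(gate)
```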
The dilated convolutions used in the model greatly enlarge the receptive field, which makes the approach very useful for modelling sequential data.
[1] van den Oord, Aäron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. In ICML, 2016.
[2] Józefowicz, Rafal, Vinyals, Oriol, Schuster, Mike, Shazeer, Noam, and Wu, Yonghui. Exploring the limits of language modeling. arXiv:1602.02410, 2016.
[3] ITU-T. Recommendation G.711: Pulse Code Modulation (PCM) of voice frequencies, 1988.
