Official blog post: WaveNet: A Generative Model for Raw Audio
Paper link: paper
Abstract
WaveNet is a probabilistic, autoregressive generative model: the predictive distribution for each audio sample is conditioned on all previous samples. Applied to TTS it achieves state-of-the-art performance, with listeners rating it as more natural-sounding than the best parametric and concatenative systems. The same model can also generate music, and shows promise as a discriminative model for phoneme recognition.
Introduction
Inspired by neural autoregressive generative models of images [1][2], this work generates wideband raw audio waveforms; the main challenge is that audio is sampled at very high rates, 16,000 samples per second or more.
Contributions
1. We show that WaveNets can generate raw speech signals with subjective naturalness never before reported in the field of text-to-speech (TTS), as assessed by human raters.
2. In order to deal with long-range temporal dependencies needed for raw audio generation, we develop new architectures based on dilated causal convolutions, which exhibit very large receptive fields.
3. We show that when conditioned on a speaker identity, a single model can be used to generate different voices.
4. The same architecture shows strong results when tested on a small speech recognition dataset, and is promising when used to generate other audio modalities such as music.
WaveNet
The probabilistic model factorises the joint probability of a waveform $\mathbf{x} = \{x_1, \ldots, x_T\}$ as a product of conditional probabilities:

$$p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})$$

Each audio sample $x_t$ is therefore conditioned on the samples at all previous timesteps, and the conditional probability distribution is modelled by a stack of convolutional layers. The model outputs a categorical distribution over the next value $x_t$ with a softmax layer and is optimised to maximise the log-likelihood of the data w.r.t. the parameters. Because log-likelihoods are tractable, hyperparameters can be tuned on a validation set and it is easy to measure whether the model is overfitting or underfitting.
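Because of this factorisation, generation is inherently sequential: each new sample is drawn from the categorical distribution predicted from all samples so far. Below is a minimal sampling sketch in PyTorch (the framework and the `model` interface are illustrative assumptions, not the paper's API):

```python
import torch

@torch.no_grad()
def sample(model, n_samples, n_classes=256):
    # Hypothetical interface: `model` maps (batch, t) integer codes to
    # (batch, n_classes, t) logits, one categorical distribution per timestep.
    x = torch.zeros(1, 1, dtype=torch.long)            # seed with one "silence" code
    for _ in range(n_samples):
        logits = model(x)[:, :, -1]                    # p(x_t | x_1, ..., x_{t-1})
        probs = torch.softmax(logits, dim=-1)
        nxt = torch.multinomial(probs, num_samples=1)  # draw the next sample
        x = torch.cat([x, nxt], dim=1)                 # append and condition on it
    return x
```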
- DILATED CAUSAL CONVOLUTIONS
 

The main ingredient of WaveNet is the causal convolution. Because models with causal convolutions have no recurrent connections, they are typically faster to train than RNNs, especially when applied to very long sequences. One of the problems of causal convolutions is that they require many layers, or large filters, to increase the receptive field. For example, in Fig. 2 the receptive field is only 5 (= #layers + filter length - 1).
A dilated convolution (also called à trous, or convolution with holes) is a convolution where the filter is applied over an area larger than its length by skipping input values with a certain step. Stacked dilated convolutions enable networks to have very large receptive fields with just a few layers, while preserving the input resolution throughout the network as well as computational efficiency.
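To make this concrete, here is a minimal sketch of a dilated causal convolution, assuming PyTorch (the paper specifies no framework, and `CausalDilatedConv1d` is a hypothetical helper name): causality comes from padding only on the left, and the receptive field grows exponentially as the dilations double.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv1d(nn.Module):
    """Causal 1-D convolution: the output at time t only sees inputs <= t.
    Causality is enforced by left-padding with (kernel_size - 1) * dilation zeros."""
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                          # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.left_pad, 0)))

# Dilations 1, 2, 4, ..., 512: the receptive field roughly doubles per layer,
# reaching 1024 samples after 10 layers while the time resolution is preserved.
stack = nn.Sequential(*[CausalDilatedConv1d(32, dilation=2 ** i) for i in range(10)])
out = stack(torch.randn(1, 32, 16000))             # shape stays (1, 32, 16000)
```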

- SOFTMAX DISTRIBUTIONS
 
A softmax distribution tends to work better for modelling the conditional distributions over individual audio samples. Because raw audio is typically stored as a sequence of 16-bit integer values (one per timestep), a softmax layer would need to output 65,536 probabilities per timestep to model all possible values. Instead, the $\mu$-law companding transform proposed in [3] is applied to the data, and the result is quantised to 256 possible values:

$$f(x_t) = \operatorname{sign}(x_t)\,\frac{\ln(1 + \mu\lvert x_t\rvert)}{\ln(1 + \mu)}$$

where $-1 < x_t < 1$ and $\mu = 255$. Audio reconstructed after this non-linear quantisation sounds very similar to the original.
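A small NumPy sketch of the $\mu$-law companding and its inverse (the exact rounding used for quantisation is an assumption; the paper only gives the companding formula):

```python
import numpy as np

MU = 255  # 8-bit mu-law, as in ITU-T G.711 [3]

def mu_law_encode(x, mu=MU):
    """Compand a waveform in (-1, 1) and quantise it to integers in [0, mu]."""
    companded = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.floor((companded + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(codes, mu=MU):
    """Map the integer codes back to an approximate waveform in (-1, 1)."""
    companded = 2 * codes.astype(np.float64) / mu - 1
    return np.sign(companded) * ((1 + mu) ** np.abs(companded) - 1) / mu

wave = 0.9 * np.sin(np.linspace(0, 8 * np.pi, 16000))  # dummy 1-second signal
codes = mu_law_encode(wave)    # 256 classes: the softmax targets during training
recon = mu_law_decode(codes)   # sounds very close to the original
```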
- GATED ACTIVATION UNITS
 
WaveNet uses the same gated activation unit as [1]:

$$\mathbf{z} = \tanh\left(W_{f,k} * \mathbf{x}\right) \odot \sigma\left(W_{g,k} * \mathbf{x}\right)$$

where $*$ denotes a convolution operator, $\odot$ an element-wise multiplication, $\sigma(\cdot)$ the sigmoid function, $k$ the layer index, and $f$ and $g$ the filter and gate, respectively.
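A sketch of the gated unit, again assuming PyTorch; computing the filter and gate branches as one convolution with twice the channels is a common implementation choice, not something the paper prescribes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedUnit(nn.Module):
    """z = tanh(W_f * x) * sigmoid(W_g * x), implemented as one causal dilated
    convolution with 2*channels outputs, split into filter and gate halves."""
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size, dilation=dilation)

    def forward(self, x):                          # x: (batch, channels, time)
        out = self.conv(F.pad(x, (self.left_pad, 0)))
        filt, gate = out.chunk(2, dim=1)           # W_{f,k} and W_{g,k} branches
        return torch.tanh(filt) * torch.sigmoid(gate)
```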
- RESIDUAL AND SKIP CONNECTIONS
 
Both residual and skip connections are used throughout the network to speed up convergence and to enable the training of much deeper models.
  
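A sketch of one residual block, assuming PyTorch (it folds in the gated unit from the previous section; channel widths and the 1x1 projection sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """One WaveNet layer: a gated causal dilated convolution whose output feeds,
    through 1x1 convolutions, (a) a residual connection added back to the input,
    which keeps gradients flowing through deep stacks, and (b) a skip connection
    collected across all layers."""
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.dilated = nn.Conv1d(channels, 2 * channels, kernel_size, dilation=dilation)
        self.res_1x1 = nn.Conv1d(channels, channels, 1)
        self.skip_1x1 = nn.Conv1d(channels, channels, 1)

    def forward(self, x):                          # x: (batch, channels, time)
        filt, gate = self.dilated(F.pad(x, (self.left_pad, 0))).chunk(2, dim=1)
        z = torch.tanh(filt) * torch.sigmoid(gate)
        return x + self.res_1x1(z), self.skip_1x1(z)  # (to next layer, to skip sum)

# The skip outputs of all layers are summed, passed through further nonlinearities
# and 1x1 layers, and finally through the softmax that predicts the next sample.
```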
- CONDITIONAL WAVENETS
 

Given an additional input $\mathbf{h}$, WaveNet can model the conditional distribution $p(\mathbf{x} \mid \mathbf{h})$ of the audio. For example, in a multi-speaker setting we can choose the speaker by feeding the speaker identity to the model as an extra input. Similarly, for TTS we need to feed information about the text as an extra input.
Global conditioning is characterised by a single latent representation $\mathbf{h}$ that influences the output distribution across all timesteps:

$$\mathbf{z} = \tanh\left(W_{f,k} * \mathbf{x} + V_{f,k}^{T}\mathbf{h}\right) \odot \sigma\left(W_{g,k} * \mathbf{x} + V_{g,k}^{T}\mathbf{h}\right)$$

where $V_{f,k}^{T}\mathbf{h}$ is broadcast over the time dimension.
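A sketch of global conditioning, assuming PyTorch and a learned embedding for $\mathbf{h}$: the projection $V^{T}\mathbf{h}$ is a single vector per utterance, added at every timestep inside both nonlinearities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GloballyConditionedGate(nn.Module):
    """Gated unit with a global condition h (e.g. a speaker embedding):
    V^T h is added to both the filter and the gate, broadcast over all timesteps."""
    def __init__(self, channels, h_dim, kernel_size=2, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size, dilation=dilation)
        self.cond = nn.Linear(h_dim, 2 * channels)   # V_{f,k} and V_{g,k} stacked

    def forward(self, x, h):                  # x: (batch, C, T), h: (batch, h_dim)
        out = self.conv(F.pad(x, (self.left_pad, 0)))
        out = out + self.cond(h).unsqueeze(-1)       # same bias at every timestep
        filt, gate = out.chunk(2, dim=1)
        return torch.tanh(filt) * torch.sigmoid(gate)
```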
The dilated convolutions used in the model greatly enlarge the receptive field, which makes the approach very useful for modelling sequential data.
[1] van den Oord, Aäron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. In ICML, 2016.
[2] Józefowicz, Rafal, Vinyals, Oriol, Schuster, Mike, Shazeer, Noam, and Wu, Yonghui. Exploring the limits of language modeling. arXiv:1602.02410, 2016.
[3] ITU-T. Recommendation G.711: Pulse Code Modulation (PCM) of voice frequencies, 1988.
